Farama-Foundation / Arcade-Learning-Environment

The Arcade Learning Environment (ALE) -- a platform for AI research.
GNU General Public License v2.0

Benchmarks? #442

Closed slerman12 closed 2 years ago

slerman12 commented 2 years ago

Hi, thanks so much for maintaining this repo. I just want to ask if there are any existing benchmark datasets for v5? I know rliable is a great code base for standardized benchmarking in RL, but I'm not sure if they have v5 benchmarks using the best practices recommended in Atari Revisited.

agarwl commented 2 years ago

Thanks @mgbellemare for pointing me to this.

I'm not sure about v5, but rliable reported results for ALE agents with sticky actions, implemented using Dopamine (which follows the recommendations of Atari Revisited), in bit.ly/statistical_precipice_colab.
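For context, the headline aggregate rliable recommends is the interquartile mean (IQM) with bootstrap confidence intervals. Here is a minimal NumPy sketch of that computation; the score matrix and its values are invented for illustration, and rliable's own `get_interval_estimates` does this more carefully (e.g. stratified resampling per game):

```python
import numpy as np

def iqm(scores: np.ndarray) -> float:
    """Interquartile mean: the mean of the middle 50% of all scores."""
    flat = np.sort(scores, axis=None)
    cut = len(flat) // 4
    return float(flat[cut:len(flat) - cut].mean())

def bootstrap_ci(scores: np.ndarray, n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> np.ndarray:
    """Percentile bootstrap CI for the IQM, resampling runs (rows)."""
    rng = np.random.default_rng(seed)
    n_runs = scores.shape[0]
    stats = [iqm(scores[rng.integers(0, n_runs, size=n_runs)])
             for _ in range(n_boot)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Hypothetical human-normalized scores: 5 runs x 3 games.
scores = np.array([
    [0.1, 1.2, 0.8],
    [0.2, 1.1, 0.9],
    [0.3, 1.3, 0.7],
    [0.2, 1.0, 1.1],
    [0.1, 1.4, 0.6],
])
print(f"IQM: {iqm(scores):.3f}")
lo, hi = bootstrap_ci(scores)
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

The IQM discards the top and bottom quartiles before averaging, which makes it far less sensitive to outlier runs than the mean while wasting less data than the median.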

Additionally, you can download the individual scores for 8 agents: DQN (Nature), REM, DQN (Adam), QR-DQN, IQN, Rainbow, DreamerV2 and M-IQN from https://console.cloud.google.com/storage/browser/rl-benchmark-data/ALE. See the linked colab above for details.
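For anyone scripting the download: the bucket above is public and can be mirrored with `gsutil`. A small sketch that only builds the per-agent `gs://` URIs and the copy commands; the per-agent directory layout inside the bucket is an assumption here, so verify the actual file names against the linked colab:

```python
# Sketch: construct gsutil commands for the public bucket mentioned above.
# The per-agent subdirectory layout is an assumption -- check the linked
# colab for the real paths before relying on this.
BUCKET = "gs://rl-benchmark-data/ALE"
AGENTS = ["DQN (Nature)", "REM", "DQN (Adam)", "QR-DQN",
          "IQN", "Rainbow", "DreamerV2", "M-IQN"]

def bucket_uri(agent: str) -> str:
    """gs:// URI for one agent's scores (hypothetical layout)."""
    return f"{BUCKET}/{agent}"

def download_cmd(agent: str, dest: str = "./data") -> list:
    """gsutil invocation to mirror one agent's directory locally."""
    return ["gsutil", "-m", "cp", "-r", bucket_uri(agent), dest]

for agent in AGENTS:
    print(download_cmd(agent))
```

Passing the list straight to `subprocess.run` avoids shell-quoting issues with agent names that contain spaces or parentheses.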

Please let me know if there's something else you want to know.

slerman12 commented 2 years ago

One thing I noticed: the random-agent baseline for Atari Krull seems to be heavily misreported. The random agent supposedly achieves a score of ~1600 (just short of human-level), but judging from both my own experiments and the reported results for DER, that seems too high. I'm not sure, though; I've only looked at the score distributions for DER and my own runs.

agarwl commented 2 years ago

Regarding the random scores: these are the scores obtained by a random agent on the Atari games, as reported in the original DQN (Nature) paper (see Table 2), and they are used as-is by almost all existing publications.

That said, I think these scores are based on deterministic Atari games without sticky actions. The random agent scores with sticky actions are likely to be lower. Let me know if you have any other questions.
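To make the implication concrete: published human-normalized scores divide out exactly these baselines, so an inflated random score compresses the resulting normalized score. A toy example in which the agent and human values are invented; only the ~1600 random score for Krull comes from the discussion above:

```python
def human_normalized(agent: float, random: float, human: float) -> float:
    """Standard human-normalized score: (agent - random) / (human - random)."""
    return (agent - random) / (human - random)

# Hypothetical raw scores on a Krull-like game (agent/human values invented).
agent_score, human_score = 2000.0, 2400.0

# With the ~1600 random baseline discussed above (no sticky actions):
print(human_normalized(agent_score, 1600.0, human_score))  # 0.5

# If the sticky-actions random baseline were lower, say 500 (made up),
# the same raw score would normalize substantially higher:
print(human_normalized(agent_score, 500.0, human_score))
```

This is why pairing sticky-actions agent scores with deterministic-era random baselines can understate (or, if a baseline is misreported high, badly distort) an agent's normalized performance.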