araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License

Cannot reproduce DQN Breakout baseline #49

Closed sytelus closed 4 years ago

sytelus commented 4 years ago

I'm excited by the "stable" promise of stable-baselines, but currently I'm not able to reproduce the DQN results for Breakout. It is well known that DQN should reach a score of 300+ on Breakout, and this can be confirmed from monitor.csv in benchmark.zip in this repo. Coincidentally, OpenAI Baselines is also broken for DQN/Breakout. I suspect their bug has also impacted stable-baselines.

Here are my results:

python train.py --algo dqn --env BreakoutNoFrameskip-v4

Tensorboard curve: image

Last 3 stdout log:

--------------------------------------
| % time spent exploring  | 1        |
| episodes                | 56300    |
| mean 100 episode reward | 8.9      |
| steps                   | 9940866  |
--------------------------------------
--------------------------------------
| % time spent exploring  | 1        |
| episodes                | 56400    |
| mean 100 episode reward | 8.4      |
| steps                   | 9967305  |
--------------------------------------
--------------------------------------
| % time spent exploring  | 1        |
| episodes                | 56500    |
| mean 100 episode reward | 8.5      |
| steps                   | 9993810  |
--------------------------------------

As we can see, training does not converge: the reward stays stuck around 8.5, occasionally spiking up to 22, still well below the expected 300+.

araffin commented 4 years ago

Hello. Because of the Atari preprocessing, the reported reward is not the real score. Did you evaluate the agent using the enjoy.py script? (cf. here)

PS: please also fill in the issue template completely (notably the package versions and OS)

araffin commented 4 years ago

For reference, here is the current result of the trained agent using 5000 test steps:

python enjoy.py --algo dqn --env BreakoutNoFrameskip-v4 --no-render -n 5000
Using Atari wrapper
Episode Reward: 4.00
Episode Length 154
Episode Reward: 6.00
Episode Length 203
Episode Reward: 10.00
Episode Length 269
Episode Reward: 13.00
Episode Length 331
Atari Episode Score: 65.00
Atari Episode Length 1051
Episode Reward: 2.00
Episode Length 68
Episode Reward: 41.00
Episode Length 1045
Episode Reward: 2.00
Episode Length 72
Episode Reward: 40.00
Episode Length 575
Episode Reward: 0.00
Episode Length 18
Atari Episode Score: 308.00
Atari Episode Length 1759
Episode Reward: 0.00
Episode Length 20
Episode Reward: 41.00
Episode Length 1045
Episode Reward: 2.00
Episode Length 72
Episode Reward: 40.00
Episode Length 575
Episode Reward: 0.00
Episode Length 18
Atari Episode Score: 308.00
Atari Episode Length 1759
Episode Reward: 0.00
Episode Length 20
Mean reward: 13.40

As you can see, the mean reward is around 10, but the corresponding Atari score is much higher.

EDIT: I reactivated the episode reward print to generate this
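To make the clipping point concrete, here is a minimal sketch (not code from this repo; it assumes an older gym with the 4-tuple step API, as used with Stable Baselines 2) comparing the raw Atari score with the clipped reward the training loop actually sees:

import gym
import numpy as np

# Raw Breakout env, no preprocessing wrappers
env = gym.make("BreakoutNoFrameskip-v4")
env.reset()

raw_score, clipped_return, done = 0.0, 0.0, False
while not done:
    _, reward, done, _ = env.step(env.action_space.sample())
    raw_score += reward                # what "Atari Episode Score" reports (unclipped points)
    clipped_return += np.sign(reward)  # what the agent sees after reward clipping

print("raw score:", raw_score, "| clipped return:", clipped_return)

In Breakout, bricks in higher rows are worth more points (up to 7), but after clipping every brick only counts as +1, which is why a printed "Episode Reward" of ~40 can correspond to a much higher score.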

araffin commented 4 years ago

More explanation: https://github.com/openai/baselines/issues/667

EDIT: If you look at the learning curve here, the currently trained DQN agent matches the previous performance (mean score around 200)

sytelus commented 4 years ago

I'm training from scratch using the current code available in stable-baselines as well as rl-baselines-zoo. With enjoy.py I get this:

Atari Episode Score: 88.00
Atari Episode Length 1339
Atari Episode Score: 26.00
Atari Episode Length 903
Atari Episode Score: 239.00
Atari Episode Length 1408
Atari Episode Score: 88.00
Atari Episode Length 1339
Atari Episode Score: 88.00
Atari Episode Length 1339
Atari Episode Score: 239.00
Atari Episode Length 1408
Atari Episode Score: 88.00
Atari Episode Length 1339
Atari Episode Score: 26.00
Atari Episode Length 903
Atari Episode Score: 26.00
Atari Episode Length 903
Atari Episode Score: 239.00
Atari Episode Length 1408
Atari Episode Score: 80.00
Atari Episode Length 1269
Atari Episode Score: 80.00
Atari Episode Length 1269
Atari Episode Score: 26.00
Atari Episode Length 903
Atari Episode Score: 80.00
Atari Episode Length 1269
Atari Episode Score: 239.00
Atari Episode Length 1408
Atari Episode Score: 88.00
Atari Episode Length 1339
Atari Episode Score: 80.00
Atari Episode Length 1269
Atari Episode Score: 80.00
Atari Episode Length 1269
Atari Episode Score: 239.00
Atari Episode Length 1408
Atari Episode Score: 80.00
Atari Episode Length 1269
Atari Episode Score: 80.00
Atari Episode Length 1269
Atari Episode Score: 80.00
Atari Episode Length 1269
Atari Episode Score: 239.00
Atari Episode Length 1408

So the score mostly hovers around 80, occasionally hitting 239. More concerning, however, is that there is no apparent convergence, as shown by the Tensorboard graph. As a sanity check, I also trained a model for Pong, which shows good convergence:

image

I suspect something is broken and train.py no longer produces a model that reproduces the high scores for Breakout. I've also tested OpenAI Baselines, and its monitor.csv, which is supposed to contain the raw score, is similarly stuck in the low 30s.

sytelus commented 4 years ago

Digging more, the average of the episode scores from enjoy.py is 113.8, which is still not out of line with 131.4 from OpenAI and 123 from RLlib.

So the question remains: is the above training curve expected? It's nothing like the Pong curve I posted above. Even with smoothing, there is not much of a convergence pattern in that curve.

araffin commented 4 years ago

> More concerning, however, is that there is no apparent convergence, as shown by the Tensorboard graph. So the question remains: is the above training curve expected?

As mentioned before, the episode reward does not represent the true score of the agent (you can take a look at the wrappers here). One episode corresponds to one life, so in Breakout, compared to Pong, you will lose a life more often, especially when taking random actions (the policy stays epsilon-greedy during training); that's why the learning curve is much more chaotic. If you want to monitor the true training reward, you will have to modify the DQN code a bit to include the information from the Monitor wrapper (as is done here for SAC). We would appreciate a PR if you do so ;)
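If it helps, here is a rough sketch (assuming the wrappers from stable_baselines.common.atari_wrappers and stable_baselines.bench.Monitor; this is not the zoo's actual code) of where that true reward lives: the Monitor is applied before EpisodicLifeEnv, so info["episode"] only shows up once the real game (all lives) is over:

from stable_baselines.bench import Monitor
from stable_baselines.common.atari_wrappers import make_atari, wrap_deepmind

env = make_atari("BreakoutNoFrameskip-v4")  # NoopReset + MaxAndSkip
env = Monitor(env, filename=None)           # records the raw, per-game reward
env = wrap_deepmind(env)                     # EpisodicLifeEnv, FireReset, WarpFrame, ClipReward

obs = env.reset()
while True:
    obs, reward, done, info = env.step(env.action_space.sample())
    # `done` is True at the end of every *life* (EpisodicLifeEnv), but the Monitor
    # only fills info["episode"] once the full game (all lives) has ended.
    episode_info = info.get("episode")
    if episode_info is not None:
        print("true game score:", episode_info["r"], "length:", episode_info["l"])
        break
    if done:
        obs = env.reset()

The change suggested above basically amounts to looking for that "episode" key in the info dict inside DQN's training loop.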

To monitor it in Tensorboard, you just have to follow the doc; as you mentioned, a call to logger.configure() is missing.
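Something along these lines should work (a hedged sketch using SB2's bundled logger module; adapt the folder and formats to your setup):

from stable_baselines import logger

# Write the usual DQN training diagnostics to stdout, csv and Tensorboard
logger.configure(folder="logs/dqn_breakout", format_strs=["stdout", "csv", "tensorboard"])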

Again, monitoring the training reward is only a proxy for the true performance (more on that below ;) )

> Digging more, the average of the episode scores from enjoy.py is 113.8, which is still not out of line with 131.4 from OpenAI and 123 from RLlib.

First, I recommend reading "How many random seeds should I use?" by @ccolas and "Deep RL that Matters".

I assume you trained using only one random seed? Then how many test episodes/test steps did you use? What is the variance of the results? If you look at the trained agent in the repo, it has a mean test score of 191 but with a variance of 91 over 150k test steps.
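To make that concrete, here is a quick back-of-the-envelope check (just numpy, using the 23 game scores from your enjoy.py output above):

import numpy as np

# Per-game Atari scores copied from the enjoy.py output above
scores = np.array([88, 26, 239, 88, 88, 239, 88, 26, 26, 239, 80, 80,
                   26, 80, 239, 88, 80, 80, 239, 80, 80, 80, 239])

print("mean: {:.1f} +/- {:.1f} (std) over {} games".format(scores.mean(), scores.std(), len(scores)))

Note also that only four distinct scores appear in those 23 games, so the effective number of independent evaluation episodes is much smaller than 23.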

Then you have to know that you are training a Double Dueling Deep Q-Network with Prioritized Experience Replay (and not a vanilla DQN).
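For reference, a rough sketch (not the zoo's actual hyperparameter file; check the repo's hyperparams for the real settings) of how those extensions map onto SB2's DQN arguments, and how you would turn them off to get a vanilla DQN:

from stable_baselines import DQN
from stable_baselines.common.atari_wrappers import make_atari, wrap_deepmind

env = wrap_deepmind(make_atari("BreakoutNoFrameskip-v4"), frame_stack=True)

model = DQN(
    "CnnPolicy",
    env,
    double_q=True,                     # double Q-learning (set False for vanilla DQN)
    prioritized_replay=True,           # PER (set False for vanilla DQN)
    policy_kwargs=dict(dueling=True),  # dueling architecture (set dueling=False for vanilla DQN)
    verbose=1,
)
# model.learn(total_timesteps=int(1e7))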

araffin commented 4 years ago

The results were reproduced recently by Anssi using both SB2 and SB3 code: https://github.com/DLR-RM/stable-baselines3/pull/110#pullrequestreview-460304572

He disabled all the extensions for that (no PER, no double/dueling Q-learning).