Kaixhin / Rainbow

Rainbow: Combining Improvements in Deep Reinforcement Learning

Replicating DeepMind results #15

Closed: Kaixhin closed this issue 6 years ago

Kaixhin commented 6 years ago

As of 5c252ea, this repo has been checked over several times for discrepancies, but is still unable to replicate DeepMind's results. This issue is to discuss any further points that may need fixing.

Space Invaders (averaged losses): [reward plot]

Space Invaders (summed losses): [reward plot]

stringie commented 6 years ago

I believe I read somewhere that the loss should be the max over the minibatch, but I think the sum should work just as well.
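
For illustration, a minimal PyTorch-style sketch of the three aggregation choices over per-sample losses (the tensor here is made up, not taken from the repo):

```python
import torch

# Hypothetical per-sample losses for a minibatch of 32 transitions
# (e.g. the cross-entropy between target and predicted value distributions).
per_sample_loss = torch.rand(32)

loss_mean = per_sample_loss.mean()  # average over the minibatch
loss_sum = per_sample_loss.sum()    # sum over the minibatch (gradients scale with batch size)
loss_max = per_sample_loss.max()    # max over the minibatch (only the worst sample gets gradient)
```

The mean and sum differ only by a constant factor of the batch size (which interacts with the learning rate), whereas the max backpropagates through a single sample.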

Kaixhin commented 6 years ago

Asked Matteo a few questions, placed his responses here, and will do some new runs based on this new information.

Kaixhin commented 6 years ago

Trying a new game - Frostbite - which already seems to work fine! Space Invaders results are still in line with previous runs, but it will take longer to see whether it progresses to the same scores.

[Frostbite results plot]

stringie commented 6 years ago

Have you tried to see how it performs on Breakout? I'm really curious.

Kaixhin commented 6 years ago

I have limited resources, so not yet, but I plan to eventually. If that doesn't work, I'll add the fire-on-reset wrapper to the environment and try again.

stringie commented 6 years ago

I understand. It really is power-hungry to run.


stringie commented 6 years ago

Have you seen the paper on Distributed Prioritized Experience Replay? The results look amazing relative to what Rainbow can achieve. I wonder if it could also somehow be integrated into your project. Here's the paper: https://arxiv.org/pdf/1803.00933.pdf

Kaixhin commented 6 years ago

@stringie this issue is to track "replicating DeepMind results", so this is not relevant here. Anyway, adding extra components is a) currently not within the scope of this project and b) unhelpful when results from the original Rainbow paper still cannot be replicated.

Ashutosh-Adhikari commented 6 years ago

@Kaixhin is it possible to try the individual components of Rainbow and check whether the performance matches the respective papers?

Kaixhin commented 6 years ago

@Ashutosh-Adhikari only if I had written the code in a modular fashion that allowed these kinds of experiments. There are plenty of things I've not needed to include by just aiming for the full Rainbow model, and I don't have the capacity to refactor the code in this manner at the moment.

Ashutosh-Adhikari commented 6 years ago

@Kaixhin So I recently incorporated the PER from your code into a vanilla DQN and ran it on Breakout. It seems PER is where the issue lies, as it is not behaving as per the claims in the Prioritized Experience Replay paper (slightly worse than DQN while training, although more stable).

The experiments were run with a replay memory size of 1/5 of that in the DQN paper, i.e. 200k, but I'm not sure where the issue lies.

Also, I did not quite understand the function of the attribute 'n' of a ReplayMemory object in your code in memory.py, and hence the logic behind lines 109 and 116 in memory.py.

Kaixhin commented 6 years ago

@Ashutosh-Adhikari that's useful to know, thanks. Is any code public so that I can have a look at it?

n comes from n-step backups, so if you set n = 1 you should recover most of the original algorithms.
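
As a rough illustration (not the actual code in memory.py), the n-step target looks like this, with `bootstrap_value` standing in for whatever value the agent bootstraps from (e.g. the double-Q/distributional target in Rainbow):

```python
def n_step_return(rewards, bootstrap_value, discount=0.99, n=3):
    """Truncated n-step return: sum_{k=0}^{n-1} discount**k * r_{t+k} + discount**n * bootstrap_value.

    With n = 1 this collapses to the usual 1-step TD target r_t + discount * bootstrap_value.
    """
    target = sum(discount ** k * rewards[k] for k in range(n))
    return target + discount ** n * bootstrap_value

# Example: 3-step vs 1-step target built from the same trajectory fragment
rewards = [1.0, 0.0, 2.0]
print(n_step_return(rewards, bootstrap_value=5.0, n=3))  # 1 + 0.99**2 * 2 + 0.99**3 * 5
print(n_step_return(rewards, bootstrap_value=5.0, n=1))  # 1 + 0.99 * 5
```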

Ashutosh-Adhikari commented 6 years ago

@Kaixhin Thanks for the clarification regarding n. Give me a few days and I shall make the code public.

Until then, the basic code I used was from https://github.com/hengyuan-hu/rainbow.git.

The way they store the samples is slightly different; for example, a sample contains both the current and the next state. Nevertheless, that should not distort the logic.

Also, I'm not sure how scaling down the replay memory would affect the relative performance of modules related to experience replay (any insights on this?). I think the performance gap between vanilla DQN and PER DQN might shrink, but vanilla DQN > PER DQN should not happen (?).

Kaixhin commented 6 years ago

@Ashutosh-Adhikari actually PER can decrease performance on a few games. If you check Figure 6 in the appendix of the Rainbow paper, you will see that removing PER from Rainbow actually improves performance on Breakout. In this case you would want to test against a game that the paper shows is clearly positively affected, such as Yars' Revenge. For normal ER I would say that /5 may be OK, but with PER I'm less certain.

Ashutosh-Adhikari commented 6 years ago

@Kaixhin I think since we are comparing vanilla DQN against vanilla DQN plus PER (i.e. checking PER as a module), we might want to focus on the PER paper, which reports the improvement in Table 6 of its appendix. Or no?

If that's the case, should scaling down the replay memory matter or not (again, only in relative terms)?

Because, I believe, once we start removing PER from Double, Duelling, etc. DQN, it might become too complex to draw a conclusion from.

Kaixhin commented 6 years ago

@Ashutosh-Adhikari yes, sorry, the repo you pointed to does have most of the parts of Rainbow, but if you are trying to do a comparison against just the vanilla DQN then it's best to check Figure 7 of the original PER paper, as the (smoothed) learning curves are the most informative. The big caveat is that the double DQN paper introduced a new set of hyperparameters for the DQN algorithm which work better for double DQN, and I'm pretty sure the baseline results shown in Figure 7 don't use these improved hyperparameters (whereas the PER model does). Nevertheless, the PER learning curves should be distinctive, so Frostbite or Space Invaders are good games for testing whether PER is working correctly.

Ashutosh-Adhikari commented 6 years ago

@Kaixhin I think I can check Space Invaders (in a few days) after incorporating n and keeping n = 1. However, on a second run on Breakout, the best average (over 10 episodes) validation reward was 387 for PER and 384 for vanilla DQN, so the improvement is definitely not significant. Notes: 1) replay memory size of 200k (1e6/5); 2) not running for more than 25M frames (this should not be an issue); 3) sharing the training curve doesn't seem meaningful as of now.

[validation reward plot]

Kaixhin commented 6 years ago

@Ashutosh-Adhikari according to Figure 7 and other stats from the paper, there isn't much difference between the Breakout scores for all of the different methods, so it's difficult to draw conclusions from (the best tests are games where there should be a clear difference in learning), especially as the published results can come from taking an average over several runs.

If you're still investigating, plotting a) the max priority and b) the priorities of the 32 samples per minibatch over time might provide some clues as to whether things are going wrong.
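
In case it helps, here is a rough sketch of the kind of logging I mean; `mem`, `max_priority` and `priorities` are hypothetical stand-ins for whatever the PER implementation exposes, not this repo's actual API:

```python
import matplotlib.pyplot as plt

# Collected once per training step inside the training loop, e.g.:
#   max_priority_log.append(mem.max_priority)      # hypothetical attribute
#   batch_priority_log.append(priorities.mean())   # priorities of the 32 sampled transitions
max_priority_log = []
batch_priority_log = []

# After training, plot both quantities over time
plt.plot(max_priority_log, label='max priority')
plt.plot(batch_priority_log, label='mean minibatch priority')
plt.legend()
plt.savefig('priorities.png')
```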

Ashutosh-Adhikari commented 6 years ago

@Kaixhin Yep, the way they take averages is quite different. Averaging over 100 episodes would definitely raise that value, I am sure (law of large numbers, I think :P; when averaging over only 10 episodes, even one or two rare poor performances are enough to drag the number down).

Nevertheless, one should try it on different games as you mentioned, right. So basically we can't be sure, as of now, whether there is a bug in PER or not, right?

Kaixhin commented 6 years ago

@Ashutosh-Adhikari averaging over more evaluation episodes would result in a better approximation to the true mean performance; it shouldn't bias the value either way. If anything it may result in smoother curves, but the values should be comparable.
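
As a quick sanity check of that intuition, a toy simulation (unrelated to the actual agent) with a skewed return distribution shows that averaging over more episodes shrinks the spread of the estimate without shifting its expected value:

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_returns(size):
    # Skewed toy "episode return" distribution: mostly ~400, with occasional poor episodes
    good = rng.normal(400, 20, size)
    bad = rng.normal(150, 30, size)
    return np.where(rng.random(size) < 0.9, good, bad)

for n_episodes in (10, 100):
    estimates = [episode_returns(n_episodes).mean() for _ in range(1000)]
    print(n_episodes, round(float(np.mean(estimates)), 1), round(float(np.std(estimates)), 1))
# Both settings give a mean estimate around 375; only the spread shrinks with more episodes.
```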

It's pretty hard to tell where the bug is (it could even be in PyTorch if we're really unlucky), so yes, it may not be in PER, but PER is one of the trickiest things to get right, so that's the most likely place. In my experience, the best way to debug by running experiments is to pick settings where the learning curves show clear differences between the options. Unit tests would also be a good idea, but they depend on understanding the components correctly in the first place.

Ashutosh-Adhikari commented 6 years ago

@Kaixhin Just added a pull request to https://github.com/hengyuan-hu/rainbow.git for PER code inspired by your repo.

Kaixhin commented 6 years ago

Closing as results on Space Invaders are good and show a clear difference against previous methods: [Space Invaders reward plot]

Enduro (still 2x reported, not sure why): [Enduro reward plot]

Frostbite: [Frostbite reward plot]

albertwujj commented 5 years ago

Hi Kaixhin,

Are the hyperparams you used for these results currently the default ones in main.py?

Additionally, in the 'Human-level control' DQN paper, there is this note under Extended Table 4:

[screenshot of the note under Extended Table 4]

At 60 fps, 5 minutes is 18000 frames. This may be related to why your Enduro results are 2 times higher.

Thanks, Albert

Kaixhin commented 5 years ago

@albertwujj if you check the releases you'll see that the Enduro results now match the paper at just over 2k. The scores in general can be highly affected by the total frame cap for the Atari games and the ε value used for ε-greedy during validation.
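
For anyone comparing numbers, a minimal sketch of the evaluation loop those two settings affect; the environment/agent interfaces are generic gym-style stand-ins rather than this repo's exact API:

```python
import random

def evaluate_episode(env, agent, eval_epsilon=0.001, max_frames=108_000):
    """Run one evaluation episode with ε-greedy action selection and a frame cap.

    A different eval_epsilon or frame cap (e.g. 108,000 frames = 30 min of game time at 60 fps
    vs 18,000 frames = 5 min) can noticeably change reported Atari scores such as Enduro.
    """
    state, done, total_reward, frames = env.reset(), False, 0.0, 0
    while not done and frames < max_frames:
        if random.random() < eval_epsilon:
            action = env.action_space.sample()  # random action with probability ε
        else:
            action = agent.act(state)           # greedy action from the trained network
        state, reward, done, _ = env.step(action)
        total_reward += reward
        frames += 1
    return total_reward
```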