Kaixhin / Rainbow

Rainbow: Combining Improvements in Deep Reinforcement Learning
MIT License

Prioritised Experience Replay #7

Closed marintoro closed 6 years ago

marintoro commented 6 years ago

I am interested in implementing Rainbow too. I haven't gone deep into the code yet, but I just saw in the README.md that Prioritised Experience Replay is not checked. Will this feature be implemented, or is it perhaps already working? In their paper, DeepMind actually show that Prioritized Experience Replay is the most important component: the "no priority" ablation has the biggest performance gap compared with the full Rainbow.

Kaixhin commented 6 years ago

PER is still in progress on this branch. Most of it is in place, but the samples need to be checked for validity and then put together in a batch for learning. Unfortunately I don't have time to finish this any time soon, so if anyone would like to finish it off and submit a PR that would be really appreciated.

Kaixhin commented 6 years ago

OK, I've finished an implementation and merged it into master. Leaving this issue open because I haven't tested it empirically and there's a reasonable chance of bugs.

marintoro commented 6 years ago

Oh nice!! I don't really have time to test it right now though, but I will definitely try it at the beginning of January!

Kaixhin commented 6 years ago

For anyone who can work on this, here's a list of issues with the memory (not including the fact that correct indexing is easy to get wrong):

marintoro commented 6 years ago

So today I had a little time to look at these different issues. For the max, it's true that it's not updated properly, but after rereading the original "Prioritized Experience Replay" paper I don't think they use a max in the way you are trying to. I think they just insert new transitions with a fixed, high initial priority, which is what actually happens in the current implementation. I say this because of this sentence from the paper: "New transitions arrive without a known TD-error, so we put them at maximal priority in order to guarantee that all experience is seen at least once".
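For what it's worth, here is a minimal sketch of that scheme as I read the paper (the class and method names are illustrative, not memory.py's actual API): new transitions are appended at the running maximum priority, and the maximum is updated whenever priorities are refreshed from TD errors.

class SimplePrioritisedBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [None] * capacity
        self.priorities = [0.0] * capacity
        self.index = 0            # next (circular) write position
        self.max_priority = 1.0   # priority assigned to unseen transitions

    def append(self, transition):
        # New transitions get the current maximum priority so that every
        # transition is sampled at least once before its TD error is known.
        self.data[self.index] = transition
        self.priorities[self.index] = self.max_priority
        self.index = (self.index + 1) % self.capacity

    def update_priorities(self, idxs, td_errors):
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err)
            self.max_priority = max(self.max_priority, abs(err))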

marintoro commented 6 years ago

I think there is actually a problem when the sampled transition straddles the current buffer index, more specifically in the calculation of the discounted reward (line 121 in memory.py).

returns = [R + self.discount ** n * transition.reward for R, transition in zip(returns, full_transitions[self.history + n])]

Indeed, transition.reward may be non-zero if we reach the buffer index; a quick and dirty fix would be to resample the last index when this happens (only the last one can sometimes be too close to the buffer index...)?
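To make the straddling case concrete, here is a hedged sketch of the kind of validity check that could be applied to a sampled index (the helper and its arguments are hypothetical, not the current memory.py code): a sample is rejected if any transition in its history/n-step window coincides with the circular write position.

def is_valid_sample(sample_idx, write_idx, capacity, history, n_step):
    # The sample needs transitions at offsets 1 - history .. n_step around it;
    # if any of those wraps onto the current write position, old and new data
    # would be mixed, so the sample should be rejected (or resampled).
    for t in range(1 - history, n_step + 1):
        if (sample_idx + t) % capacity == write_idx:
            return False
    return True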

marintoro commented 6 years ago

Finally, I think there is a small indexing bug at line 105:

full_transitions = [[self.transitions.get(i + t) for i in idxs] for t in range(1 - self.history, self.n + 1)]

should be

full_transitions = [[self.transitions.get(i + t) for i in idxs] for t in range(1 - self.history, self.n)]

and line 121

returns = [R + self.discount ** n * transition.reward for R, transition in zip(returns, full_transitions[self.history + n])]

should be

returns = [R + self.discount ** n * transition.reward for R, transition in zip(returns, full_transitions[self.history + n - 1])]

Indeed in the current implementation we never use the reward from full_transitions[self.history] and we go one index too far.

Kaixhin commented 6 years ago

Thanks for looking into this! If you could collect these small fixes into a PR that'd be really useful. I would love to do a test run with these fixes myself but unfortunately I've lost access to my equipment :'(

marintoro commented 6 years ago

OK, I will try to do it next week if I have time.

Kaixhin commented 6 years ago

Managed to have a look, but I think the initial indexing is fine (as range is non-inclusive at the end) - it returns frames from the current state s_{0-(4 - 1):0} = s_{-3:0} and the nth (3rd) next state s_{0:0+3} = s_{0:3}. And rewards from the time step ahead, r_{t+1}, are actually stored with the frames/actions from the previous timestep, s_t/a_t. So the steps for the return calculation should be r_1 + γr_2 + γ^2r_3, and this is part of the update rule along with the 3rd next state (which is stored alongside r_4). So if you could double-check that'd be useful (as an aside, your fixes break the code for getting the nth next state, I think due to the non-inclusive range).

marintoro commented 6 years ago

Haha, indexing is so easy to get wrong, and I can't run the code right now (I am just reading it here). But if the current state (s_0) is stored with the reward from the time step ahead (i.e. r_1), we get:

r_1 = full_transitions[self.history - 1].reward #Like at line 118

And then when we want to compute discounted_reward = r_1 + γr_2 + γ^2r_3, i.e.

returns = [R + self.discount ** n * transition.reward for R, transition in zip(returns, full_transitions[self.history + n])]

but since r_2 is stored in full_transitions[self.history] and not in full_transitions[self.history + 1], we are actually computing r_1 + γr_3 + γ^2r_4 (because n goes from 1 to 2 in range(1, self.n)), and that's why I said you need to add a -1.
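To spell the argument out with the default hyperparameters (history = 4, multi-step n = 3), here is a tiny worked check, assuming the convention above that the transition stored at time t carries r_{t+1}:

# carried_reward[k] is the reward carried by full_transitions[k], i.e. by the
# transition at offset t = k - (history - 1); dummy values r_i = i are used,
# and the entries for t < 0 are just history frames, so they are set to 0.
history, n_step, gamma = 4, 3, 0.99
carried_reward = [0, 0, 0, 1, 2, 3, 4]   # indices 0..6 correspond to t = -3..3
R = carried_reward[history - 1]                          # r_1
for n in range(1, n_step):
    R += gamma ** n * carried_reward[history + n - 1]    # r_2, then r_3
assert abs(R - (1 + gamma * 2 + gamma ** 2 * 3)) < 1e-9  # r_1 + γ·r_2 + γ²·r_3
# Indexing carried_reward[history + n] instead yields r_1 + γ·r_3 + γ²·r_4.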

Kaixhin commented 6 years ago

Good catch - I updated the returns calculation in f398c1a. I've also added a check to discard transitions that straddle the buffer index in 9cb5f59 - it would be good to check. This may well happen a lot, so it should be caught, and since finding unique, valid transitions to replace the invalid ones is relatively difficult, I've made the decision to just drop them from the batch.

marintoro commented 6 years ago

I left a comment on the returns-calculation commit (I think there is still a problem). I am sorry, I don't have access to my coding machine here; I will try to do a proper PR if there are still bugs next week.

Kaixhin commented 6 years ago

Nothing promising with a long run on Space Invaders. I have a new branch no-prior which should have a normal experience replay memory for testing, as it is still likely that the prioritised replay is the problem.


marintoro commented 6 years ago

Ah, too bad! So there is still a sneaky bug in the current implementation. I may have some time this week to look for it! Btw, did you launch a full training run on the no-prior branch? If you did, you could compare the no-prior results against the current master to see if it's really the prioritized experience replay that is the problem.

Kaixhin commented 6 years ago

Yep, I'm running Space Invaders on no-prior as a comparison - I only had time to quickly revert to a uniform replay memory, so hopefully there are no bugs there. Unfortunately, without prioritized experience replay nothing really happens until the end of training, so it'll have to run for about 2 weeks to see. I also made a small change to reduce the effect of clamping in 20f7df3 just in case, but I doubt that this is the issue.

aiXander commented 6 years ago

Fantastically clean implementation! Too bad it's still not perfect - I'd like to help develop this. I ran some initial runs and it seems like the action distributions are very heavily skewed at the beginning of training. Even if I lower the noise in the linear layers, the exploration seems very non-uniform - any idea why this is? For example, in a game like Pong, the agent almost never gets a reward since it simply isn't moving...

Kaixhin commented 6 years ago

@tr1pzz Thanks! If you're looking for bugs, they're most likely in the replay memory, so stepping through the logic manually (rather than assuming the comments are correct) is one way to check. The max value definitely isn't being updated properly, and perhaps that's an issue. Another difference is that I haven't fully replicated DM's environment alterations - OpenAI have an attempt here - and this again could be slowing down/disrupting learning.

It's not mentioned in DM's work because they don't do it, but one way to try and encourage uniform Q-values at the start is to instead use orthogonal weight initialisation (nn.init.orthogonal) with a small gain factor (e.g. 0.1).
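A minimal sketch of that initialisation, assuming a standard PyTorch model with Linear/Conv2d layers (newer PyTorch versions spell the function nn.init.orthogonal_):

import torch.nn as nn

def init_orthogonal(model, gain=0.1):
    # Orthogonal weight initialisation with a small gain keeps the initial
    # Q-values close together, so the initial policy is roughly uniform.
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            nn.init.orthogonal_(m.weight, gain=gain)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)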

marintoro commented 6 years ago

Do you have first results from your training on the no-prior branch on Space Invaders, or is it still too soon to really compare with the master training?

I don't really think the max value is an issue; in their paper DM just say that they put new transitions at maximum priority "to guarantee that all experience is seen at least once" (maybe you could raise the initial max value though). On the other hand, the DM environment alterations you are talking about seem really interesting (particularly FireResetEnv - when I was training on Breakout it was not working at all because of this exact issue).
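For reference, a minimal sketch of the kind of wrapper being referred to, assuming the old Gym step/reset API (this is a paraphrase of the idea, not the Baselines code verbatim): it presses FIRE after each reset so that games like Breakout actually start.

import gym

class FireOnReset(gym.Wrapper):
    FIRE = 1  # assumed ALE action index for FIRE in most games; check per game

    def reset(self, **kwargs):
        # Press FIRE once after reset; if that somehow ends the episode, reset again.
        self.env.reset(**kwargs)
        obs, _, done, _ = self.env.step(self.FIRE)
        if done:
            obs = self.env.reset(**kwargs)
        return obs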

Finally, in order to encourage uniform Q-values at the start, could we just use an ε-greedy policy in the very early stage of training (and anneal ε to 0 after a few steps, as is done in DM's papers when noisy networks are not used)?
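If someone wants to try that for debugging, a hedged sketch of a linearly annealed ε-greedy layer on top of the agent's usual action selection (agent.act and the step counts are hypothetical, not this repo's interface):

import random

def epsilon(step, eps_start=1.0, eps_end=0.0, anneal_steps=50000):
    # Linear annealing of ε over the first anneal_steps environment steps.
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(agent, state, step, num_actions):
    if random.random() < epsilon(step):
        return random.randrange(num_actions)  # uniform exploration early on
    return agent.act(state)                   # otherwise the usual (noisy) policy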

aiXander commented 6 years ago

Using an ε-greedy policy early on seems like a good idea; the noisy layers really take a while to converge to useful action distributions in my runs. Additionally, I had two consecutive runs on remote VMs that crashed due to running out of memory - any idea how much memory the 1M replay states take?

Kaixhin commented 6 years ago

@marintoro So far the results are very similar, which is what is indicated by the graphs in the paper. I agree that the current max implementation is unlikely to be an issue, but it's almost certainly wrong, so I'll try to deal with it at some point. If you spot anything missing from the environment alterations, like FireResetEnv, then we should probably do experiments with those (though Space Invaders doesn't seem affected).

If people want to use ε-greedy at the start to try and get better performance/help debug then go ahead, but as a reference implementation for Rainbow I won't merge that into master.

The full replay memory will take roughly 6-7GB. This is a lot, hence the storage of states as ByteTensors rather than FloatTensors, which would otherwise take 4x the amount.
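The rough arithmetic behind that estimate, assuming one 84x84 greyscale uint8 frame is stored per transition for 1M transitions (ignoring the small per-transition overhead for action, reward and terminal flag):

capacity = 1000000
frame_bytes = 84 * 84                          # 7,056 bytes per uint8 frame
total_gb = capacity * frame_bytes / 1e9
print('~%.1f GB for frames alone' % total_gb)  # ≈ 7.1 GB; float32 would be 4x this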

Kaixhin commented 6 years ago

Simple sanity check - does master play Pong? Yes. [reward plots]

marintoro commented 6 years ago

Hmm, I have already tried 3 times - I even took your exact code for the last trial - but it never works on Pong. After something like 2-3 million steps, the Atari environment gets stuck during testing (the enemy paddle is stuck in the score tab in the rendered screen...). I think it's a bug in ALE which can maybe be fixed with the atari_wrapper from OpenAI. But I don't understand why you don't get this problem - maybe we have different versions of atari-py? Can you tell me your version?

Kaixhin commented 6 years ago

I'm using atari-py 0.1.1. I've come across this hanging on occasion, but I've never been able to reproduce it. If you are able to investigate using pdb or gdb then they should be able to help (unfortunately I've never really used these so can't offer any advice). Otherwise you might want to try porting atari_wrapper to see if that does work?

charles882 commented 6 years ago

Hi, do you find that sometimes when sampling data it gets stuck in a loop? It samples from a segment, and if that segment is close to the latest transition index, the sample gets flagged as invalid; it then resamples from that segment again, repeats itself and hangs.

Kaixhin commented 6 years ago

@charles882 Ah, that might have been introduced by 098ccad - previously I was dropping invalid samples, but saw that this could result in quite a few batches of size 31 (and your comment points to the segment containing the transition index being the problem). I haven't had an issue with hanging so far, so can you let me know if you've changed any parameters from the defaults? I think it should always be possible to find a valid sample in each segment, but I'll try to have a closer look at the rejection criteria to tidy them up.
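For context, a hedged sketch of the proportional segment-sampling scheme being discussed (tree.total and tree.find are illustrative names, not the repo's exact API): the total priority mass is split into batch_size equal segments, one sample is drawn per segment, and a bounded number of retries avoids hanging when a segment keeps producing invalid samples.

import random

def sample_batch(tree, batch_size, is_valid, max_retries=100):
    segment = tree.total() / batch_size   # equal slice of the total priority mass
    idxs = []
    for i in range(batch_size):
        for _ in range(max_retries):
            mass = random.uniform(segment * i, segment * (i + 1))
            prob, idx = tree.find(mass)   # leaf whose cumulative range contains mass
            if prob > 0 and is_valid(idx):
                idxs.append(idx)
                break
        # if no valid sample is found within max_retries, the segment is skipped
    return idxs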

charles882 commented 6 years ago

@Kaixhin After checking again I think the code is okay - my apologies. That's correct, my comment was referring to the latest update with 098ccad.

I should have mentioned that I did not run the full code; I am trying to understand and find an implementation of prioritised experience replay. So I only took the memory.py file, then input samples and retrieved them, running parts of the code as I was trying to understand the implementation. I think the rejection criteria are sound, so you can ignore my previous comment.

I made a mistake when running update_priorities: I used the data index instead of the tree index, and I believe this caused a loop where it kept finding data with 0 probabilities (the bottom nodes were never updated with values since I used data indexes). The other error loop that I mentioned above happens when the memory size is too small.
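To illustrate that pitfall with an array-backed sum tree (an illustrative layout, not necessarily the exact one in memory.py): the leaves live at an offset of capacity - 1, so a data index has to be converted to a tree index before updating, otherwise only internal nodes get touched and the leaf probabilities stay at 0.

class SumTree:
    # Internal nodes occupy indices [0, capacity - 1); leaves occupy
    # [capacity - 1, 2 * capacity - 1), one leaf per stored transition.
    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = [0.0] * (2 * capacity - 1)

    def update(self, data_idx, priority):
        tree_idx = data_idx + self.capacity - 1   # the conversion that is easy to miss
        delta = priority - self.nodes[tree_idx]
        self.nodes[tree_idx] = priority
        while tree_idx != 0:                      # propagate the change up to the root
            tree_idx = (tree_idx - 1) // 2
            self.nodes[tree_idx] += delta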

Kaixhin commented 6 years ago

After 5c252ea, I got a run on Space Invaders where I saw max scores reaching 6000 before my machine unfortunately died. Therefore I'm closing this in favour of https://github.com/Kaixhin/Rainbow/issues/15, a more general issues thread, as it's unlikely that prioritised experience replay is still the problem.