guydav opened this issue 5 years ago
Another question which I'll tack on here: the default value for the `target-update` parameter is 8000, which matches Table 1 in the Rainbow paper, where it is reported as 32K frames.
Do you have a sense of why the data-efficient Rainbow paper, in Table 2 (Appendix E), reports the update period for both Rainbow and the data-efficient Rainbow as being every 2000 updates?
Ah, honestly, using a fixed number of episodes is something I came up with (it makes sense, keeps the statistics easy, and also works across other environments), and I completely overlooked that evaluation detail. My default - 10 episodes - could run for at most ~2x as long as DeepMind's procedure, but I feel like 5 episodes may be a bit too few to get a good estimate? So I'm in favour of keeping what I've done, but maybe noting in the readme that this differs from the original procedure - what do you think?
As noted in the footnote of the data-efficient paper, the target network update period is reported with respect to online network updates, which only happen every 4 agent steps in the original setup; hence 2000 updates × 4 steps/update = 8000 steps between target network updates.
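To make the unit conversions explicit (just a back-of-the-envelope sketch; the constant names are mine, not the repository's):

```python
# Converting between ALE frames, agent (environment) steps and gradient updates,
# under the standard Atari setup: an action repeat of 4 and one online-network
# update every 4 agent steps.
ACTION_REPEAT = 4            # each agent action is repeated for 4 ALE frames
STEPS_PER_UPDATE = 4         # the online network is updated every 4 agent steps

target_update_frames = 32_000                                     # Table 1, Rainbow paper
target_update_steps = target_update_frames // ACTION_REPEAT       # 8000, the repo's default
target_update_updates = target_update_steps // STEPS_PER_UPDATE   # 2000, Table 2, data-efficient paper

print(target_update_steps, target_update_updates)  # 8000 2000
```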
I think that makes sense. I don't know why you'd evaluate over a fixed number of frames rather than episodes. You could make a TODO to eventually implement their evaluation procedure too? It wouldn't be hard at all but might take a bit of someone's time.
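For concreteness, here's a rough sketch of the two evaluation procedures being compared - a fixed episode count (this repo's default of 10) versus a fixed frame budget as in DeepMind's protocol. `env` and `agent` are placeholders, not this repository's actual classes:

```python
def evaluate_fixed_episodes(env, agent, num_episodes=10):
    """Evaluation over a fixed episode count: average return over num_episodes."""
    returns = []
    for _ in range(num_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, reward, done = env.step(agent.act(state))  # placeholder env API
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)


def evaluate_fixed_frames(env, agent, frame_budget=500_000, action_repeat=4):
    """DeepMind-style evaluation: run until a frame budget is exhausted, then average
    over however many (possibly truncated) episodes fit in that budget."""
    frames, returns = 0, []
    while frames < frame_budget:
        state, done, total = env.reset(), False, 0.0
        while not done and frames < frame_budget:
            state, reward, done = env.step(agent.act(state))
            total += reward
            frames += action_repeat  # each agent step consumes action_repeat ALE frames
        returns.append(total)
    return sum(returns) / len(returns)
```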
Another realization regarding the target-update bit. These 32K frames are not, actually, 32K unique frames, right? If I understand correctly, we repeat every action 4 times, add the pixel-wise max over the last two frames of the repeat to the state (and drop the oldest frame from the state), and pass the 4-frame state buffer to the agent.
In other words, the states passed to the agent for two consecutive actions share three of their four stacked frames, correct?
(Also, which paper does the max over the last two frames of the action repetition come from, if any? I'm just trying to trace all of these implementation details.)
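For anyone else tracing these details, here's a minimal sketch of the preprocessing described above (not the repo's actual wrapper; the class and method names are illustrative, and the environment API is a placeholder):

```python
import collections
import random

import numpy as np


class AtariPreprocessing:
    """Sketch: repeat each action 4 times, take the pixel-wise max over the last two
    raw frames of the repeat, and keep a rolling stack of the 4 most recent
    processed frames as the agent's state."""

    def __init__(self, env, action_repeat=4, stack_size=4):
        self.env = env
        self.action_repeat = action_repeat
        # Rolling buffer of processed frames; the oldest is dropped automatically.
        self.frames = collections.deque(maxlen=stack_size)

    def reset(self):
        frame = self.env.reset()
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(list(self.frames))

    def step(self, action):
        total_reward, done = 0.0, False
        last_two = [None, None]
        for _ in range(self.action_repeat):
            frame, reward, done = self.env.step(action)  # placeholder env API
            total_reward += reward
            last_two = [last_two[1], frame]
            if done:
                break
        # Pixel-wise max over the last two raw frames (handles Atari sprite flicker).
        processed = np.maximum(last_two[0], last_two[1]) if last_two[0] is not None else last_two[1]
        self.frames.append(processed)  # drops the oldest of the stacked frames
        # Consecutive states therefore share 3 of their 4 stacked frames.
        return np.stack(list(self.frames)), total_reward, done
```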
OK, I've added a TODO with 2188966.
You are correct. I can't remember where this was first reported in a paper, but I've spent years trying to replicate DeepMind's results, and I did a lot of digging into their Atari wrapper repositories and the DQN source code released with the Nature paper to get these implementation details.
Thanks for the clarification. There appears to be so much voodoo around the implementation details that it's quite hard to know when you can trust your results.
It's interesting that the original Rainbow paper frames it as updating every 32K frames, which, while strictly true, corresponds to far fewer unique frames actually seen by the agent, given the action repetition and the overlap between consecutive states.
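A quick back-of-the-envelope check of that point, assuming the standard setup of action repeat 4 with a max over the last two raw frames of each repeat:

```python
# Raw ALE frames between target-network updates vs. what the agent actually observes.
ale_frames = 32_000
action_repeat = 4
agent_observations = ale_frames // action_repeat           # 8000 processed frames
raw_frames_reaching_agent = agent_observations * 2         # each is a max over 2 raw frames -> 16000
raw_frames_never_seen = ale_frames - raw_frames_reaching_agent  # 16000 raw frames are skipped entirely
```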
@Kaixhin -- it seems that most papers also evaluate on either no-op starts or human starts. Did you ever take a stab at implementing either?
No-op starts are still used during evaluation. Haven't tried human starts, but have no idea where you would get them from (presumably internal to DeepMind).
Ah, I see, in env.reset(). That makes sense.
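For reference, a minimal sketch of what no-op starts at reset look like, following the usual DQN convention of up to 30 random no-ops; `env` is a placeholder, not this repository's wrapper:

```python
import random

NOOP_ACTION = 0  # Atari action 0 is the no-op


def reset_with_noops(env, max_noops=30):
    """Apply a random number of no-op actions after reset so each evaluation
    episode starts from a slightly different initial state."""
    state = env.reset()
    for _ in range(random.randint(1, max_noops)):
        state, _, done = env.step(NOOP_ACTION)  # placeholder env API
        if done:
            state = env.reset()
    return state
```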
Hi Kai,
In the Rainbow paper, the evaluation procedure is described as suspending learning and evaluating the latest agent over a fixed budget of frames (500K frames, with episodes truncated at 108K frames).
However, the code as written tests for a fixed number of episodes. Am I missing anything? Or is this the procedure from the data-efficient Rainbow paper? (I couldn't find a detailed description there.)
Thanks!