Kaixhin / Rainbow

Rainbow: Combining Improvements in Deep Reinforcement Learning

Is the evaluation procedure different? #57

Open guydav opened 4 years ago

guydav commented 4 years ago

Hi Kai,

In the Rainbow paper, the evaluation procedure is described as

The average scores of the agent are evaluated during training, every 1M steps in the environment, by suspending learning and evaluating the latest agent for 500K frames. Episodes are truncated at 108K frames (or 30 minutes of simulated play).

However, the code as written tests for a fixed number of episodes. Am I missing anything? Or is this the procedure from the data-efficient Rainbow paper (I couldn't find a detailed description there).

Thanks!
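For concreteness, here is a rough sketch of the two procedures being compared. This is illustrative only: `run_episode` is a hypothetical helper returning `(total_reward, frames_consumed)`, not the repo's actual API.

```python
# run_episode is a hypothetical helper: returns (total_reward, frames_consumed).

# DeepMind's procedure: evaluate for a fixed budget of frames,
# truncating each episode at 108K frames (~30 minutes of simulated play).
def evaluate_fixed_frames(agent, env, eval_frames=500_000, max_episode_frames=108_000):
    scores, frames_used = [], 0
    while frames_used < eval_frames:
        score, episode_frames = run_episode(agent, env, max_frames=max_episode_frames)
        scores.append(score)
        frames_used += episode_frames
    return sum(scores) / len(scores)

# This repo's procedure: evaluate for a fixed number of episodes (default 10).
def evaluate_fixed_episodes(agent, env, episodes=10):
    scores = [run_episode(agent, env)[0] for _ in range(episodes)]
    return sum(scores) / len(scores)
```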

guydav commented 4 years ago

Another question which I'll tack onto here -- the default value for the target-update parameter is 8000, which matches Table 1 in the Rainbow paper, where it is reported as 32K frames.

Do you have a sense of why the data-efficient Rainbow paper, in Table 2 (Appendix E), reports the update period for both Rainbow and the data-efficient Rainbow as being every 2000 updates?

Kaixhin commented 4 years ago

Ah, honestly, evaluating over a fixed number of episodes is something I came up with myself (it makes sense, keeps the statistics easy, and also works across other environments), and I completely overlooked that evaluation detail. My default of 10 episodes could run for up to ~2x as long as DeepMind's procedure, but I feel like 5 episodes may be too few to get a good estimate? So I'm in favour of keeping what I've done, but maybe noting in the readme that this differs from the original procedure - what do you think?

As noted in the footnote of the data-efficient paper, the target network update period is reported with respect to online network updates, which happen only every 4 steps in the original setup; 2000 updates therefore correspond to the 8000-step target network update period.
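In other words, the numbers line up as follows (just the arithmetic, with the constants taken from the discussion above):

```python
# Reconciling the reported target-network update periods.
updates_between_target_sync = 2000  # data-efficient Rainbow, Table 2: every 2000 updates
steps_per_online_update = 4         # online network updated every 4 agent steps (original Rainbow)
frames_per_step = 4                 # action repeat: each agent step spans 4 environment frames

agent_steps = updates_between_target_sync * steps_per_online_update  # 8000, the repo's default
env_frames = agent_steps * frames_per_step                           # 32000, Rainbow paper, Table 1
```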

guydav commented 4 years ago

I think that makes sense. I don't know why you'd evaluate over a fixed number of frames rather than episodes. You could make a TODO to eventually implement their evaluation procedure too? It wouldn't be hard at all but might take a bit of someone's time.

Another realization regarding the target-update bit. These 32K frames are not, actually, 32K unique frames, right? If I understand correctly, we repeat every action 4 times, add the max pool of the last two action frames to the state (and drop the oldest frame from the state), and pass the 4-frame state buffer to the agent.

In other words, every pair of consecutive agent steps shares three of the four frames in the state stack, correct?

(also, which paper does the max pool of the last two frames of the action repetition come from, if at all? I'm just trying to trace all of these implementation details.)
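For reference, a rough sketch of the frame pipeline described above (a gym-style env is assumed; this is a simplification for illustration, not the repo's actual wrapper code):

```python
import collections
import numpy as np

ACTION_REPEAT = 4  # each agent action is repeated for 4 environment frames
STACK_SIZE = 4     # the agent sees a stack of the 4 most recent processed frames

def step_with_repeat(env, action, frame_stack):
    """Repeat `action`, max-pool the last two raw frames (to remove Atari
    sprite flicker), and push the result onto the 4-frame state stack."""
    raw_frames, total_reward, done = [], 0.0, False
    for _ in range(ACTION_REPEAT):
        frame, reward, done, _ = env.step(action)  # old gym-style step signature
        raw_frames.append(frame)
        total_reward += reward
        if done:
            break
    pooled = np.maximum.reduce(raw_frames[-2:])  # max over the last two raw frames
    frame_stack.append(pooled)                   # deque(maxlen=STACK_SIZE): oldest frame drops off
    state = np.stack(frame_stack)                # consecutive states share 3 of their 4 frames
    return state, total_reward, done

# Usage: frame_stack = collections.deque([initial_frame] * STACK_SIZE, maxlen=STACK_SIZE)
```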

Kaixhin commented 4 years ago

OK, I've added a TODO with 2188966.

You are correct. I can't remember where this was first reported in a paper, but I've spent years trying to replicate DeepMind's results, and I did a lot of digging into their Atari wrapper repositories and the DQN source code released with the Nature paper to get these implementation details.

guydav commented 4 years ago

Thanks for the clarification. There appears to be so much voodoo around the implementation details that it's quite hard to know when you can trust your results.

It's interesting that the original Rainbow paper frames it as updating every 32K frames, which, while strictly true, corresponds to far fewer unique game frames given the overlap.

guydav commented 4 years ago

@Kaixhin -- it seems that most papers also evaluate on either no-op starts or human starts. Did you ever take a stab at implementing either?

Kaixhin commented 4 years ago

No-op starts are still used during evaluation. Haven't tried human starts, but have no idea where you would get them from (presumably internal to DeepMind).
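For reference, a minimal sketch of how no-op starts are typically applied at reset (a gym-style env is assumed; names are illustrative, not the repo's actual wrapper):

```python
import random

NOOP_ACTION = 0  # action 0 is NOOP in the ALE

def reset_with_noops(env, max_noops=30):
    """Take a random number of no-op actions after reset so that evaluation
    episodes begin from varied initial states."""
    state = env.reset()
    for _ in range(random.randrange(max_noops + 1)):  # up to max_noops no-ops
        state, _, done, _ = env.step(NOOP_ACTION)     # old gym-style step signature
        if done:
            state = env.reset()
    return state
```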

guydav commented 4 years ago

Ah, I see, in env.reset(). That makes sense.