Open ankeshanand opened 4 years ago
Hi, I think you're right! Sorry about that! When scaling up to bigger batch sizes and more parallel environments, I think I also often used a setting like `batch_size=256`, `sampler.batch_B=16`, `sampler.batch_T=2`, and to keep the same 1e4 env steps would use `target_update_interval=312`, but possibly that number is a coincidence.
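For what it's worth, the number may not be a coincidence. A rough check of that scaled-up setting (assuming rlpyt's default `replay_ratio=8`; the variable names below are illustrative, not the rlpyt API):

```python
# Rough check of the scaled-up setting above (assumes the default
# replay_ratio=8; variable names are illustrative, not the rlpyt API).
batch_size = 256
batch_B, batch_T = 16, 2   # parallel envs * time steps per sampler iteration
replay_ratio = 8

env_steps_per_itr = batch_B * batch_T                            # 32
updates_per_itr = replay_ratio * env_steps_per_itr / batch_size  # 1.0

# Gradient steps that span 1e4 env steps at this update rate:
interval = 1e4 / env_steps_per_itr * updates_per_itr

print(interval)  # 312.5
```

So at one optimizer update per 32 env steps, 312 gradient steps land almost exactly on 1e4 env steps.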
The default `target_update_interval` for DQN-based algorithms is set to 312 and is not changed for any of the variants in the configs (except for R2D1, which seems to be correctly set to 2500). I don't think the default matches the DQN / Rainbow papers, however.
This comment here says the default of 312 should correspond to 1e4 env steps. But with a replay ratio of 8 and a batch size of 32, we do an optimizer update every 4 env steps (32 / 8 = 4). That corresponds to updating the target network every 4 * 312 = 1248 steps in the environment.
Is my understanding correct? And if so, shouldn't the default value of `target_update_interval` be 2500 gradient steps (2500 * 4 = 1e4 env steps) for DQN and 2000 for Rainbow (32K frames, as stated in the paper)?
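A sketch of where those proposed numbers come from (assuming `replay_ratio=8`, `batch_size=32`, and the standard Atari frame skip of 4; names are illustrative, not the rlpyt API):

```python
# Deriving the proposed defaults (assumes replay_ratio=8, batch_size=32,
# and the standard Atari frame skip of 4; names are illustrative).
env_steps_per_update = 32 / 8   # batch_size / replay_ratio = 4.0

# DQN paper: refresh the target network every 1e4 env steps.
dqn_interval = 1e4 / env_steps_per_update               # 2500.0 gradient steps

# Rainbow paper: every 32K frames = 32_000 / 4 = 8000 env steps.
rainbow_interval = (32_000 / 4) / env_steps_per_update  # 2000.0 gradient steps

print(dqn_interval, rainbow_interval)  # 2500.0 2000.0
```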