astooke / rlpyt

Reinforcement Learning in PyTorch
MIT License

Possible misconfiguration in target_update_interval #150

Open ankeshanand opened 4 years ago

ankeshanand commented 4 years ago

The default target_update_interval for DQN-based algorithms is set to 312 and is not changed in the configs for any of the variants (except for R2D1, which appears to be correctly set to 2500). However, I don't think this default matches the DQN / Rainbow papers.

This comment here says the default value of 312 should correspond to 1e4 env steps. But with a replay ratio of 8 and a batch size of 32, an optimizer update happens every 32 / 8 = 4 env steps, which means the target network is updated every 4 * 312 = 1248 steps in the environment.

Is my understanding correct? And if so, shouldn't the default value of target_update_interval be 2500 gradient steps for DQN (2500 * 4 = 1e4 env steps) and 2000 for Rainbow (the paper says 32K frames, i.e. 8K env steps at frame skip 4, so 8000 / 4 = 2000 updates)?
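
A minimal sketch of this arithmetic, assuming rlpyt's convention that replay_ratio = (batch_size * updates) / env_steps, so each gradient update consumes batch_size / replay_ratio env steps:

```python
# Sketch of the interval arithmetic above. Assumes rlpyt's convention
# replay_ratio = (batch_size * updates) / env_steps; the 2000 value for
# Rainbow assumes 32K frames = 8K env steps at frame skip 4.

def env_steps_per_update(batch_size, replay_ratio):
    """Env steps collected between consecutive optimizer updates."""
    return batch_size / replay_ratio

def target_interval_in_env_steps(interval_grad_steps, batch_size, replay_ratio):
    """Convert a target-update interval in gradient steps to env steps."""
    return interval_grad_steps * env_steps_per_update(batch_size, replay_ratio)

# Current default: 312 gradient steps at batch size 32, replay ratio 8.
print(target_interval_in_env_steps(312, 32, 8))   # 1248.0 env steps -- too short

# Values matching the papers:
print(target_interval_in_env_steps(2500, 32, 8))  # 10000.0 env steps (DQN, 1e4)
print(target_interval_in_env_steps(2000, 32, 8))  # 8000.0 env steps (Rainbow, 32K frames)
```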

astooke commented 4 years ago

Hi, I think you're right! Sorry about that!

When scaling up to bigger batch sizes and more parallel environments, I think I often used a setting like batch_size=256, sampler.batch_B=16, sampler.batch_T=2, and would keep target_update_interval=312 to preserve the same 1e4 env steps, but possibly that number is a coincidence.
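
A quick check of that scaled configuration, assuming the default replay ratio of 8 is kept (the comment above does not state it): each sampler iteration collects batch_B * batch_T = 32 env steps and triggers one gradient update, so 312 updates span roughly 1e4 env steps.

```python
# Hypothetical check of the scaled-up configuration above, assuming
# the default replay ratio of 8 (an assumption, not stated in the comment).
batch_size = 256
batch_B, batch_T = 16, 2
replay_ratio = 8

env_steps_per_iter = batch_B * batch_T                             # 32 env steps per sampler iteration
updates_per_iter = replay_ratio * env_steps_per_iter / batch_size  # 8 * 32 / 256 = 1.0 update
env_steps_per_update = env_steps_per_iter / updates_per_iter       # 32.0 env steps per update

print(312 * env_steps_per_update)  # 9984.0 -- approximately 1e4 env steps
```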