The code above performs one update on a sampled minibatch. To find the update frequency, you can go to IQL.backward() in agents/models.py. Under the default setting, once we collect a minibatch of experience (size n_step), we update IQL 10 times. You are right: for a fair comparison, we should use the same update frequency as MA2C/IA2C, that is, once per minibatch. However, that would throw away the advantage of off-policy learning, so I made it 10x. As an alternative, we could keep the same update frequency but make the IQL baseline stronger with DDPG-like updates.
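For concreteness, here is a minimal sketch of that schedule, written in PyTorch rather than the repo's own code: after each rollout of n_step transitions, the Q-net is trained `updates_per_rollout` times, each time on a fresh minibatch sampled from the replay buffer. All names (q_net, replay_buffer, the placeholder environment, the dimensions) are illustrative assumptions, not the actual identifiers in agents/models.py.

```python
import random
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4
n_step, batch_size, gamma = 120, 64, 0.99
updates_per_rollout = 10  # the "10x" factor discussed above

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = []  # stores (s, a, r, s_next, done) tuples

def q_update():
    """One gradient step on a minibatch sampled from the replay buffer."""
    batch = random.sample(replay_buffer, batch_size)
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch])
    s_next = torch.stack([b[3] for b in batch])
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Collect a rollout of n_step transitions, then run 10 sampled-minibatch updates.
for rollout in range(100):
    for _ in range(n_step):
        s = torch.randn(obs_dim)                             # placeholder observation
        a = int(torch.randint(n_actions, (1,)))              # placeholder action
        r, s_next, done = 0.0, torch.randn(obs_dim), False   # placeholder env step
        replay_buffer.append((s, a, r, s_next, done))
    if len(replay_buffer) >= batch_size:
        for _ in range(updates_per_rollout):                 # 10 updates per rollout
            q_update()
```

Changing `updates_per_rollout` from 10 to 1 would give the "once per minibatch" schedule used by MA2C/IA2C.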
Hi, thank you for your reply! Do we need to keep the parameters of the target network frozen and update it more slowly than the Q-network? Also, I am still confused about why we can update IQL 10 times at each step. Why choose 10, and is there any difference between updating once and updating 10 times? Thank you!
Q-learning is off-policy, so we can update the Q-net any number of times: each update is based on a minibatch sampled from the replay buffer rather than on the most recently collected one. More updates mean more learning (exploitation) from the experience gathered so far. Only the original DQN is implemented here, so a single Q-net serves as both the target and the behavior policy (and freezing it may hurt exploration efficiency). You would need to modify the code to implement a DDPG-like update, maintaining both a behavior Q-net and a less frequently updated target Q-net, plus additional behavior and target policy-nets if needed.

However, as pointed out in the paper, off-policy learning cannot capture the time-variant transitions, especially under partial observability. This is why on-policy learning like A2C is used. On the other hand, investigating improvements to off-policy learning would be an interesting future direction. For example, n-step TD based Q-learning [1] may be a good starting point, but you may need to extend it to the MARL setting.

[1] Nachum, Ofir, et al. "Bridging the gap between value and policy based reinforcement learning." Advances in Neural Information Processing Systems. 2017.
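To illustrate what a DDPG-like modification could look like, here is a small sketch (again in PyTorch with made-up names, not the repo's code) of a separate target Q-net whose weights track the behavior Q-net through a slow Polyak (soft) update, so the bootstrap target changes more smoothly than the behavior net.

```python
import copy
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, tau = 8, 4, 0.99, 0.005

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_q_net = copy.deepcopy(q_net)
for p in target_q_net.parameters():
    p.requires_grad_(False)          # the target net is never trained directly
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(s, a, r, s_next, done):
    """TD error where the bootstrap target comes from the slow target network."""
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q, target)

def soft_update():
    """Polyak averaging: target <- tau * behavior + (1 - tau) * target."""
    with torch.no_grad():
        for p_t, p in zip(target_q_net.parameters(), q_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

# One illustrative gradient step on random data, followed by a soft update.
s, s_next = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
a = torch.randint(n_actions, (32,))
r, done = torch.zeros(32), torch.zeros(32)
loss = td_loss(s, a, r, s_next, done)
optimizer.zero_grad()
loss.backward()
optimizer.step()
soft_update()
```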
Thank you for your patience and insights. I may try to extend the method to a DDPG-like algorithm and see how much it improves.
Hi, it seems the target network is updated every step along with the Q-network in the IQL implementation, as shown in policies.py.
Should we try to update the target network only every N (e.g., 1000) steps? Thanks!
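In other words, something like the following hard-update schedule (a hypothetical PyTorch sketch with illustrative names, not the repo's TensorFlow code), where the behavior Q-net is copied into the target Q-net only once every `target_update_interval` steps instead of every step:

```python
import copy
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4
target_update_interval = 1000   # the "every N steps" asked about above

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_q_net = copy.deepcopy(q_net)

for step in range(10000):
    # ... collect experience and run the usual minibatch Q-update on q_net ...
    if step % target_update_interval == 0:
        target_q_net.load_state_dict(q_net.state_dict())  # hard sync every N steps
```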