Closed: Da-Capo closed this issue 4 years ago
Hey Peter, thanks for your interest!
You're right: in both Polybeast and Monobeast, we update the actor policy "hogwild" style, i.e., the actors pick up new weights whenever a learner step has happened, including in the middle of a rollout. Think of it as an additional exploration feature :).
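For illustration, here is a minimal single-process sketch of what that means: an actor thread samples actions from a model whose parameters a learner thread is mutating concurrently. This is not torchbeast's actual code; `SharedPolicy`, `actor_loop`, and `learner_loop` are hypothetical names, and the loss is a placeholder.

```python
import threading

import torch
import torch.nn as nn

class SharedPolicy(nn.Module):
    """Tiny stand-in policy network (hypothetical, for illustration only)."""
    def __init__(self, obs_dim=4, num_actions=2):
        super().__init__()
        self.net = nn.Linear(obs_dim, num_actions)

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

model = SharedPolicy()
model.share_memory()  # monobeast does this so actor processes share weights

def actor_loop(steps=100):
    obs = torch.zeros(4)
    for _ in range(steps):
        with torch.no_grad():
            # Reads whatever weights the learner has written by now, so the
            # policy can change in the middle of this "rollout".
            action = model(obs).sample()
        # ...here one would step the environment and record the transition...

def learner_loop(steps=100):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    target = torch.tensor(0)
    for _ in range(steps):
        loss = -model(torch.zeros(4)).log_prob(target)  # placeholder loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # mutates the weights the actor thread is reading

threads = [threading.Thread(target=actor_loop), threading.Thread(target=learner_loop)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```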
Also, do feel free to change and play around with that logic in your copy of the codebase. Note that to keep the policy fixed during rollout generation, one would, at least naively, need to keep `num_actors`-many copies of the model around; see the sketch below.
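A hedged sketch of that alternative: freeze the policy for the duration of one unroll by giving each actor a private snapshot of the shared weights. `shared_model`, `local_model`, and `env` are hypothetical arguments; torchbeast does not ship this logic.

```python
import torch

def actor_rollout(shared_model, local_model, env, unroll_length):
    """Generate one unroll under a policy that stays fixed throughout."""
    # Snapshot the shared weights once, at rollout start. With num_actors
    # actors doing this, num_actors extra copies of the model exist.
    local_model.load_state_dict(shared_model.state_dict())
    obs = env.reset()
    trajectory = []
    for _ in range(unroll_length):
        with torch.no_grad():
            action = local_model(obs).sample()  # fixed policy for the whole unroll
        obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward, done))
        if done:
            obs = env.reset()
    return trajectory
```

The cost of the snapshot is one extra copy of the weights per actor, which is exactly the memory overhead mentioned above.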
By the way, even the original IMPALA implementation does this hogwild update in its "single-machine" training setup; see the comment at https://github.com/deepmind/scalable_agent/blob/master/experiment.py#L507. In TorchBeast, our aim is to implement an IMPALA-like agent, not necessarily to reproduce the IMPALA paper to the letter. That said, I don't believe updating the actor policy this way hurts training performance at all.
The IMPALA paper seems to say that the behavior policy should stay the same within one trajectory. Does that mean we should keep a single policy for a whole rollout? It looks to me as if the implementation here may update the policy while sampling, so a single trajectory could be generated under different policies: https://github.com/facebookresearch/torchbeast/blob/ddeec0174fe651649f27452fb684aa7745d3f00d/torchbeast/monobeast.py#L162