facebookresearch / torchbeast

A PyTorch Platform for Distributed RL

Should we keep the same policy in one trajectory? #8

Closed Da-Capo closed 4 years ago

Da-Capo commented 4 years ago

The IMPALA paper says:

> At the beginning of each trajectory, an actor updates its own local policy µ to the latest learner policy π and runs it for n steps in its environment.

Does this mean we should keep the same policy throughout one trajectory? I think the implementation here may update the policy while sampling, which would mean a trajectory is generated under different policies: https://github.com/facebookresearch/torchbeast/blob/ddeec0174fe651649f27452fb684aa7745d3f00d/torchbeast/monobeast.py#L162

heiner commented 4 years ago

Hey Peter, thanks for your interest!

You're right, in both Polybeast and Monobeast we update the actor policy "hogwild", i.e., whenever a learning step completes, including in the middle of a rollout. Think of it as an additional exploration feature :).
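Here is a minimal, runnable sketch of what that means in practice. The names and shapes below are illustrative assumptions, not the actual monobeast API: the actor policy lives in shared memory, and after every optimization step the learner copies its weights into it, so an unroll can straddle an update.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the hogwild-style update (assumed names, not the
# actual monobeast code): the actor model is shared, and the learner
# overwrites its weights after every optimization step, even mid-unroll.

actor_model = nn.Linear(4, 2)        # stand-in for the shared actor policy
actor_model.share_memory()           # visible to all actor processes

learner_model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(learner_model.parameters(), lr=0.1)

def learner_step(batch_obs):
    """One learner update; actors see the new weights immediately."""
    loss = learner_model(batch_obs).pow(2).mean()   # stand-in for the V-trace loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The "hogwild" part: publish the new weights, even mid-rollout.
    actor_model.load_state_dict(learner_model.state_dict())

def actor_unroll(n_steps=5):
    """An unroll may straddle a learner update, so its policy is not fixed."""
    actions = []
    for t in range(n_steps):
        obs = torch.randn(1, 4)                     # stand-in for an env observation
        with torch.no_grad():
            logits = actor_model(obs)               # may use newer weights than at step t-1
        actions.append(logits.argmax(dim=-1).item())
        if t == 2:
            learner_step(torch.randn(8, 4))         # an update landing mid-unroll
    return actions

print(actor_unroll())
```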

Also, do feel free to change and play around with that logic in your copy of the codebase. Note that to keep the policy fixed during rollout generation, one would, at least naively, need to keep num_actors copies of the model around; a sketch of that is below.
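For concreteness, here is a hypothetical sketch (again with assumed names, not the actual monobeast code) of the naive fixed-policy variant: each actor snapshots the shared model at the start of its unroll, which is where the num_actors extra model copies come from.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical sketch of keeping the behaviour policy fixed for a whole
# unroll, as the IMPALA paper describes. Each actor takes its own snapshot
# of the shared model at the start of the unroll.

shared_model = nn.Linear(4, 2)       # stand-in for the shared, hogwild-updated model
shared_model.share_memory()

def actor_unroll_fixed_policy(unroll_length=5):
    local_model = copy.deepcopy(shared_model)   # µ := π, frozen for this unroll
    local_model.eval()
    actions = []
    for _ in range(unroll_length):
        obs = torch.randn(1, 4)                 # stand-in for an env observation
        with torch.no_grad():
            logits = local_model(obs)           # behaviour policy stays µ even if the
                                                # learner updates shared_model meanwhile
        actions.append(logits.argmax(dim=-1).item())
    return actions

print(actor_unroll_fixed_policy())
```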

By the way, even the original IMPALA implementation does this hogwild update in its "single-machine" training setup; see the comment at https://github.com/deepmind/scalable_agent/blob/master/experiment.py#L507. In TorchBeast, our aim is to implement an IMPALA-like agent, not necessarily to reproduce exactly what the IMPALA paper describes. That said, I don't believe updating the actor policy this way hurts training performance at all.