facebookresearch / torchbeast

A PyTorch Platform for Distributed RL

Should we keep the same policy in one trajectory? #8

Closed Da-Capo closed 4 years ago

Da-Capo commented 4 years ago

The IMPALA paper says:

> At the beginning of each trajectory, an actor updates its own local policy µ to the latest learner policy π and runs it for n steps in its environment.

Does this mean we should keep the same policy throughout one trajectory? I think the implementation here may update the policy while sampling, which would mean a trajectory is generated under different policies: https://github.com/facebookresearch/torchbeast/blob/ddeec0174fe651649f27452fb684aa7745d3f00d/torchbeast/monobeast.py#L162

heiner commented 4 years ago

Hey Peter, thanks for your interest!

You're right, in both Polybeast and Monobeast we update the actor policy "hogwild", i.e., whenever a learning step completes, including in the middle of a rollout. Think of it as an additional exploration feature :).
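Here is a minimal, runnable sketch of what that means in practice. The names and shapes below are illustrative assumptions, not the actual monobeast API: the actor policy lives in shared memory, and after every optimization step the learner copies its weights into it, so an unroll can straddle an update.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the hogwild-style update (assumed names, not the
# actual monobeast code): the actor model is shared, and the learner
# overwrites its weights after every optimization step, even mid-unroll.

actor_model = nn.Linear(4, 2)        # stand-in for the shared actor policy
actor_model.share_memory()           # visible to all actor processes

learner_model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(learner_model.parameters(), lr=0.1)

def learner_step(batch_obs):
    """One learner update; actors see the new weights immediately."""
    loss = learner_model(batch_obs).pow(2).mean()   # stand-in for the V-trace loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The "hogwild" part: publish the new weights, even mid-rollout.
    actor_model.load_state_dict(learner_model.state_dict())

def actor_unroll(n_steps=5):
    """An unroll may straddle a learner update, so its policy is not fixed."""
    actions = []
    for t in range(n_steps):
        obs = torch.randn(1, 4)                     # stand-in for an env observation
        with torch.no_grad():
            logits = actor_model(obs)               # may use newer weights than at step t-1
        actions.append(logits.argmax(dim=-1).item())
        if t == 2:
            learner_step(torch.randn(8, 4))         # an update landing mid-unroll
    return actions

print(actor_unroll())
```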

Also, do feel free to change and play around with that logic in your copy of the codebase. Note that to keep the policy fixed during rollout generation, one would, at least naively, need to keep num_actors copies of the model around; a sketch of that is below.
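For concreteness, here is a hypothetical sketch (again with assumed names, not the actual monobeast code) of the naive fixed-policy variant: each actor snapshots the shared model at the start of its unroll, which is where the num_actors extra model copies come from.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical sketch of keeping the behaviour policy fixed for a whole
# unroll, as the IMPALA paper describes. Each actor takes its own snapshot
# of the shared model at the start of the unroll.

shared_model = nn.Linear(4, 2)       # stand-in for the shared, hogwild-updated model
shared_model.share_memory()

def actor_unroll_fixed_policy(unroll_length=5):
    local_model = copy.deepcopy(shared_model)   # µ := π, frozen for this unroll
    local_model.eval()
    actions = []
    for _ in range(unroll_length):
        obs = torch.randn(1, 4)                 # stand-in for an env observation
        with torch.no_grad():
            logits = local_model(obs)           # behaviour policy stays µ even if the
                                                # learner updates shared_model meanwhile
        actions.append(logits.argmax(dim=-1).item())
    return actions

print(actor_unroll_fixed_policy())
```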

By the way, even the original IMPALA implementation does this hogwild update in its "single-machine" training setup; see the comment at https://github.com/deepmind/scalable_agent/blob/master/experiment.py#L507. In TorchBeast, our aim is to implement an IMPALA-like agent, not necessarily to reproduce exactly what the IMPALA paper describes. That said, I don't believe updating the actor policy this way hurts training performance at all.