facebookresearch / torchbeast

A PyTorch Platform for Distributed RL
Apache License 2.0

Should we update the ValueNet and PolicyNet with different losses? #12

Closed YuhwaChoong closed 4 years ago

YuhwaChoong commented 4 years ago

In the original IMPALA paper, the state-value estimate and the action come from the same network, and that network is updated with the sum of three losses (policy gradient, baseline, and entropy), which is unusual for actor-critic algorithms.

The AtariNet in monobeast uses a baseline net and a policy net to estimate the state value and to output the action separately. So should we instead update the baseline net with the baseline loss and the policy net with the policy gradient loss separately, in the usual actor-critic way?
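For reference, here is a minimal, self-contained sketch of the single-network update the question refers to: all three losses are summed into one scalar and backpropagated together. The function and argument names are illustrative, not the repo's exact code; the cost coefficients are placeholders.

```python
import torch
import torch.nn.functional as F

def impala_total_loss(policy_logits, baseline, actions, vs, pg_advantages,
                      baseline_cost=0.5, entropy_cost=0.01):
    """Combine policy-gradient, baseline, and entropy losses into one scalar.

    Shapes: policy_logits (T, B, A), baseline (T, B), actions (T, B) long,
    vs / pg_advantages (T, B) v-trace targets and advantages.
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    chosen_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Policy-gradient loss: advantages are treated as constants (detached).
    pg_loss = -(pg_advantages.detach() * chosen_log_probs).sum()

    # Baseline loss: regress the value head toward the v-trace targets vs.
    baseline_loss = 0.5 * ((vs.detach() - baseline) ** 2).sum()

    # Entropy bonus to encourage exploration (subtracted from the loss).
    probs = F.softmax(policy_logits, dim=-1)
    entropy = -(probs * log_probs).sum(-1).sum()

    return pg_loss + baseline_cost * baseline_loss - entropy_cost * entropy
```

A single `total_loss.backward()` on this sum then updates all parameters in one optimizer step, which is what the IMPALA paper describes.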

YuhwaChoong commented 4 years ago

Sorry for the issue. I've found that the v-trace advantage is detached while computing pg_loss and the entropy loss.
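That detach is the reason the summed loss behaves like separate actor and critic updates: since the advantage carries no gradient, the policy-gradient loss cannot push gradients into the baseline parameters, and the baseline loss only touches the value path. A toy sketch (hypothetical module names, not torchbeast code) to illustrate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup: shared torso with separate policy and baseline heads.
torso = nn.Linear(4, 8)
policy_head = nn.Linear(8, 3)
baseline_head = nn.Linear(8, 1)

obs = torch.randn(5, 4)
actions = torch.randint(0, 3, (5,))
core = torch.relu(torso(obs))
logits = policy_head(core)
baseline = baseline_head(core).squeeze(-1)

# Stand-in for the v-trace pg advantages; they depend on the baseline but are
# detached, so no gradient flows back through them.
pg_advantages = (torch.randn(5) - baseline).detach()

chosen_log_probs = F.log_softmax(logits, dim=-1).gather(
    -1, actions.unsqueeze(-1)).squeeze(-1)
pg_loss = -(pg_advantages * chosen_log_probs).sum()
pg_loss.backward()

# The baseline head receives no gradient from the policy-gradient loss.
print(baseline_head.weight.grad)  # None
print(policy_head.weight.grad is not None)  # True
```

So summing the losses and calling backward once gives the same per-head gradients as updating the two heads with their own losses; only the shared torso receives gradients from both.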