In the original IMPALA paper, the state-value estimate and the action were outputs of the same network, and the network was updated with the sum of the three losses, which is unusual for an actor-critic algorithm.
The AtariNet in monobeast uses a baseline net and a policy net to estimate the state value and output the action separately. So should we update the baseline net with the baseline loss and the policy net with the policy-gradient loss separately, in the usual actor-critic way?
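For what it's worth, the two schemes only differ through parameters that the policy and baseline heads share (e.g. a common trunk); for the head-specific weights they coincide. A toy sketch in plain Python (made-up scalar stand-in losses, not monobeast code) of how a summed-loss update and sequential per-loss updates diverge on a shared parameter:

```python
# Hypothetical scalar "shared trunk" parameter w with two stand-in losses:
# a policy-gradient-like loss (w*x - a)^2 and a baseline loss (w*x - v)^2.

def grad_pg(w, x, target):
    # d/dw of the stand-in policy loss (w*x - target)^2
    return 2 * (w * x - target) * x

def grad_baseline(w, x, value_target):
    # d/dw of the stand-in baseline (value) loss (w*x - value_target)^2
    return 2 * (w * x - value_target) * x

w = 0.5                      # shared parameter
x, a, v = 1.0, 2.0, 1.5      # input and the two targets
lr, baseline_cost = 0.1, 0.5 # learning rate and baseline loss weight

# IMPALA-style: one step on the summed loss (gradients add up)
g_sum = grad_pg(w, x, a) + baseline_cost * grad_baseline(w, x, v)
w_combined = w - lr * g_sum

# "Separate" actor-critic-style updates applied sequentially to the SAME
# shared parameter: the second loss sees the already-updated value, so the
# result differs from the summed-loss step.
w_sep = w - lr * grad_pg(w, x, a)
w_sep = w_sep - lr * baseline_cost * grad_baseline(w_sep, x, v)

print(w_combined, w_sep)  # → 0.9 vs 0.87
```

For parameters that only one head touches, the cross-gradient is zero and both schemes give the identical step, which is why summing the losses into a single backward pass is the common (and cheaper) choice when the heads share a trunk.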