Closed mctigger closed 2 years ago
It's because the time-step alignment in this code base is ASR (action, state, reward): the action at a given index in the trajectory leads to the state and reward at that same index. The example in the comment at the beginning of actor_loss()
may be helpful. Index 0 of the trajectory contains a zero action and the start state of the imagination rollout. That state is not influenced by any of the imagined actions, so it is not included in the loss. It also means that the value at index 1 depends on the first imagined action, so the corresponding baseline should be the value at index 0. That's why the baseline starts from index 0 but the target from index 1.
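The index alignment described above can be illustrated with a minimal numpy sketch. The rewards, values, and one-step target used here are hypothetical stand-ins (no lambda-return, discount of 1) chosen only to show which indices pair up:

```python
import numpy as np

# Hypothetical ASR-aligned imagination rollout of horizon H = 4.
# Index 0 holds the start state with a zero action; the action at
# index i (i >= 1) leads to the state and reward at the same index.
reward = np.array([0.0, 1.0, 0.5, 2.0])   # reward[0] belongs to the start state
value  = np.array([3.0, 2.5, 2.0, 1.0])   # critic value of each imagined state

# Simplified one-step target: target depends on the imagined actions,
# so it starts at index 1 (the first state an imagined action produced).
target = reward[1:] + value[1:]

# The state at index i+1 results from the action taken in state i, so
# the baseline for target at index i+1 is the value at index i:
# baseline covers indices 0..H-2 while the target covers 1..H-1.
baseline = value[:-1]

advantage = target - baseline
print(advantage)  # one advantage per imagined action: [0.5 0.  1. ]
```

The one-index shift between `target` and `baseline` is exactly the offset the question below asks about.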
Hi Danijar, the critic loss is calculated without an offset, identical to how it is stated in the paper. https://github.com/danijar/dreamerv2/blob/52fc568f46d25421fbdd4daf75fddd6feabca8d4/dreamerv2/agent.py#L299-L302
However, for the actor loss there is an offset of 1 (the first target is skipped). Could you explain why this is the case? https://github.com/danijar/dreamerv2/blob/52fc568f46d25421fbdd4daf75fddd6feabca8d4/dreamerv2/agent.py#L272
This is how I imagine the advantage should be calculated (simplified, without the lambda-target). Here s_t is the agent's current state, and r_t is the reward received in that state; r_t should be ignored, since we are already in the state.
A = target(s_t) - baseline(s_t)
  = (r_t + r_{t+1} + E[r_{t+2} + ...]) - (r_t + E[r_{t+1} + r_{t+2} + ...])
  = (r_{t+1} + E[r_{t+2} + ...]) - E[r_{t+1} + r_{t+2} + ...]
  = Q(s_t, a_t) - V(s_t)
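The cancellation of r_t in this derivation can be checked numerically. This sketch uses arbitrary sampled numbers as stand-ins for the expected future returns (there is no model here, only the algebra of the advantage):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quantities standing in for the terms of the derivation.
r_t = rng.normal()        # reward of the current state s_t
future_q = rng.normal()   # r_{t+1} + E[r_{t+2} + ...] when taking a_t
future_v = rng.normal()   # E[r_{t+1} + r_{t+2} + ...] under the policy

target = r_t + future_q       # target(s_t)   = r_t + r_{t+1} + E[r_{t+2} + ...]
baseline = r_t + future_v     # baseline(s_t) = r_t + E[r_{t+1} + r_{t+2} + ...]

# r_t appears in both terms, so it cancels in the advantage:
advantage = target - baseline
assert np.isclose(advantage, future_q - future_v)
```

The concern in the question is that if target and baseline are offset by one index, the two r_t terms no longer refer to the same reward and this cancellation does not happen.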
If I understand your code correctly, then as a result of the offset the reward r_t does not cancel, so wouldn't the advantage be wrong?