danijar / dreamerv2

Mastering Atari with Discrete World Models
https://danijar.com/dreamerv2
MIT License

Offsets in actor loss calculation #27

Closed by mctigger 2 years ago

mctigger commented 2 years ago

Hi Danijar, the critic loss is calculated without any offset, exactly as it is stated in the paper. https://github.com/danijar/dreamerv2/blob/52fc568f46d25421fbdd4daf75fddd6feabca8d4/dreamerv2/agent.py#L299-L302

However, for the actor loss there is an offset of 1 (the first target is skipped). Could you explain why this is the case? https://github.com/danijar/dreamerv2/blob/52fc568f46d25421fbdd4daf75fddd6feabca8d4/dreamerv2/agent.py#L272
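To make sure I'm reading the linked lines right, here is roughly what I mean (my own toy sketch with made-up names and numbers, not the actual code from agent.py):

import numpy as np

# Toy stand-ins for one imagined trajectory (made-up numbers).
target = np.array([10.0, 11.0, 12.0, 13.0])  # lambda-returns per step
value  = np.array([ 9.5, 10.5, 11.5, 12.5])  # critic predictions per step

# Critic loss: targets and values are paired at the same index, no offset,
# as in the paper.
critic_error = target - value

# Actor loss: the first target is dropped before the advantage is formed,
# i.e. target[1:] is used while the baseline still starts at index 0.
actor_advantage = target[1:] - value[:-1]

print(critic_error)
print(actor_advantage)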

This is how I imagine the advantage should be calculated (simplified, without the lambda-target). Here s_t is the agent's current state and r_t is the reward received at that state; r_t should drop out, since we are already in that state.

A = target(s_t) - baseline(s_t)
  = (r_t + r_{t+1} + E[r_{t+2} + ...]) - (r_t + E[r_{t+1} + r_{t+2} + ...])
  = (r_{t+1} + E[r_{t+2} + ...]) - E[r_{t+1} + r_{t+2} + ...]
  = Q(a_t, s_t) - V(s_t)

If I understand your code correctly, the offset means that the reward r_t does not cancel, so the advantage would be wrong?
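To make the cancellation explicit, here is a small symbolic version of the argument above (my own illustration using sympy; the expectation terms are just opaque symbols, nothing from the codebase):

import sympy as sp

# Symbols for the terms in the derivation above (no lambda-target).
r_t  = sp.Symbol('r_t')
r_t1 = sp.Symbol('r_{t+1}')
r_t2 = sp.Symbol('r_{t+2}')
E_t1 = sp.Symbol('E[r_{t+1}+...]')  # E[r_{t+1} + r_{t+2} + ...]
E_t2 = sp.Symbol('E[r_{t+2}+...]')  # E[r_{t+2} + ...]
E_t3 = sp.Symbol('E[r_{t+3}+...]')  # E[r_{t+3} + ...]

target_t   = r_t + r_t1 + E_t2      # target(s_t)
baseline_t = r_t + E_t1             # V(s_t); note it also contains r_t

# Aligned case: r_t cancels, leaving Q(a_t, s_t) - V(s_t).
print(target_t - baseline_t)

# Off-by-one case: the target of s_{t+1} against the baseline of s_t.
target_t1 = r_t1 + r_t2 + E_t3
print(target_t1 - baseline_t)       # a stray -r_t remains, nothing cancels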

danijar commented 2 years ago

It's because the time step alignment is ASR in this code base, i.e. the action of a certain index in the trajectory leads to the state and reward at the same index. The example in the comment at the beginning of actor_loss() may be helpful. The index 0 of the trajectory contains zeros for the action and the start state of the imagination rollout as state. This state is not influenced by any of the imagined actions, so it's not included in the loss. This also means that the value at index 1 depends on the first imagined action, so the corresponding baseline should be the value at index 0. That's why the baseline starts from index 0 but the target from index 1.