Closed mctigger closed 2 years ago
It's because the time-step alignment in this code base is ASR (action, state, reward): the action at a given index in the trajectory leads to the state and reward at that same index. The example in the comment at the beginning of actor_loss()
may be helpful. Index 0 of the trajectory contains a zero action and the start state of the imagination rollout. That state is not influenced by any of the imagined actions, so it is not included in the loss. It also means that the value at index 1 depends on the first imagined action, so the corresponding baseline should be the value at index 0. That's why the baseline starts from index 0 but the target from index 1.
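The index alignment described above can be illustrated with a minimal numpy sketch. The rewards, values, and one-step target used here are hypothetical stand-ins (no lambda-return, discount of 1) chosen only to show which indices pair up:

```python
import numpy as np

# Hypothetical ASR-aligned imagination rollout of horizon H = 4.
# Index 0 holds the start state with a zero action; the action at
# index i (i >= 1) leads to the state and reward at the same index.
reward = np.array([0.0, 1.0, 0.5, 2.0])   # reward[0] belongs to the start state
value  = np.array([3.0, 2.5, 2.0, 1.0])   # critic value of each imagined state

# Simplified one-step target: target depends on the imagined actions,
# so it starts at index 1 (the first state an imagined action produced).
target = reward[1:] + value[1:]

# The state at index i+1 results from the action taken in state i, so
# the baseline for target at index i+1 is the value at index i:
# baseline covers indices 0..H-2 while the target covers 1..H-1.
baseline = value[:-1]

advantage = target - baseline
print(advantage)  # one advantage per imagined action: [0.5 0.  1. ]
```

The one-index shift between `target` and `baseline` is exactly the offset the question below asks about.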
Hi Danijar, the critic loss is calculated without an offset, identical to how it is stated in the paper. https://github.com/danijar/dreamerv2/blob/52fc568f46d25421fbdd4daf75fddd6feabca8d4/dreamerv2/agent.py#L299-L302
However, for the actor loss there is an offset of 1 (the first target is skipped). Could you explain why this is the case? https://github.com/danijar/dreamerv2/blob/52fc568f46d25421fbdd4daf75fddd6feabca8d4/dreamerv2/agent.py#L272
This is how I imagine the advantage should be calculated (simplified, without the lambda-target). Here s_t is the agent's current state, and r_t is the reward received in that state; r_t should be ignored, since we are already in the state.
A = target(s_t) - baseline(s_t)
  = (r_t + r_{t+1} + E[r_{t+2} + ...]) - (r_t + E[r_{t+1} + r_{t+2} + ...])
  = (r_{t+1} + E[r_{t+2} + ...]) - E[r_{t+1} + r_{t+2} + ...]
  = Q(s_t, a_t) - V(s_t)
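The cancellation of r_t in this derivation can be checked numerically. This sketch uses arbitrary sampled numbers as stand-ins for the expected future returns (there is no model here, only the algebra of the advantage):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quantities standing in for the terms of the derivation.
r_t = rng.normal()        # reward of the current state s_t
future_q = rng.normal()   # r_{t+1} + E[r_{t+2} + ...] when taking a_t
future_v = rng.normal()   # E[r_{t+1} + r_{t+2} + ...] under the policy

target = r_t + future_q       # target(s_t)   = r_t + r_{t+1} + E[r_{t+2} + ...]
baseline = r_t + future_v     # baseline(s_t) = r_t + E[r_{t+1} + r_{t+2} + ...]

# r_t appears in both terms, so it cancels in the advantage:
advantage = target - baseline
assert np.isclose(advantage, future_q - future_v)
```

The concern in the question is that if target and baseline are offset by one index, the two r_t terms no longer refer to the same reward and this cancellation does not happen.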
If I understand your code correctly, then as a result of the offset the reward r_t does not cancel, so wouldn't the advantage be wrong?