leeacord opened this issue 3 months ago
Hi,
I have a question regarding the implementation of the advantage calculation. The code snippet is as follows: https://github.com/danijar/dreamerv2/blob/07d906e9c4322c6fc2cd6ed23e247ccd6b7c8c41/dreamerv2/agent.py#L252-L274
```python
advantage = tf.stop_gradient(target[1:] - self._target_critic(seq['feat'][:-2]).mode())
```
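For context, my mental model of the lambda-return indexing is roughly the following sketch (my own paraphrase of the index arithmetic, not the actual `common.lambda_return` implementation):

```python
import numpy as np

# Rough paraphrase of a lambda return over an imagined rollout; this shows the
# indexing only and is not the actual common.lambda_return code.
def lambda_return(reward, value, disc, bootstrap, lambda_):
    """reward/value/disc cover steps 0..horizon-1; bootstrap is the value at step horizon."""
    next_values = np.concatenate([value[1:], [bootstrap]])  # v at steps 1..horizon
    returns = np.zeros_like(reward)
    last = bootstrap
    for t in reversed(range(len(reward))):
        # V_lambda[t] = r[t] + disc[t] * ((1 - lambda) * v[t+1] + lambda * V_lambda[t+1])
        last = reward[t] + disc[t] * ((1 - lambda_) * next_values[t] + lambda_ * last)
        returns[t] = last
    return returns  # one entry per step 0..horizon-1
```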
Based on my understanding:
- `seq['feat']` contains time steps from `0` to `horizon`.
- `target` contains time steps from `0` to `horizon-1`, since the value at the last step is used as a bootstrap for `lambda_return`.
- `baseline` in line 271 includes time steps from `0` to `horizon-2`, and `target[1:]` includes time steps from `1` to `horizon-1`.
If I understand correctly, the code uses $V_{t+1}^{\lambda} - v_\xi(\hat{z}_t)$ to calculate the advantage, not $V_t^{\lambda} - v_\xi(\hat{z}_t)$ as stated in the paper?
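To make the offset concrete, here is a toy check with placeholder arrays (the `horizon` value and the numbers are made up purely for illustration):

```python
import numpy as np

horizon = 5
value = 10.0 * np.arange(horizon + 1)     # stand-in for the critic values, steps 0..horizon
target = np.arange(horizon, dtype=float)  # stand-in for the lambda returns, steps 0..horizon-1

baseline = value[:-2]              # steps 0..horizon-2, like self._target_critic(seq['feat'][:-2]).mode()
advantage = target[1:] - baseline  # target at steps 1..horizon-1 minus baseline at steps 0..horizon-2

# Each advantage entry pairs the lambda return at step t+1 with the value at step t:
for t in range(horizon - 1):
    assert advantage[t] == target[t + 1] - value[t]
```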
I am very impressed with the Dreamer series of reinforcement learning algorithms. Thank you for your hard work!