danijar / dreamerv2

Mastering Atari with Discrete World Models
https://danijar.com/dreamerv2
MIT License

Question about advantage calculation #60

Open leeacord opened 3 months ago

leeacord commented 3 months ago

Hi,

I have a question regarding the implementation of the advantage calculation. The code snippet is as follows: https://github.com/danijar/dreamerv2/blob/07d906e9c4322c6fc2cd6ed23e247ccd6b7c8c41/dreamerv2/agent.py#L252-L274

```python
advantage = tf.stop_gradient(target[1:] - self._target_critic(seq['feat'][:-2]).mode())
```

If I understand correctly, the code uses $V_{t+1}^{\lambda} - v_\xi(\hat{z}_t)$ to calculate the advantage, not $V_t^{\lambda} - v_\xi(\hat{z}_t)$ as stated in the paper?
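To make the indexing question concrete, here is a minimal NumPy sketch (not the repo's code; the values are made up) showing which critic value each variant pairs with each lambda-return:

```python
import numpy as np

H = 5  # imagination horizon (hypothetical)
# Stand-in critic values v(z_0 .. z_{H-1})
values = np.arange(H, dtype=float)
# Stand-in lambda-returns V^lambda_0 .. V^lambda_{H-2}
target = values[:-1] + 0.5

# As in the quoted line: target[1:] pairs V^lambda_{t+1} with v(z_t)
adv_shifted = target[1:] - values[:-2]

# As written in the paper: V^lambda_t minus v(z_t)
adv_paper = target[:-1] - values[:-2]
```

Both variants have the same shape, so the difference is purely which lambda-return index is subtracted from $v_\xi(\hat{z}_t)$.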

I am very impressed with the Dreamer series of reinforcement learning algorithms. Thank you for your hard work!