leeacord opened this issue 3 months ago
Hi,
I have a question regarding the implementation of the advantage calculation. The code snippet is as follows: https://github.com/danijar/dreamerv2/blob/07d906e9c4322c6fc2cd6ed23e247ccd6b7c8c41/dreamerv2/agent.py#L252-L274
```python
advantage = tf.stop_gradient(target[1:] - self._target_critic(seq['feat'][:-2]).mode())
```
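For context, my mental model of the lambda-return indexing is roughly the following sketch (my own paraphrase of the index arithmetic, not the actual `common.lambda_return` implementation):

```python
import numpy as np

# Rough paraphrase of a lambda return over an imagined rollout; this shows the
# indexing only and is not the actual common.lambda_return code.
def lambda_return(reward, value, disc, bootstrap, lambda_):
    """reward/value/disc cover steps 0..horizon-1; bootstrap is the value at step horizon."""
    next_values = np.concatenate([value[1:], [bootstrap]])  # v at steps 1..horizon
    returns = np.zeros_like(reward)
    last = bootstrap
    for t in reversed(range(len(reward))):
        # V_lambda[t] = r[t] + disc[t] * ((1 - lambda) * v[t+1] + lambda * V_lambda[t+1])
        last = reward[t] + disc[t] * ((1 - lambda_) * next_values[t] + lambda_ * last)
        returns[t] = last
    return returns  # one entry per step 0..horizon-1
```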
Based on my understanding:
- `seq['feat']` contains time steps from `0` to `horizon`.
- `target` contains time steps from `0` to `horizon-1`, since the value at the last step is used as a bootstrap for `lambda_return`.
- `baseline` in line 271 includes time steps from `0` to `horizon-2`, and `target[1:]` includes time steps from `1` to `horizon-1`.
If I understand correctly, the code uses $V_{t+1}^{\lambda} - v_\xi(\hat{z}_t)$ to calculate the advantage, not $V_t^{\lambda} - v_\xi(\hat{z}_t)$ as stated in the paper?
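To make the offset concrete, here is a toy check with placeholder arrays (the `horizon` value and the numbers are made up purely for illustration):

```python
import numpy as np

horizon = 5
value = 10.0 * np.arange(horizon + 1)     # stand-in for the critic values, steps 0..horizon
target = np.arange(horizon, dtype=float)  # stand-in for the lambda returns, steps 0..horizon-1

baseline = value[:-2]              # steps 0..horizon-2, like self._target_critic(seq['feat'][:-2]).mode()
advantage = target[1:] - baseline  # target at steps 1..horizon-1 minus baseline at steps 0..horizon-2

# Each advantage entry pairs the lambda return at step t+1 with the value at step t:
for t in range(horizon - 1):
    assert advantage[t] == target[t + 1] - value[t]
```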
I am very impressed with the Dreamer series of reinforcement learning algorithms. Thank you for your hard work!