Hello, thank you for open-sourcing the code! :-)
The code is really helpful for understanding the papers more deeply.
I am interested in LOLA, especially its policy gradient method (lola/train_pg.py).
As mentioned in the paper, this implementation uses the actor-critic method.
However, I could not fully understand the target computation code:
`self.target = self.sample_return + self.next_v` (code).
According to the reference (Chapter 13, page 274, the one-step actor-critic pseudocode), I wonder whether the target should use the one-step reward (i.e., the reward at timestep t) rather than the sampled return.
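To make the comparison concrete, here is a minimal NumPy sketch of the two targets I have in mind. This is not the LOLA code itself; `rewards`, `values`, and `gamma` are made-up placeholders, and the first target is only my reading of `self.sample_return + self.next_v`:

```python
import numpy as np

# Made-up placeholder rollout (hypothetical values, not from the repo).
gamma = 0.96
rewards = np.array([1.0, 0.0, 2.0, 1.0])       # r_t for t = 0..T-1
values = np.array([0.5, 0.4, 0.9, 0.3, 0.0])   # V(s_t) for t = 0..T (last entry bootstraps)

# Discounted return G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards.
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

# Target as I read it in train_pg.py: sampled return plus next-state value.
target_return_based = returns + values[1:]

# Target from the one-step actor-critic pseudocode (Sutton & Barto, Ch. 13):
# R_{t+1} + gamma * V(S_{t+1}).
target_one_step = rewards + gamma * values[1:]

print(target_return_based)
print(target_one_step)
```

The two targets clearly differ, so I would like to understand whether the return-based form is intentional or whether the one-step form was intended.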
Thank you for your time and consideration!