alshedivat / lola

Code release for Learning with Opponent-Learning Awareness and variations.

LOLA Policy Gradient Target Computation #8

Open dkkim93 opened 5 years ago

dkkim93 commented 5 years ago

Hello, thank you for open-sourcing the code! :-) The code is really helpful for understanding the papers more deeply.

I am interested in LOLA, especially its policy gradient method (lola/train_pg.py). As mentioned in the paper, this implementation corresponds to the actor-critic method.

However, I could not fully understand the target computation: self.target = self.sample_return + self.next_v (code). According to the reference (Sutton & Barto, Chapter 13, page 274, one-step actor-critic pseudocode), I wonder whether the target should use the step reward (i.e., the reward at timestep t) rather than the return.
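To make the difference I am asking about concrete, here is a minimal NumPy sketch (not the repository code; the reward and value numbers are made up) contrasting the one-step TD target from the textbook with the return-plus-value target as I read it in train_pg.py:

```python
import numpy as np

def one_step_td_target(rewards, next_values, gamma=0.96):
    """Textbook one-step actor-critic target: r_t + gamma * V(s_{t+1})."""
    return rewards + gamma * next_values

def return_plus_value_target(rewards, next_values, gamma=0.96):
    """Target as I read it in train_pg.py: G_t + V(s_{t+1}),
    where G_t is the discounted return from timestep t onward."""
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns + next_values

# Toy rollout just for illustration (hypothetical values, not from the repo)
rewards = np.array([1.0, 0.0, -1.0, 2.0])
next_values = np.array([0.5, 0.3, 0.8, 0.0])  # assumed V(s_{t+1}) estimates

print(one_step_td_target(rewards, next_values))
print(return_plus_value_target(rewards, next_values))
```

The two formulas clearly give different targets on the same rollout, which is what prompted my question.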

Thank you for your time and consideration!