alshedivat / lola

Code release for Learning with Opponent-Learning Awareness and variations.

LOLA Policy Gradient Target Computation #8

Open dkkim93 opened 5 years ago

dkkim93 commented 5 years ago

Hello, thank you for open-sourcing the code! :-) The code is really helpful for understanding the papers more deeply.

I am interested in LOLA, especially its policy gradient method (lola/train_pg.py). As mentioned in the paper, this implementation corresponds to the actor-critic method.

However, I could not fully understand the target computation: self.target = self.sample_return + self.next_v (code). According to the reference (Sutton & Barto, Chapter 13, page 274, one-step actor-critic pseudocode), I wonder whether the target should use the step reward (i.e., the reward at timestep t) rather than the return.
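To make the difference I am asking about concrete, here is a minimal NumPy sketch (not the repository code; the reward and value numbers are made up) contrasting the one-step TD target from the textbook with the return-plus-value target as I read it in train_pg.py:

```python
import numpy as np

def one_step_td_target(rewards, next_values, gamma=0.96):
    """Textbook one-step actor-critic target: r_t + gamma * V(s_{t+1})."""
    return rewards + gamma * next_values

def return_plus_value_target(rewards, next_values, gamma=0.96):
    """Target as I read it in train_pg.py: G_t + V(s_{t+1}),
    where G_t is the discounted return from timestep t onward."""
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns + next_values

# Toy rollout just for illustration (hypothetical values, not from the repo)
rewards = np.array([1.0, 0.0, -1.0, 2.0])
next_values = np.array([0.5, 0.3, 0.8, 0.0])  # assumed V(s_{t+1}) estimates

print(one_step_td_target(rewards, next_values))
print(return_plus_value_target(rewards, next_values))
```

The two formulas clearly give different targets on the same rollout, which is what prompted my question.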

Thank you for your time and consideration!