Khrylx / PyTorch-RL

PyTorch implementation of Deep Reinforcement Learning: Policy Gradient methods (TRPO, PPO, A2C) and Generative Adversarial Imitation Learning (GAIL). Fast Fisher vector product TRPO.
MIT License

How are we using rewards in imitation learning? #25

Closed · SiddharthSingi closed this issue 3 years ago

SiddharthSingi commented 3 years ago

Hi, these implementations are amazing, thank you for sharing them. I have a question about how, and more to the point why, we are using rewards in imitation learning: https://github.com/Khrylx/PyTorch-RL/blob/d94e1479403b2b918294d6e9b0dc7869e893aa1b/gail/gail_gym.py#L110 In the paper it is mentioned that instead of using the rewards to improve the policy, we use the log of the discriminator value, like so (last line before the end of the for loop): [Screenshot (283)] [Screenshot (282)]
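For anyone who cannot see the screenshots: the step I am referring to is (approximately) the policy update from Algorithm 1 of the GAIL paper, which takes a TRPO step with cost function $\log D_{w_{i+1}}(s, a)$, i.e. a KL-constrained natural gradient step along

$$
\hat{\mathbb{E}}_{\tau_i}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q(s, a)\right],
\qquad
Q(\bar{s}, \bar{a}) = \hat{\mathbb{E}}_{\tau_i}\!\left[\log D_{w_{i+1}}(s, a) \mid s_0 = \bar{s},\, a_0 = \bar{a}\right].
$$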

As you can see above, the policy update uses the log of the discriminator. Could you please explain why this term is being used instead of the reward?

Khrylx commented 3 years ago

The reward is computed exactly as the log of the discriminator, as shown in this line: https://github.com/Khrylx/PyTorch-RL/blob/d94e1479403b2b918294d6e9b0dc7869e893aa1b/gail/gail_gym.py#L99
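For reference, the logic of that line is roughly the following (a simplified sketch with assumed names, not the verbatim repository code):

```python
import math

import numpy as np
import torch


def expert_reward(discrim_net, state, action):
    """Surrogate GAIL reward: -log D(s, a).

    Here discrim_net is assumed to map a concatenated (state, action) vector to
    the probability that the pair was generated by the policy rather than the
    expert, which is how the discriminator is trained further down in gail_gym.py.
    """
    state_action = torch.as_tensor(np.hstack([state, action]), dtype=torch.float32)
    with torch.no_grad():
        # Reward is large when D(s, a) is close to 0, i.e. when the pair
        # looks like expert data to the discriminator.
        return -math.log(discrim_net(state_action)[0].item())
```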

SiddharthSingi commented 3 years ago

Thank you for your response. For anyone else wondering the same thing, please also check this line: https://github.com/Khrylx/PyTorch-RL/blob/d94e1479403b2b918294d6e9b0dc7869e893aa1b/core/agent.py#L42
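That line is where the substitution happens: during sample collection the environment reward is overwritten by the discriminator-based reward whenever a custom_reward callable is passed in. A minimal sketch of that logic (assumed names, not the verbatim repository code):

```python
def collect_step(env, state, action, custom_reward=None):
    """One step of rollout collection, mirroring the substitution in core/agent.py.

    Simplified sketch with assumed names, not the verbatim repository code.
    """
    next_state, env_reward, done, _ = env.step(action)

    reward = env_reward
    if custom_reward is not None:
        # In gail_gym.py, custom_reward is expert_reward, i.e. -log D(s, a),
        # so the reward stored for the policy update comes from the
        # discriminator, not from the environment.
        reward = custom_reward(state, action)

    return next_state, reward, done
```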

SiddharthSingi commented 3 years ago

> The reward is computed exactly as the log of the discriminator, as shown in this line: https://github.com/Khrylx/PyTorch-RL/blob/d94e1479403b2b918294d6e9b0dc7869e893aa1b/gail/gail_gym.py#L99

This is exactly how the reward should be calculated; however, everywhere it is written as the expectation of the derivative of the log term (see the images above). Can you please tell me why the 'minus' sign is removed in the expectation term?

zbzhu99 commented 3 years ago

> > The reward is computed exactly as the log of the discriminator, as shown in this line: https://github.com/Khrylx/PyTorch-RL/blob/d94e1479403b2b918294d6e9b0dc7869e893aa1b/gail/gail_gym.py#L99
>
> This is exactly how the reward should be calculated; however, everywhere it is written as the expectation of the derivative of the log term (see the images above). Can you please tell me why the 'minus' sign is removed in the expectation term?

The D in this code actually plays the opposite role to the D in the formula from the paper.

https://github.com/Khrylx/PyTorch-RL/blob/d94e1479403b2b918294d6e9b0dc7869e893aa1b/gail/gail_gym.py#L125-L126

From the above two lines of code, you can see that the discriminator is trained to output 1 for the generated data g_o and 0 for the expert data e_o. The goal of the policy update is then to minimize the output of the discriminator, i.e., to maximize the -log(D(g_o)) reward, which is what makes this adversarial training.
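In other words, those two lines amount to a standard binary cross-entropy update with label 1 for policy-generated pairs and label 0 for expert pairs, roughly like this (a simplified sketch with assumed names, not the verbatim repository code):

```python
import torch
import torch.nn as nn


def discriminator_step(discrim_net, optimizer, gen_sa, expert_sa):
    """One discriminator update: push D(generated) -> 1 and D(expert) -> 0.

    gen_sa / expert_sa are (batch, state_dim + action_dim) tensors of
    concatenated state-action pairs. Simplified sketch with assumed names,
    not the verbatim repository code.
    """
    discrim_criterion = nn.BCELoss()

    g_o = discrim_net(gen_sa)     # outputs on policy-generated pairs
    e_o = discrim_net(expert_sa)  # outputs on expert demonstrations
    loss = (discrim_criterion(g_o, torch.ones_like(g_o))
            + discrim_criterion(e_o, torch.zeros_like(e_o)))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The policy step then maximizes -log D(s, a) on its own samples, i.e. it
    # tries to drive D(generated) back toward 0 -- the adversarial game.
    return loss.item()
```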