DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] About log of policy_gradient_loss #1943

Closed · d505 closed this issue 2 weeks ago

d505 commented 3 weeks ago

❓ Question

Hi!

I have a question about the policy_gradient_loss logged by PPO, specifically this part: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/ppo/ppo.py#L229-L231

Am I correct in understanding that policy_gradient_loss generally gets smaller as training progresses? (It is a loss and it is negative, so I mean it moves further away from 0.) I thought the sum of the advantages would grow because the policy gradually selects only good actions. I am not sure which value it should converge to, but I expected it to become increasingly negative rather than approach 0.

I couldn't figure it out from this earlier question: https://github.com/DLR-RM/stable-baselines3/issues/602

The reason for my question is that in my experiment it was converging to 0. [training plot: policy_gradient_loss converging toward 0]


giangbang commented 3 weeks ago

It should always be around zero, because the code intentionally normalizes the advantages to have zero mean: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/ppo/ppo.py#L223

Additionally, I think it is misleading to interpret the policy_gradient_loss value as a "loss". policy_gradient_loss is just a convenient way to compute the policy gradient by using PyTorch's automatic differentiation, and it is derived from the theory. It has little value as a performance metric; to see how well the algorithm is learning, look at the evaluated rewards.
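For illustration, a minimal sketch of why the logged value ends up near zero (the variable names and numbers here are my own assumptions; only the normalization step mirrors the line linked above):

```python
import torch

# Toy advantages from a rollout (made-up numbers).
advantages = torch.tensor([2.0, -1.0, 0.5, 3.0, -0.5])

# Zero-mean / unit-std normalization, as in the linked ppo.py line.
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# At the start of each update the importance ratio is ~1, so the surrogate
# is roughly -(advantages * ratio).mean(), which is ~0 by construction
# because the advantages were just centered.
ratio = torch.ones_like(advantages)
policy_gradient_loss = -(advantages * ratio).mean()
print(policy_gradient_loss)  # tensor close to 0
```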

d505 commented 3 weeks ago

Thanks for your reply @giangbang !

I understand why it stays near zero, and I understand that task performance and the value of policy_gradient_loss are not the same thing.

I also assumed that if policy_gradient_loss does not converge to some low value, then no learning is happening, for example when the average episode reward stops improving. Is that correct?

giangbang commented 2 weeks ago

As I said, policy_gradient_loss is just a convenient way to compute the gradient of the policy, so its value does not necessarily reflect learning progress. What do I mean by "a convenient way"? To update a neural network we need a gradient, but what is the gradient in reinforcement learning? Unlike other learning settings, RL does not have a "loss function"; it optimizes the cumulative reward directly. The question is how to compute the gradient of the cumulative reward. This is harder than it sounds: you simulate a bunch of games and collect the rewards, but those rewards are just scalars and are not differentiable, so how do you turn them into a gradient? There is a theorem for exactly that, called the policy gradient theorem. Its statement looks roughly like this:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\, Q^{\pi}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big]$$

That is, the gradient we are looking for equals the expectation of the Q value multiplied by the score function of the policy (just a fancy name for the gradient of the log probability). At this point, it should be obvious how to implement this formulation in PyTorch, with something like this:

Q = estimate_q_val()  # return estimate for the sampled action, treated as a constant
loss = -(torch.log(action_prob) * Q).mean()  # negated so that minimizing it performs gradient ascent
loss.backward()  # the resulting gradient matches the policy gradient theorem above

The actual implementation uses the advantage instead of the Q value, where advantage = Q - average(Q); that is why the advantages average to zero and the logged loss hovers around zero. This substitution does not change the expected gradient, but it does reduce the gradient variance. Does this loss value reflect anything, then? No. It is just a way to compute the gradient using an automatic differentiation library.
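To make that concrete, here is a small self-contained sketch (illustrative only, not SB3 code; the tiny categorical policy and the made-up Q values are assumptions): subtracting a constant baseline from the Q values changes the value of the surrogate loss but leaves the gradient with respect to the policy parameters unchanged.

```python
import torch

# Tiny categorical policy over 3 actions with learnable logits (illustrative).
logits = torch.zeros(3, requires_grad=True)
probs = torch.softmax(logits, dim=0)
Q = torch.tensor([1.0, 2.0, 3.0])   # made-up action values
baseline = Q.mean()                 # a constant baseline

def surrogate(values):
    # Exact expectation over the policy: sum_a pi(a) * values(a) * log pi(a),
    # negated so that minimizing it performs gradient ascent on the objective.
    return -(probs.detach() * values * torch.log(probs)).sum()

loss_q = surrogate(Q)               # surrogate built from raw Q values
loss_adv = surrogate(Q - baseline)  # surrogate built from zero-mean advantages

grad_q = torch.autograd.grad(loss_q, logits, retain_graph=True)[0]
grad_adv = torch.autograd.grad(loss_adv, logits)[0]

print(loss_q.item(), loss_adv.item())  # different values (the advantage one sits near 0)
print(grad_q, grad_adv)                # identical gradients (up to floating point)
```

The gradients match because sum_a pi(a) * grad log pi(a) = grad sum_a pi(a) = 0, so a constant baseline contributes nothing to the gradient while shifting the loss value.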

d505 commented 2 weeks ago

The objective function would be as follows:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\, \textstyle\sum_t r_t \,\Big]$$

I thought the following equation was its gradient after transforming that expression:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\, Q^{\pi}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big]$$

So I assumed that policy_loss represents the same kind of objective maximization as the loss in ordinary machine learning algorithms, i.e. that it directly represents maximizing the expected sum of rewards.
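Concretely, the transformation I had in mind is the standard log-derivative (score function) step, written here in a simplified one-step form where $Q$ does not depend on $\theta$:

$$
\nabla_\theta J(\theta)
= \nabla_\theta \int \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\, da
= \int \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\, da
= \mathbb{E}_{a \sim \pi_\theta}\big[\, Q^{\pi}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big]
$$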

giangbang commented 2 weeks ago

@d505 It is more subtle than that. The thing is, two functions can have the same gradient while their values are not the same. If it still bothers you, try transforming E[Q * log p] yourself and see whether you can get back to something like the total reward. In any case, this discussion is straying away from stable-baselines3.
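As a trivial illustration of that statement (unrelated to the SB3 code itself): two functions that differ by a constant have identical gradients everywhere, yet watching the value of one tells you nothing about the value of the other.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

f = x ** 2          # value: 4.0
g = x ** 2 + 100.0  # value: 104.0, but the same gradient as f

grad_f = torch.autograd.grad(f, x)[0]
grad_g = torch.autograd.grad(g, x)[0]

print(f.item(), g.item())            # 4.0 vs 104.0: very different values
print(grad_f.item(), grad_g.item())  # 4.0 and 4.0: identical gradients
```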

d505 commented 2 weeks ago

Thank you very much. I will think about it.