DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question - Theory] PPO clipping/performance collapse questions #1616

Closed · verbose-void closed this issue 1 year ago

verbose-void commented 1 year ago

❓ Question

I understand that the main contribution of PPO is the clipping mechanism -- the surrogate objective clips the probability ratio between the policy that gathered the rollout (old log probs) and the current policy at the train step, scaled by the advantage (https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/ppo/ppo.py#L221).

I also understand that this clipping imposes no actual hard constraint on how far the policy can venture from the previous one -- the policy is simply not incentivized to stray too far, based on the clip value/epsilon.
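
For reference, my mental model of that objective is roughly the following sketch (plain PyTorch with illustrative names, not SB3's exact code):

```python
import torch


def clipped_surrogate_loss(log_prob, old_log_prob, advantages, clip_range=0.2):
    # Probability ratio between the current policy and the rollout-time policy
    ratio = torch.exp(log_prob - old_log_prob)
    # Unclipped and clipped surrogate objectives
    unclipped = advantages * ratio
    clipped = advantages * torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    # Pessimistic (element-wise minimum) of the two, negated because we minimize
    return -torch.min(unclipped, clipped).mean()
```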

What I'm having a hard time understanding is why the policy is still allowed to stray further and further away. It also doesn't make much sense to me that, even though this pseudo-regularization step is applied, the policy still finds its way further and further away, particularly during the middle part of training, when the KL divergence gets excited in general.

Another part of this question is: why might the policy gradient loss increase over time while the episode reward mean still trends upwards? I thought that was supposed to be controlled by the clip mechanism.

Sorry for this question going in many different directions; I'm having a hard time getting stability in my PPO trainings on my custom environment and am looking to different resources to help me stabilize it.

I'm experiencing a lot of performance collapses that are slow to recover from, and it's confusing as to why. I've tweaked most parameters and gotten much, much more stable results, but training still predictably hits a huge performance drop, usually around 20% of the way through my iterations. I've added schedules for the learning rate and the clip range, and tuned the hyperparameters/architectures according to the different metrics and my experience with other RL/ML algos.
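
(The schedules I mention are just callables of the remaining training progress, roughly like the sketch below; the values are illustrative, not my actual settings.)

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Linear decay from initial_value to 0 over training.

    SB3 calls the schedule with progress_remaining, which goes from 1 to 0.
    """
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_value

    return schedule


# e.g. PPO(..., learning_rate=linear_schedule(3e-4), clip_range=linear_schedule(0.2))
```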

I can't share too much about what I'm training or a lot of the details, but does anyone have any general advice/resources/questions for addressing this?


araffin commented 1 year ago

Hello,

> this clipping imposes no actual hard constraint on how far the policy can venture from the previous one

Yes, clipping is simple to implement but doesn't prevent the policy from changing a lot. That's why we have a target KL divergence parameter (`target_kl`) that helps a bit: https://github.com/DLR-RM/stable-baselines3/blob/5abd50a853e0f667f48ae10769f478c4972eda35/stable_baselines3/ppo/ppo.py#L264-L268
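
Roughly, what those lines do is estimate an approximate KL divergence and stop the current round of update epochs when it exceeds the threshold (a simplified sketch, not the exact SB3 code):

```python
import torch


def should_stop_epoch(log_prob, old_log_prob, target_kl):
    # Low-variance approximate KL estimate between old and new policy
    with torch.no_grad():
        log_ratio = log_prob - old_log_prob
        approx_kl = torch.mean((torch.exp(log_ratio) - 1) - log_ratio).item()
    # Early-stop the update epochs if the policy has moved too far
    return target_kl is not None and approx_kl > 1.5 * target_kl
```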

A better example is probably TRPO (which is actually usually competitive with PPO, but harder to implement), which does a line search and uses a hard constraint: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/35f06254ba3a470e8bbe1dff0bfc6f319ee2431a/sb3_contrib/trpo/trpo.py#L336-L351
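
Schematically, that line search shrinks the proposed step until the hard KL constraint holds and the surrogate objective does not get worse; the helpers below are hypothetical and only illustrate the idea, they are not the sb3-contrib code:

```python
def backtracking_line_search(old_params, full_step, kl_of, surrogate_of,
                             old_surrogate, max_kl, max_backtracks=10, shrink=0.5):
    # kl_of(params) and surrogate_of(params) are hypothetical callables that
    # evaluate the candidate policy; this is a schematic sketch only.
    step_size = 1.0
    for _ in range(max_backtracks):
        candidate = [p + step_size * s for p, s in zip(old_params, full_step)]
        # Accept the step only if it stays inside the trust region
        # and does not decrease the surrogate objective.
        if kl_of(candidate) <= max_kl and surrogate_of(candidate) >= old_surrogate:
            return candidate
        step_size *= shrink  # otherwise backtrack: try a smaller step
    return old_params  # no acceptable step found, keep the old policy
```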

With PPO, you can also play with different variables to avoid catastrophic drops in performance (see the PPO ICLR blog post).
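
For example, most of the knobs that matter for stability can be set directly on the PPO constructor (the values below are only illustrative starting points, not recommendations for your task):

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "Pendulum-v1",       # replace with your env
    learning_rate=3e-4,  # consider a decaying schedule
    n_steps=2048,
    batch_size=64,
    n_epochs=10,         # fewer epochs = smaller policy change per update
    clip_range=0.2,      # smaller clip range = more conservative updates
    ent_coef=0.0,
    max_grad_norm=0.5,
    target_kl=0.03,      # early-stop the update if the policy moves too far
    verbose=1,
)
model.learn(total_timesteps=100_000)
```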

> why might the policy gradient loss increase over time while the episode reward mean still trends upwards?

The policy gradient loss value usually doesn't have much meaning (see the Spinning Up course); what matters is indeed the episode return, and also sanity metrics like explained variance.
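
For reference, explained variance tells you how well the value function predicts the empirical returns (1 is perfect, 0 or below means it is no better than predicting the mean); SB3 logs it during training and also exposes the helper directly (the toy numbers below are illustrative):

```python
import numpy as np
from stable_baselines3.common.utils import explained_variance

# value predictions vs. empirical returns (illustrative arrays)
values = np.array([1.0, 0.5, 2.0, 1.5])
returns = np.array([1.2, 0.4, 1.8, 1.6])

# 1 - Var(returns - values) / Var(returns); close to 1 means the critic is useful
print(explained_variance(values, returns))
```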

> I can't share too much about what I'm training or a lot of the details, but does anyone have any general advice/resources/questions for addressing this?

I would also recommend watching RL Tips and Tricks (https://www.youtube.com/watch?v=Ikngt0_DXJg) and starting simple.

verbose-void commented 1 year ago

thanks!