Closed. MagiFeeney closed this issue 1 year ago.
Probably a duplicate of https://github.com/DLR-RM/stable-baselines3/issues/348#issuecomment-795147337
Why not just use the single reward's running variance to achieve that?
You can also give that a try; I would be happy to see a comparison (though I don't think it will make a big difference, the main thing is to scale the reward and the return to make learning the value function easier).
Those would be helpful, I will check it out! I'm not confident enough to conclude that using the single reward would be worse, but I did override the normalize_reward function to use a separate RunningMeanStd (rew_rms) on the raw rewards; the results showed it didn't perform well, so I quickly dropped this option.
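A hedged sketch of that kind of override, assuming the stable_baselines3 VecNormalize and RunningMeanStd classes: a second RunningMeanStd (named rew_rms, as above) tracks the raw single-step rewards, and normalize_reward divides by its running std instead of the discounted return's. This is an illustration, not the exact code used in the experiment; updating the statistics inside normalize_reward is a simplification, since the library updates its running statistics during step_wait.

```python
import numpy as np

from stable_baselines3.common.running_mean_std import RunningMeanStd
from stable_baselines3.common.vec_env import VecNormalize


class VecNormalizeRawReward(VecNormalize):
    """VecNormalize variant that scales rewards by the running std of the
    raw single-step reward instead of the discounted return (sketch only)."""

    def __init__(self, venv, **kwargs):
        super().__init__(venv, **kwargs)
        # Separate statistics for the raw rewards (scalar per env step).
        self.rew_rms = RunningMeanStd(shape=())

    def normalize_reward(self, reward: np.ndarray) -> np.ndarray:
        # Simplification: the statistics are updated here; VecNormalize itself
        # updates its running statistics inside step_wait(), not in this method.
        if self.training and self.norm_reward:
            self.rew_rms.update(reward)
        if self.norm_reward:
            reward = np.clip(
                reward / np.sqrt(self.rew_rms.var + self.epsilon),
                -self.clip_reward,
                self.clip_reward,
            )
        return reward
```

Wrapping the vectorized env with VecNormalizeRawReward(venv) instead of VecNormalize would then give the raw-reward variant for a comparison run.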
❓ Question
Hello,

It confuses me a lot that the statistics of the discounted rewards (the returns) are used to rescale a different quantity, the reward itself. This seems to be the default choice for PPO. Is there any intuition for interpreting this choice? Why not just use the single reward's running variance to achieve that? From what I can tell, its effect looks a lot like learning rate annealing, or is there something I'm missing?
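For reference, a minimal sketch of the scheme the question refers to, assuming a VecNormalize-style vectorized setup (the class and attribute names below are illustrative, not the library source): a discounted return is accumulated per environment, a running mean/std is updated with that return, and each reward is divided by the return's running standard deviation.

```python
import numpy as np


class RunningMeanStd:
    """Running mean/variance with parallel (Welford-style) batch updates."""

    def __init__(self, epsilon: float = 1e-4):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon

    def update(self, batch: np.ndarray) -> None:
        batch_mean, batch_var, batch_count = batch.mean(), batch.var(), batch.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = self.var * self.count + batch_var * batch_count
        m2 += delta ** 2 * self.count * batch_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total


class ReturnBasedRewardScaler:
    """Scale rewards by the running std of the *discounted return*
    (the default VecNormalize-style behaviour being asked about)."""

    def __init__(self, num_envs: int, gamma: float = 0.99, epsilon: float = 1e-8):
        self.gamma = gamma
        self.epsilon = epsilon
        self.ret = np.zeros(num_envs)  # running discounted return, one per env
        self.ret_rms = RunningMeanStd()

    def normalize_reward(self, reward: np.ndarray, done: np.ndarray) -> np.ndarray:
        # Accumulate the discounted return, update its statistics,
        # then use its variance (not the reward's) to rescale the reward.
        self.ret = self.ret * self.gamma + reward
        self.ret_rms.update(self.ret)
        self.ret[done] = 0.0  # reset the return at episode boundaries
        return reward / np.sqrt(self.ret_rms.var + self.epsilon)
```

The only substantive difference from normalizing by the raw reward's variance is which quantity feeds the running statistics.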
Checklist

I have checked that there is no similar issue in the repo