Closed. MagiFeeney closed this issue 1 year ago.
Probably a duplicate of https://github.com/DLR-RM/stable-baselines3/issues/348#issuecomment-795147337
Why not just use the single reward's running variance to achieve that?
You can also give that a try; I would be happy to see a comparison (though I don't think it will make a big difference, the main thing is to scale the reward and the return to make learning the value function easier).
Those would be helpful, I will check it out! I'm not confident enough to conclude that using the single reward would be worse, but I did override the normalize_reward function to use a separate RunningMeanStd (rew_rms) on the raw rewards; the results showed it didn't perform well, so I quickly dropped this option.
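A hedged sketch of that kind of override, assuming the stable_baselines3 VecNormalize and RunningMeanStd classes: a second RunningMeanStd (named rew_rms, as above) tracks the raw single-step rewards, and normalize_reward divides by its running std instead of the discounted return's. This is an illustration, not the exact code used in the experiment; updating the statistics inside normalize_reward is a simplification, since the library updates its running statistics during step_wait.

```python
import numpy as np

from stable_baselines3.common.running_mean_std import RunningMeanStd
from stable_baselines3.common.vec_env import VecNormalize


class VecNormalizeRawReward(VecNormalize):
    """VecNormalize variant that scales rewards by the running std of the
    raw single-step reward instead of the discounted return (sketch only)."""

    def __init__(self, venv, **kwargs):
        super().__init__(venv, **kwargs)
        # Separate statistics for the raw rewards (scalar per env step).
        self.rew_rms = RunningMeanStd(shape=())

    def normalize_reward(self, reward: np.ndarray) -> np.ndarray:
        # Simplification: the statistics are updated here; VecNormalize itself
        # updates its running statistics inside step_wait(), not in this method.
        if self.training and self.norm_reward:
            self.rew_rms.update(reward)
        if self.norm_reward:
            reward = np.clip(
                reward / np.sqrt(self.rew_rms.var + self.epsilon),
                -self.clip_reward,
                self.clip_reward,
            )
        return reward
```

Wrapping the vectorized env with VecNormalizeRawReward(venv) instead of VecNormalize would then give the raw-reward variant for a comparison run.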
❓ Question
Hello,

It confuses me a lot that the statistics of the discounted rewards (the returns) are used to rescale a different quantity, the reward itself. This seems to be the default choice for PPO. Is there any intuition for interpreting this choice? Why not just use the single reward's running variance to achieve that? From what I can tell, its effect looks a lot like learning rate annealing, or is there something I'm missing?
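For reference, a minimal sketch of the scheme the question refers to, assuming a VecNormalize-style vectorized setup (the class and attribute names below are illustrative, not the library source): a discounted return is accumulated per environment, a running mean/std is updated with that return, and each reward is divided by the return's running standard deviation.

```python
import numpy as np


class RunningMeanStd:
    """Running mean/variance with parallel (Welford-style) batch updates."""

    def __init__(self, epsilon: float = 1e-4):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon

    def update(self, batch: np.ndarray) -> None:
        batch_mean, batch_var, batch_count = batch.mean(), batch.var(), batch.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = self.var * self.count + batch_var * batch_count
        m2 += delta ** 2 * self.count * batch_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total


class ReturnBasedRewardScaler:
    """Scale rewards by the running std of the *discounted return*
    (the default VecNormalize-style behaviour being asked about)."""

    def __init__(self, num_envs: int, gamma: float = 0.99, epsilon: float = 1e-8):
        self.gamma = gamma
        self.epsilon = epsilon
        self.ret = np.zeros(num_envs)  # running discounted return, one per env
        self.ret_rms = RunningMeanStd()

    def normalize_reward(self, reward: np.ndarray, done: np.ndarray) -> np.ndarray:
        # Accumulate the discounted return, update its statistics,
        # then use its variance (not the reward's) to rescale the reward.
        self.ret = self.ret * self.gamma + reward
        self.ret_rms.update(self.ret)
        self.ret[done] = 0.0  # reset the return at episode boundaries
        return reward / np.sqrt(self.ret_rms.var + self.epsilon)
```

The only substantive difference from normalizing by the raw reward's variance is which quantity feeds the running statistics.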
Checklist

I have checked that there is no similar issue in the repo