Denys88 / rl_games

RL implementations

Value Normalization #182

Closed supersglzc closed 2 years ago

supersglzc commented 2 years ago

Hi, thanks for the amazing work!

I am wondering how important value normalization is. When I disable it in some tasks, especially ShadowHand, the PPO agent doesn't work anymore. I looked at the code, and it seems that it normalizes the returns and the predicted (old) values before calculating the loss. However, the (new) value output by the model is not normalized (due to the unnorm function). So why does this work, or did I misunderstand something?

Also, if I want to test Isaac Gym with the SAC code, can I do that with rl_games?

Denys88 commented 2 years ago

Hi @supersglzc, I think I have SAC examples in my IG fork, but I never tried to find good hyperparameters. Better to ask @ViktorM, who is the owner of the IG integration. For the value normalization:

  1. Denormalize the value output during trajectory gathering.
  2. Gather statistics from both the predicted values and the returns and normalize them.
  3. Train the value function on the normalized data from step 2.
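In code, that flow looks roughly like the sketch below (a minimal illustration, assuming a hypothetical `ValueNormalizer` helper with running mean/variance; this is not the exact rl_games implementation):

```python
import torch

class ValueNormalizer:
    """Running mean/std over value targets (illustrative sketch, not rl_games code)."""

    def __init__(self, eps=1e-5):
        self.mean = torch.zeros(1)
        self.var = torch.ones(1)
        self.count = eps

    def update(self, x):
        # Welford-style running update from a batch of returns/values.
        b_mean, b_var, b_count = x.mean(), x.var(unbiased=False), x.numel()
        delta = b_mean - self.mean
        total = self.count + b_count
        self.mean = self.mean + delta * b_count / total
        self.var = (self.var * self.count + b_var * b_count
                    + delta ** 2 * self.count * b_count / total) / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / torch.sqrt(self.var + 1e-8)

    def denormalize(self, x):
        return x * torch.sqrt(self.var + 1e-8) + self.mean


# 1) rollout: the critic predicts in normalized space, so denormalize before GAE/returns
#    values = normalizer.denormalize(critic(obs))
# 2) before the value loss: update stats on returns and old values, then normalize both
#    normalizer.update(torch.cat([returns, old_values]))
#    returns_n, old_values_n = normalizer.normalize(returns), normalizer.normalize(old_values)
# 3) train the critic against the normalized targets
```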

supersglzc commented 2 years ago

Thanks for the clarification @Denys88.

  1. Could you point out the SAC examples in the IG fork?
  2. I am curious why we do the normalization back and forth. Could we just normalize the reward instead?
Denys88 commented 2 years ago
  1. I think you can take this config as a baseline: https://github.com/Denys88/rl_games/blob/master/rl_games/configs/brax/sac_humanoid.yaml and use it in place of HumanoidPPO.yaml (with a few small changes). In IsaacGym the train config is named taskName + PPO.
  2. We cannot normalize a single-step reward. We can scale it, but we cannot shift it; only the cumulative reward can be shifted. For example, if the reward is 0 at every frame and you decide to add +1, it makes a huge difference.
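To make that concrete with a small (hypothetical) numeric example: a per-step shift changes the total return by an amount proportional to episode length, while a positive scale does not change which outcome is preferred.

```python
# Two hypothetical episodes with reward 0 at every step; one ends sooner than the other.
short_ep = [0.0] * 10
long_ep = [0.0] * 100

# Original returns are identical, so the agent is indifferent between them.
print(sum(short_ep), sum(long_ep))                                  # 0.0 0.0

# Shifting every per-step reward by +1 now strongly rewards staying alive longer.
print(sum(r + 1 for r in short_ep), sum(r + 1 for r in long_ep))    # 10.0 100.0

# Scaling by a positive constant leaves the preference ordering unchanged.
print(sum(2 * r for r in short_ep), sum(2 * r for r in long_ep))    # 0.0 0.0
```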
Denys88 commented 2 years ago

Accidentally closed it, reopened :) Could you let me know if you get SAC working?

supersglzc commented 2 years ago

Hi, I tried it on the original IG repo and it seems I would have to make a lot of modifications. Should I test it on your IG fork instead?

Also, I would like to figure out how SAC uses input normalization. I didn't find it used in the SAC code lol (no calls to obs_norm).

Denys88 commented 2 years ago

Please try my fork with the latest master. If it doesn't work, I'll have some time on the weekend to try :) It looks like observation normalization in SAC was removed during the last refactoring. I'll fix it.
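For reference, observation normalization in an off-policy agent usually looks something like the sketch below (assuming a hypothetical `obs_rms` helper that exposes per-feature running `mean`/`var` tensors and an `update()` method; this is not the rl_games implementation):

```python
import torch

def normalize_obs(obs, obs_rms, clip=5.0, update_stats=True):
    """Normalize observations with running statistics before the actor/critic forward pass.

    obs_rms is a hypothetical running mean/variance tracker with per-feature
    .mean and .var tensors and an .update(batch) method.
    """
    if update_stats:
        obs_rms.update(obs)
    normed = (obs - obs_rms.mean) / torch.sqrt(obs_rms.var + 1e-8)
    # Clipping keeps outliers from dominating early training, before the stats settle.
    return torch.clamp(normed, -clip, clip)

# actor_in = normalize_obs(obs_batch, obs_rms)
# target_in = normalize_obs(next_obs_batch, obs_rms, update_stats=False)  # reuse stats for targets
```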

supersglzc commented 2 years ago

Thanks, I'll get back to you after trying!

supersglzc commented 2 years ago
  1. I tried SAC on the humanoid task and it worked! But it only gets a cumulative reward of about 300 after one hour of training, which matches my own implementation. A recent publication reported that SAC can get a reward of 6,000 on the IG humanoid. Is that due to the missing input normalization?
  2. Regarding "we cannot normalize a single-step reward, only scale it": now I understand that we cannot directly subtract a mean from the per-step reward. However, in the implementation we still subtract the mean and divide by the standard deviation to normalize, right (self.norm_only is False)? The only difference is that we do the normalization on the value instead of the reward. Could you point me to some resources on this kind of value normalization trick?

:1st_place_medal: I was surprised by how effective the value normalization trick is. So can we do value normalization in SAC too?

Denys88 commented 2 years ago
  1. I'll try to create a good yaml file for Humanoid, and I'm working on input normalization for SAC. Value normalization could be added too, but there is one place where we add entropy as part of the loss, and the reward/entropy ratio impacts SAC training a lot.
  2. It is a trick that is used only here. But you can read about a more complex trick from Google/DeepMind called PopArt normalization ("Learning values across many orders of magnitude", van Hasselt et al., 2016); in my experiments it works worse))) I have another branch where I compared them.
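For context, the extra step that makes PopArt more complex is that whenever the running statistics change, the critic's output layer is rescaled so that its denormalized predictions stay the same. A rough sketch of that correction (hypothetical helper, not code from this repo):

```python
import torch

def popart_preserve_outputs(out_layer: torch.nn.Linear,
                            old_mean, old_std, new_mean, new_std):
    """Rescale the critic's last linear layer after a statistics update so that
    denormalized outputs are preserved: w' = w * old_std / new_std,
    b' = (old_std * b + old_mean - new_mean) / new_std."""
    with torch.no_grad():
        out_layer.weight.mul_(old_std / new_std)
        out_layer.bias.mul_(old_std).add_(old_mean - new_mean).div_(new_std)
```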
Denys88 commented 2 years ago

HumanoidPPO.zip I ran it with num_envs=64. Here is the SAC config that works for me; at least I got a >4k reward in this PR: https://github.com/Denys88/rl_games/pull/186.

supersglzc commented 2 years ago

Thanks for the patience, I will try it tomorrow. FYI, the MAPPO paper also mentions the value normalization trick.

Denys88 commented 2 years ago

@supersglzc they did it without collecting a running mean/std, but it worked for them :) Btw, I forgot to mention: I'm still doing runs to find the best hyperparameters for SAC. No success yet; those PPO parameters are too good )

supersglzc commented 2 years ago

Yes, PPO achieved remarkable performance on those tasks. As for SAC, if you don't consider sample efficiency, it also works with 2,048 envs.

Thanks for your help, and I'll close the issue :)