Hi, @supersglzc I think I have SAC examples in my IG fork, but I never tried to find good hyperparameters. Better to ask @ViktorM, who is the owner of the IG integration. For the value: 1) denormalize the value output during trajectory gathering; 2) get statistics from both the predicted values and the returns and normalize them; 3) train the value function with the normalized data from step 2.
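For reference, here is a minimal sketch of that scheme. The `ValueNormalizer` class and its Welford-style update are illustrative only, not the exact rl_games implementation:

```python
import torch

class ValueNormalizer:
    """Running mean/std of value targets, kept outside the critic (illustrative)."""

    def __init__(self, eps=1e-5):
        self.mean = torch.zeros(1)
        self.var = torch.ones(1)
        self.count = eps

    def update(self, x):
        # Welford-style merge of batch statistics into the running statistics.
        batch_mean = x.mean()
        batch_var = x.var(unbiased=False)
        batch_count = x.numel()
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        new_var = (m_a + m_b + delta.pow(2) * self.count * batch_count / total) / total
        self.mean, self.var, self.count = new_mean, new_var, total

    def normalize(self, x):
        return (x - self.mean) / torch.sqrt(self.var + 1e-8)

    def denormalize(self, x):
        return x * torch.sqrt(self.var + 1e-8) + self.mean


# 1) rollout: the critic predicts in normalized space, denormalize for GAE/returns
#    value = normalizer.denormalize(critic(obs))
# 2) after the rollout: update statistics from returns (and old values)
#    normalizer.update(torch.cat([returns.flatten(), old_values.flatten()]))
# 3) training: regress the critic against normalized targets
#    value_loss = (critic(obs) - normalizer.normalize(returns)).pow(2).mean()
```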
Thanks for the clarification @Denys88.
Accidentally closed it, reopened :) Could you let me know if you make SAC work?
Hi, I tried it on the original IG repo and it seems like I have to do a lot of modifications. Should I test it on your IG fork?
Also, I would like to figure out how SAC uses input normalization. I didn't find the usage in SAC lol (no calls to obs_norm).
Please try my fork with the latest master. If not, I'll have some time on the weekend to try :) Looks like obs normalization in SAC was removed during the last refactoring. I'll fix it.
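For context, obs normalization in SAC usually just means whitening observations with running statistics before they reach the actor and the critics. A minimal sketch, using a simplified exponential-moving-average update and hypothetical names rather than the actual rl_games code:

```python
import torch
import torch.nn as nn

class ObsNormalizer(nn.Module):
    """Keeps running mean/var of observations and whitens them before the networks."""

    def __init__(self, obs_dim, momentum=1e-3, clip=5.0):
        super().__init__()
        self.register_buffer("mean", torch.zeros(obs_dim))
        self.register_buffer("var", torch.ones(obs_dim))
        self.momentum = momentum
        self.clip = clip

    @torch.no_grad()
    def update(self, obs):
        # Simplified EMA update instead of exact running statistics.
        self.mean.lerp_(obs.mean(dim=0), self.momentum)
        self.var.lerp_(obs.var(dim=0, unbiased=False), self.momentum)

    def forward(self, obs):
        normed = (obs - self.mean) / torch.sqrt(self.var + 1e-8)
        return normed.clamp(-self.clip, self.clip)


# SAC usage: every network call sees normalized observations.
# obs_norm.update(batch_obs)                  # on each training batch
# action = actor(obs_norm(obs))               # policy
# q1, q2 = critic(obs_norm(obs), action)      # critics
```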
Thanks, I'll get back to you after trying!
- We cannot normalize a single reward. We can scale it, but we cannot shift it; only the cumulative reward can be shifted. For example, if you get a reward of 0 each frame and decide to add +1, it will make a huge difference.
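A tiny worked example of that point, with illustrative numbers only: scaling every reward preserves which trajectory is better, while adding a constant per frame can flip the ranking in favor of longer episodes.

```python
gamma = 0.99

def discounted_return(rewards):
    return sum(r * gamma ** t for t, r in enumerate(rewards))

short_traj = [1.0] * 10    # short episode with reward each frame
long_traj = [0.0] * 100    # long episode with zero reward each frame

# Original rewards: the short trajectory is clearly better.
print(discounted_return(short_traj), discounted_return(long_traj))
# Scale by 0.5: both returns are halved, ranking unchanged.
print(discounted_return([0.5 * r for r in short_traj]),
      discounted_return([0.5 * r for r in long_traj]))
# Shift by +1 per frame: the long zero-reward episode now dominates.
print(discounted_return([r + 1.0 for r in short_traj]),
      discounted_return([r + 1.0 for r in long_traj]))
```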
:1st_place_medal: I was surprised by how effective the value normalization trick is. So can we do value normalization in SAC too?
HumanoidPPO.zip I ran it with num_envs=64. Here is the SAC config that works for me. At least I got >4k reward in this PR: https://github.com/Denys88/rl_games/pull/186.
Thanks for the patience, I will try it tomorrow. FYI the MAPPO paper also mentions the value normalization trick.
@supersglzc they did it without collecting a running mean/std. But it worked for them :) btw I forgot to mention: I am still doing runs to find the best hyperparams for SAC. Didn't succeed yet. Those PPO parameters are too good )
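One way to normalize value targets without running statistics is to use only the current batch's return statistics; a hedged sketch of that idea (not necessarily exactly what the MAPPO authors did):

```python
import torch

def per_batch_normalized_value_loss(values, returns, eps=1e-8):
    # Use only the current batch's return statistics, no running mean/std.
    mean = returns.mean()
    std = returns.std() + eps
    norm_returns = (returns - mean) / std
    norm_values = (values - mean) / std
    return (norm_values - norm_returns).pow(2).mean()
```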
Yes, PPO achieved remarkable performance on those tasks. For SAC, without considering sample efficiency, it also works with 2,048 envs.
Thanks for your help, I'll close the issue :)
Hi, thanks for the amazing work!
I am wondering how important value normalization is. When I disable value normalization in some tasks, especially ShadowHand, the PPO agent doesn't work anymore. I looked up the code, and it seems to me that it normalizes the returns and the predicted (old) values before calculating the loss. However, the (new) value output by the model is not normalized (due to the unnorm function). So why does it work, or did I misunderstand something?
Also, if I want to test Isaac Gym with the SAC code, can I do that using RL Games?