Closed. Opened by jity16; closed 2 years ago.
I think the reason MBPO does not work may be the third-party SAC used in this lib. I've tried running that third-party SAC (only the SAC algorithm, no model) on the gym Humanoid-v2 environment, and I found that its critic loss explodes! I think this third-party SAC may not be implemented correctly; its results differ from another SAC repo (https://github.com/pranz24/pytorch-soft-actor-critic).
(1) This is the result (reward and critic loss) from the other SAC repo:
(2) And this is the result from the third-party SAC used in this repo:
Thanks for considering my issue!
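For reference, the quantity I'm watching in both runs is the standard SAC critic objective: the clipped double-Q soft Bellman error. A rough sketch of it (the network and policy names here are illustrative, not taken from either repo; `policy.sample` is assumed to return an action and its log-probability):

```python
import torch
import torch.nn.functional as F

def sac_critic_loss(critic1, critic2, target1, target2, policy, batch,
                    gamma=0.99, alpha=0.2):
    """Standard SAC critic loss: clipped double-Q soft Bellman error.

    `batch` is assumed to hold (obs, action, reward, next_obs, done)
    tensors, with reward/done shaped (B, 1).
    """
    obs, action, reward, next_obs, done = batch
    with torch.no_grad():
        # `policy.sample` is assumed to return an action and its log-prob.
        next_action, next_log_pi = policy.sample(next_obs)
        target_q = torch.min(target1(next_obs, next_action),
                             target2(next_obs, next_action))
        # Soft Bellman target: the entropy bonus is folded into the target.
        y = reward + gamma * (1.0 - done) * (target_q - alpha * next_log_pi)
    q1, q2 = critic1(obs, action), critic2(obs, action)
    # This is the value plotted above; it should stay bounded during training.
    return F.mse_loss(q1, y) + F.mse_loss(q2, y)
```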
I have the same problem with MBPO on Humanoid. I found that the third-party SAC implementation in this repo comes from https://github.com/denisyarats/pytorch_sac, and it does not perform satisfactorily on Humanoid, as shown here:
I also tried the repo https://github.com/pranz24/pytorch-soft-actor-critic and got a result consistent with @jity16's. So I think the problem with MBPO on Humanoid may come from the SAC implementation.
Hi @jity16 and @Roythuly. Thanks a lot for looking into this! This is a known issue, as mentioned in our paper, but I haven't yet tried changing the SAC implementation, so this is a good lead!
I'll try to test this with MBPO as soon as possible. Unfortunately, everyone in my household is currently down with COVID (including me), so I'm taking a few days off to rest and get this out of the way. Hopefully I'll be able to work on this in 1-2 weeks.
I'm very sorry to hear that. Take good care of yourself, and I hope you recover soon~
Hi @jity16. I started working on this today and already have an implementation using the other SAC, but I haven't tested it yet. You can take a look at #142 if you are curious. It's very likely not working yet, and I won't have time to pick it up again until Friday, but I just wanted to update this issue on the current progress. Thanks!
Thank you so much @luisenp, and I hope you and your family get better soon! I'd be glad to try it out. Thanks for your help!
Hi @jity16 @Roythuly. I updated the new code a bit and fixed some bugs (#142). So far I've tested it with InvertedPendulum-v2 and it works well. I just launched a run with 10 seeds of HalfCheetah-v2 (which used to fail for some seeds) and will report results later. In the meantime, feel free to test again with Humanoid to see if it works well for you. Thanks!
Results on HalfCheetah-v2 are inconclusive, and I'm still seeing exploding losses on Humanoid. However, I noticed that plain SAC (no model) also fails when using automatic entropy tuning, which, to my understanding, is the default for MBPO but not for pranz24's repo. So I'll try running both Cheetah and Humanoid without entropy tuning and see if things improve.
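For context, automatic entropy tuning optimizes a learnable temperature so that the policy entropy stays near a target (commonly -|A|). A minimal sketch of the alpha update, with placeholder hyperparameters and assuming `log_pi` holds log-probabilities of actions sampled from the current policy:

```python
import torch

action_dim = 17                       # e.g., Humanoid-v2's action dimension
target_entropy = -float(action_dim)   # common heuristic: -|A|

# Learnable temperature, optimized so the policy entropy stays near the target.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_pi: torch.Tensor) -> float:
    """`log_pi`: log-probs of actions sampled from the current policy."""
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()  # plug this alpha into the actor/critic losses
```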
Hi, I did similar experiments before the winter break (didn't get the time to write a comment here), and I found that one of the potential culprits is the different layer initialization in the SAC library used here (compared to, e.g., stable-baselines). Colleagues have also found that NN initialization can have fairly large effects on the performance of SAC, so this might be an additional lead.
The Yarats version uses `if isinstance(m, nn.Linear): nn.init.orthogonal_(m.weight.data)`, while the pranz24 version uses the standard Glorot/Xavier initialization.
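To make the difference concrete, the two schemes correspond roughly to the following (a sketch, not the exact code from either repo; the MLP sizes are placeholders):

```python
import torch.nn as nn

def init_orthogonal(m):
    # Yarats-style: orthogonal weight init, zero bias.
    if isinstance(m, nn.Linear):
        nn.init.orthogonal_(m.weight.data)
        if m.bias is not None:
            m.bias.data.fill_(0.0)

def init_xavier(m):
    # pranz24-style: Glorot/Xavier uniform init, zero bias.
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight, gain=1)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)

# Illustrative critic MLP; input/hidden sizes are placeholders.
critic = nn.Sequential(nn.Linear(393, 256), nn.ReLU(),
                       nn.Linear(256, 256), nn.ReLU(),
                       nn.Linear(256, 1))
critic.apply(init_orthogonal)  # or: critic.apply(init_xavier)
```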
Hi @cvoelcker, thanks a lot for the comment. I've also seen initialization play an important role in model predictions.
For what it's worth, pranz24's SAC works well for me on Humanoid with the right parameter settings, so in principle it should work with MBPO. However, there seems to be some strange interaction with the model predictions that causes exploding losses when it is used inside MBPO.
If time allows, my plan is to play around today/tomorrow with adding some amount of real data to SAC's buffer in MBPO, and also compare stats of model predictions against the real data.
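In case it's useful, the "some amount of real data" idea is usually implemented as a real-data ratio when sampling SAC's training batch. A rough sketch (the ratio value and the buffer `sample` interface are illustrative assumptions, not the library's actual API):

```python
import numpy as np

def sample_mixed_batch(real_buffer, model_buffer, batch_size, real_ratio=0.05):
    """Draw a SAC training batch mixing real env transitions with
    model-generated ones. Both buffers are assumed to expose a
    `sample(n)` method returning a tuple of arrays
    (obs, action, reward, next_obs, done)."""
    n_real = int(batch_size * real_ratio)
    n_model = batch_size - n_real
    real = real_buffer.sample(n_real)
    model = model_buffer.sample(n_model)
    # Concatenate field-by-field so the agent never knows which is which.
    return tuple(np.concatenate([r, m], axis=0) for r, m in zip(real, model))
```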
Also, I'm curious if @jannerm ever experienced similar issues, and if he'd have any pointers here.
Hi all. As it turns out, I just found a silly bug that was probably the root cause of the issue. My make_env utility was setting the wrong termination function for Humanoid, and the model simulator ended up using Ant's instead. I fixed this and launched a run a few hours ago, and it seems to be learning (R=850 so far, ~130k steps).
You can check the latest code in #142 if you want to try it out. I'm running experiments with 10 seeds on a few domains, and it's still unclear whether pranz24's library results in better performance, so I'll wait for a clearer picture before merging that branch into main.
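For anyone hitting a similar issue: the mix-up matters because the two environments use very different health checks. Humanoid terminates when the torso height leaves roughly [1.0, 2.0], while Ant's band is roughly [0.2, 1.0], so under Ant's rule a healthy standing Humanoid state tends to be flagged as done and model rollouts get cut off. A rough sketch of the two rules (the signature follows the library's convention of termination functions acting on batched `next_obs`; the exact observation indexing is an assumption):

```python
import torch

# Termination functions act on batches of predicted next observations.
# Index 0 is assumed to be the torso height (true for the truncated
# observations MBPO uses; double-check against your obs layout).

def humanoid_term_fn(act: torch.Tensor, next_obs: torch.Tensor) -> torch.Tensor:
    z = next_obs[:, 0]
    done = (z < 1.0) | (z > 2.0)  # fell over or flew unrealistically high
    return done[:, None]

def ant_term_fn(act: torch.Tensor, next_obs: torch.Tensor) -> torch.Tensor:
    z = next_obs[:, 0]
    healthy = torch.isfinite(next_obs).all(dim=-1) & (z >= 0.2) & (z <= 1.0)
    return (~healthy)[:, None]
```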
Hello everyone! Does anyone know how MBPO (mbrl-lib or author's or any other implementation) performs on standard Ant-v2 and Humanoid-v2 tasks, without truncated observations that is. Thank you.
I quickly tried SVG (https://arxiv.org/abs/2008.12775) on the versions without truncated observations. The IMU measurements were much more difficult to predict with the learned dynamics, and in that setting the policy was more likely to find wrong IMU predictions that gave unreasonably high (and wrong) reward/value estimates.
IMU measurements should NOT be used in standard one-step models. I feel like there is low-hanging fruit here to make some model-based methods work better: we should add some sort of integrator / separate prediction coupling for quantities like velocities and positions. Feel free to ping me if anyone wants to investigate this or has clear info on the state spaces for these environments, and I can play with it.
(obviously depends on the noise level, but real world IMUs are bizarre things)
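One concrete form of the integrator idea: have the one-step model predict only velocity changes and reconstruct positions by integration, instead of regressing IMU-style signals directly. A rough sketch (the sizes, the position/velocity split, and the timestep are all illustrative assumptions):

```python
import torch
import torch.nn as nn

class IntegratingDynamicsModel(nn.Module):
    """Predicts velocity deltas and integrates them into positions instead of
    regressing IMU-like signals directly. Sizes, the position/velocity split,
    and the timestep are illustrative assumptions."""

    def __init__(self, pos_dim, vel_dim, act_dim, hidden=256, dt=0.01):
        super().__init__()
        self.dt = dt
        self.net = nn.Sequential(
            nn.Linear(pos_dim + vel_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, vel_dim),  # network outputs delta-velocity only
        )

    def forward(self, pos, vel, act):
        dvel = self.net(torch.cat([pos, vel, act], dim=-1))
        next_vel = vel + dvel
        next_pos = pos + self.dt * next_vel  # semi-implicit Euler step
        return next_pos, next_vel
```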
Thanks for trying this out.
Update: here are the results on Humanoid (truncated obs) with 10 seeds, after tuning. I also get good results in all 6 environments of Figure 2 in the original paper, but I think I can still tune Ant and Walker a bit more. After that's done, I'll commit the best configs to #142, merge, and close this issue. Thanks everyone!
Fixed by #142 :)
Hi @jity16, did you figure out why the critic loss explodes? I'm currently implementing BPTT and am facing similar problems. Sorry for commenting on a closed issue.
Steps to reproduce
Run MBPO on Humanoid-v2 as above; I have tried different seeds, and none of them work.