facebookresearch / mbrl-lib

Library for Model Based RL
MIT License

MBPO does not work on HumanoidTruncatedObsEnv or the original Humanoid env [Bug] #138

Closed jity16 closed 2 years ago

jity16 commented 2 years ago

Steps to reproduce

  1. I tried to run MBPO on HumanoidTruncatedObsEnv with the default parameters in this repo, but the final reward is around 180 (it behaves like a random policy and does not work).
  2. I tried to run MBPO on the original Humanoid env (without truncated observations) and it still does not work.

I have tried different seeds, and none of them work.

Observed Results

(screenshot: reward curve stays around 180)

Expected Results

jity16 commented 2 years ago

I think the reason MBPO does not work may be the third-party SAC used in this library. I tried running that SAC on its own (only the SAC algorithm) on the gym___Humanoid-v2 environment, and I found that its critic loss explodes! I suspect this third-party SAC may not be implemented correctly; its results differ from another SAC repo (https://github.com/pranz24/pytorch-soft-actor-critic). (1) This is the result (reward and critic loss) from the other SAC repo: (plots omitted)

(2) And this is the result from the third-party SAC used in this repo: (plots omitted)

Thanks for considering my issue!

Roythuly commented 2 years ago

I have the same problem with MBPO on Humanoid. I found that the third-party SAC implementation used in this repo (https://github.com/denisyarats/pytorch_sac) is not satisfactory on Humanoid: (plot omitted)

I also tried the repo (https://github.com/pranz24/pytorch-soft-actor-critic) and got results similar to @jity16's: (plot omitted). Thus, I think the problem with MBPO on Humanoid may come from the SAC implementation.

luisenp commented 2 years ago

Hi @jity16 and @Roythuly. Thanks a lot for looking into this! This is a known issue, as mentioned in our paper, but I haven't yet tried changing the SAC implementation, so this is a good lead!

I'll try to test MBPO with this as soon as possible. Unfortunately, everyone in my household is currently down with COVID (including me), so I'm taking a few days off to rest and get this out of the way. I hope to be able to work on it in 1-2 weeks.

jity16 commented 2 years ago

I'm very sorry to hear that. Take good care of yourself, and I wish you a speedy recovery~

luisenp commented 2 years ago

Hi @jity16. I started working on this today and already have an implementation wired up with the other SAC, but I haven't tested it yet. You can take a look at #142 if you are curious. It's very likely not working yet, and I won't have time to pick it up again until Friday, but I just wanted to update this issue with the current progress. Thanks!

Roythuly commented 2 years ago

Thank you so much @luisenp, and I wish you and your family a speedy recovery! I would like to try it out. Thanks for your help!

luisenp commented 2 years ago

Hi @jity16 @Roythuly. I updated the new code a bit and fixed some bugs (#142). So far I've tested it with InvertedPendulum-v2 and it works well. I just launched a run with 10 seeds of HalfCheetah-v2 (which used to fail for some seeds) and will report results later. In the meantime, feel free to test again with Humanoid to see if it works well for you. Thanks!

luisenp commented 2 years ago

Results on HalfCheetah-v2 are inconclusive, and I'm still seeing exploding losses on Humanoid. However, I noticed that plain SAC (no model) also fails when using automatic entropy tuning, which to my understanding is the default for MBPO but not for pranz24's repo. So I'll try running both cheetah and humanoid without entropy tuning and see if things improve.
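For reference, automatic entropy tuning in SAC is usually implemented roughly like the sketch below: a temperature alpha is learned so that the policy's entropy tracks a target (commonly -|A|), and turning the feature off just means keeping alpha fixed. This is a generic, hedged sketch; the names (`update_alpha`, `log_pi`) are placeholders, not the code of either SAC implementation.

```python
import torch

# Minimal sketch of SAC's automatic entropy (temperature) tuning.
action_dim = 17                       # e.g. Humanoid's action dimension
target_entropy = -float(action_dim)   # common heuristic: -|A|

log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_pi: torch.Tensor) -> torch.Tensor:
    # log_pi: log-probabilities of actions sampled from the current policy
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().detach()   # current temperature used in the actor/critic losses

# Disabling auto-tuning simply means using a fixed temperature, e.g. alpha = 0.2.
```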

cvoelcker commented 2 years ago

Hi, I did similar experiments before the winter break (I didn't get the time to write a comment here), and I found that one potential culprit is the different layer initialization used in this SAC library (compared to, e.g., stable-baselines). Colleagues have also found that network initialization can have fairly large effects on the performance of SAC, so this might be an additional lead.

The Yarats version uses `if isinstance(m, nn.Linear): nn.init.orthogonal_(m.weight.data)`, while the pranz24 version uses the standard Glorot/Xavier initialization.
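To make the difference concrete, here is a hedged sketch of the two schemes side by side; the layer sizes and bias handling are placeholders for illustration, not verbatim code from either repo.

```python
import torch.nn as nn

def init_orthogonal(m: nn.Module) -> None:
    # Yarats-style: orthogonal weight init for linear layers
    if isinstance(m, nn.Linear):
        nn.init.orthogonal_(m.weight.data)
        if m.bias is not None:
            m.bias.data.fill_(0.0)

def init_xavier(m: nn.Module) -> None:
    # pranz24-style: Xavier/Glorot uniform weight init
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            m.bias.data.fill_(0.0)

# Example: apply one scheme to a critic MLP (sizes are placeholders)
critic = nn.Sequential(nn.Linear(376 + 17, 256), nn.ReLU(), nn.Linear(256, 1))
critic.apply(init_orthogonal)   # or critic.apply(init_xavier)
```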

luisenp commented 2 years ago

Hi @cvoelcker, thanks a lot for the comment. I've also seen model initialization play an important role in model predictions.

For what it's worth, pranz24's SAC works well for me on Humanoid with the right parameter settings, so in principle it should work with MBPO. However, there seems to be some strange interaction with the model predictions that causes exploding losses when it is used inside MBPO.

If time allows, my plan is to experiment today/tomorrow with adding some amount of real data to SAC's buffer in MBPO, and also to compare statistics of the model predictions against the real data.
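In case it helps anyone following along, the idea is roughly the sketch below: draw a small fraction of each SAC batch from the real-environment buffer and the rest from model-generated rollouts. The buffer objects, their `.sample()` method, the dict-of-arrays batch format, and the `real_ratio` knob are hypothetical names, not mbrl-lib's API.

```python
import numpy as np

def sample_mixed_batch(real_buffer, model_buffer, batch_size, real_ratio=0.05):
    # Mix a small amount of real-environment data into each SAC training batch.
    n_real = int(batch_size * real_ratio)
    real_batch = real_buffer.sample(n_real)                  # transitions from the true env
    model_batch = model_buffer.sample(batch_size - n_real)   # model-generated rollouts
    return {
        key: np.concatenate([real_batch[key], model_batch[key]], axis=0)
        for key in real_batch
    }
```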

Also, I'm curious if @jannerm ever experienced similar issues, and if he'd have any pointers here.

luisenp commented 2 years ago

Hi all. As it turns out, I just found a silly bug that was probably the root cause of the issue. My make_env utility was setting the wrong termination function for Humanoid, and the model simulator ended up using Ant's instead. I fixed this and launched a run a few hours ago, and it seems to be learning (R=850 so far, ~130k steps).
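For context, a Humanoid termination function typically looks like the hedged sketch below; the signature and observation indexing are assumptions for illustration, not a copy of the library code.

```python
import torch

def humanoid_termination(act: torch.Tensor, next_obs: torch.Tensor) -> torch.Tensor:
    # Usual Humanoid rule: the episode ends when the torso height leaves [1.0, 2.0].
    # Index 0 of the observation is assumed to be the torso z-coordinate,
    # as in the truncated-obs Humanoid env.
    z = next_obs[:, 0]
    done = (z < 1.0) | (z > 2.0)
    return done[:, None]

# Ant, by contrast, terminates when its torso height leaves roughly [0.2, 1.0],
# so silently swapping the two functions makes model rollouts terminate at the wrong states.
```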

You can check the latest code in #142 if you want to try it out. I'm running experiments with 10 seeds on a few domains, and it's still unclear whether pranz24's library results in better performance, so I'll wait to have a clearer picture before merging that branch into main.

RajGhugare19 commented 2 years ago

Hello everyone! Does anyone know how MBPO (the mbrl-lib implementation, the authors', or any other) performs on the standard Ant-v2 and Humanoid-v2 tasks, that is, without truncated observations? Thank you.

bamos commented 2 years ago

> Does anyone know how MBPO (the mbrl-lib implementation, the authors', or any other) performs on the standard Ant-v2 and Humanoid-v2 tasks, that is, without truncated observations?

I quickly tried SVG (https://arxiv.org/abs/2008.12775) on the versions without truncated observations. The IMU measurements were much more difficult to predict with the learned dynamics, and in that setting it was more likely for the policy to find wrong IMU predictions that gave unreasonably high, incorrect reward/value estimates.

natolambert commented 2 years ago

IMU measurements should NOT be used in standard one-step models. I feel like there is some low-hanging fruit here to make model-based methods work better. We should add some sort of integrator / separate prediction coupling relative to the velocities or positions (see the sketch below)... hmm. Feel free to ping me if anyone wants to investigate this or has clear info on the state spaces of these environments, and I can play with it.

(obviously depends on the noise level, but real world IMUs are bizarre things)
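One way to read the integrator suggestion, as a purely hypothetical sketch (the model interface, observation layout, and `dt` are all assumptions, not mbrl-lib code): let the learned model predict only the smoother velocity part of the state, and recover positions by integration instead of predicting IMU-like quantities directly.

```python
import numpy as np

def rollout_with_integration(model, obs, actions, dt=0.01):
    # Assumes obs = [positions, velocities] split evenly; both names are hypothetical.
    pos, vel = np.split(obs, 2)
    trajectory = [obs.copy()]
    for act in actions:
        vel = model.predict_velocity(np.concatenate([pos, vel]), act)  # learned part
        pos = pos + dt * vel                                           # integrated part
        trajectory.append(np.concatenate([pos, vel]))
    return np.stack(trajectory)
```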

RajGhugare19 commented 2 years ago

> I quickly tried SVG (https://arxiv.org/abs/2008.12775) on the versions without truncated observations. The IMU measurements were much more difficult to predict with the learned dynamics, and in that setting it was more likely for the policy to find wrong IMU predictions that gave unreasonably high, incorrect reward/value estimates.

Thanks for trying this out.

luisenp commented 2 years ago

Update: Here are the results on humanoid (truncated obs) with 10 seeds, after tuning. I also get good results in all 6 environments of Figure 2 in the original paper, but I think I can still tune Ant and Walker a bit more. After that's done, I'll commit the best configs to #142, merge, and close this issue. Thanks everyone!

(plot: Humanoid (truncated obs) learning curves, 10 seeds)

luisenp commented 2 years ago

Fixed by #142 :)

shenao-zhang commented 2 years ago

> I think the reason MBPO does not work may be the third-party SAC used in this library. I tried running that SAC on its own on the gym___Humanoid-v2 environment, and I found that its critic loss explodes! I suspect this third-party SAC may not be implemented correctly; its results differ from another SAC repo (https://github.com/pranz24/pytorch-soft-actor-critic). [...]

Hi @jity16, did you figure out why the critic loss explodes? I'm currently implementing BPTT and facing similar problems. Sorry for commenting on a closed issue.