CLeARoboticsLab / ssrl

Learning to Walk from Three Minutes of Real-World Data with Semi-structured Dynamics Models

MBPO Collapse and SSRL Long-Term Stability #6

Closed · hdadong closed this 1 day ago

hdadong commented 1 week ago
[image: training reward curves comparing SSRL and MBPO over three seeds]

I followed the instructions in the README to run SSRL and MBPO (Black-box, 4-step) with three seeds each. I observed that MBPO achieved a similar reward growth rate to SSRL, but eventually collapsed in the later stages of training. I'm aware that some papers mention such collapse phenomena in RL during the later phases of training, but I'm unsure if this is related to the plasticity of the neural network. Could this collapse be attributed to the insufficient plasticity of the ensemble model or the SAC actor-critic?

What do you think are the underlying reasons for MBPO's collapse during training? If the collapse is indeed due to the neural network, does this suggest that SSRL, which uses a similar neural network architecture, might also collapse under certain conditions?

In your paper, you mention, "To enhance plasticity, the model, actor, and critic are reset at 10,000 steps," but I could not find this mechanism in the code. Did you encounter situations where SSRL collapsed that led to the decision to implement this reset mechanism?

jake-levy commented 6 days ago

The reason you are initially seeing similar performance between SSRL and MBPO is that I had the wrong hyperparameters in the config file. Crucially, deterministic_in_env was set to false when it should be true. I fixed this, along with a couple of other hyperparameters that didn't match the paper, in #7; try running it again with the corrected settings.

If deterministic_in_env is set to false, then exploration noise is added to the actions during the "real-world" rollouts. This apparently gives MBPO a performance boost; however, we did not want to use a stochastic policy in the real world because we wanted to avoid erratic actions and actuator damage. You will find that MBPO does not do well under this setting because it has to rely on its model for exploration, and the MBPO model doesn't generalize as well as the SSRL model.
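
To illustrate what this flag toggles, here is a rough sketch (not the actual SSRL code) of deterministic versus stochastic action selection during real-environment rollouts:

```python
import numpy as np

def select_action(mean, std, deterministic_in_env, rng, act_limit=1.0):
    """Sketch: choose the action sent to the real environment.

    mean, std: parameters of the policy's Gaussian over actions.
    deterministic_in_env: if True, send the mean action (no exploration
    noise on hardware); if False, sample a noisy exploratory action.
    """
    if deterministic_in_env:
        action = mean
    else:
        action = mean + std * rng.standard_normal(mean.shape)
    return np.clip(action, -act_limit, act_limit)

rng = np.random.default_rng(0)
mean, std = np.array([0.2, -0.1]), np.array([0.3, 0.3])
print(select_action(mean, std, deterministic_in_env=True, rng=rng))   # mean action
print(select_action(mean, std, deterministic_in_env=False, rng=rng))  # noisy action
```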

We only reset the actor/critic/model for the hardware run (not the sim runs). This was accomplished through the settings in go1.yaml.
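
Conceptually, the reset just re-initializes the network weights at a chosen step count (10,000 real-world steps in the paper) while keeping the collected data; a minimal sketch of the idea, not the actual go1.yaml mechanism:

```python
import numpy as np

RESET_AT_STEP = 10_000  # the paper resets model, actor, and critic at 10,000 steps

def init_params(rng, shapes):
    """Freshly initialize weight matrices (stand-ins for the networks)."""
    return [0.01 * rng.standard_normal(shape) for shape in shapes]

def maybe_reset(step, params, rng, shapes):
    """Re-initialize the networks at the reset step to restore plasticity.

    The collected real-world data (replay buffer) is kept; only the network
    weights are discarded and re-learned from that data.
    """
    if step == RESET_AT_STEP:
        return init_params(rng, shapes)
    return params

rng = np.random.default_rng(0)
shapes = [(4, 4), (4, 2)]                          # toy layer shapes
params = init_params(rng, shapes)
params = maybe_reset(10_000, params, rng, shapes)  # weights re-initialized here
```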

hdadong commented 6 days ago

Thanks for the correction; that's very helpful for understanding the difference between MBPO and SSRL. I will run it again! I'm curious why you reset the actor/critic/model for the hardware run: did you encounter situations where SSRL collapsed, or did you find that it improves performance?

hdadong commented 5 days ago

[image: training reward curves for MBPO and SSRL with the corrected settings]

I followed your instructions by setting deterministic_in_env to true and clear_model_buffer_after_model_train to false. To observe the effects over longer training steps and reduce the training time, I also set ssrl_model.ensemble_size and ssrl_model.num_elites to 1, and ran MBPO and SSRL with two seeds. I observed a significant difference between MBPO and SSRL, which is quite impressive. However, I also noticed that SSRL became unstable and eventually crashed. This kind of crash could lead to serious consequences in real-world applications.

I wonder if you attribute this instability to the plasticity of the neural network, or if it could be due to the lack of randomness in exploration, causing certain states to remain unexplored and leading to crashes during testing.

In a previous experiment where I set deterministic_in_env to false and clear_model_buffer_after_model_train to true, while keeping ssrl_model.ensemble_size and ssrl_model.num_elites at 1, SSRL remained stable over a longer training period:

[image: training reward curve for SSRL with deterministic_in_env=false, stable over a longer training period]

Given this, I wonder if clearing the buffer and increasing randomness during policy exploration may contribute to maintaining stability during testing. I would appreciate your insight on whether the neural network reset mechanism could help alleviate instability to some extent when these two mechanisms are disabled during real-world training.
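
For reference, my understanding of what clear_model_buffer_after_model_train toggles in an MBPO-style loop (a rough sketch under that assumption, not the actual code): when true, the buffer of model-generated transitions is emptied each time the dynamics model is retrained, so the policy only ever trains on rollouts from the newest model.

```python
def training_iteration(fit_model, rollout_model, update_policy,
                       env_buffer, model_buffer,
                       clear_model_buffer_after_model_train=True):
    """Sketch of one outer iteration of an MBPO-style loop (illustrative only).

    env_buffer:   list of real-world transitions
    model_buffer: list of model-generated (synthetic) transitions
    """
    model = fit_model(env_buffer)                 # retrain dynamics model on real data
    if clear_model_buffer_after_model_train:
        model_buffer.clear()                      # discard rollouts from the stale model
    model_buffer.extend(rollout_model(model, env_buffer))  # fresh synthetic rollouts
    update_policy(model_buffer)                   # e.g. SAC updates on synthetic data
    return model
```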

jake-levy commented 3 days ago

Interesting find that you were able to get better MBPO performance than in the paper with ssrl_model.num_elites=1. What did you set the ensemble size to? The original MBPO implementation used an ensemble of 7 models with 5 elites, so we just rolled with that.
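
For anyone following along, the elite mechanism in MBPO/PETS-style implementations just keeps the num_elites ensemble members with the lowest validation loss for model rollouts; a minimal sketch (not the SSRL code):

```python
import numpy as np

def select_elites(validation_losses, num_elites):
    """Return indices of the ensemble members with the lowest validation loss.

    With ensemble_size=1 and num_elites=1 this is a no-op: the single model
    is always the elite, so no model selection or averaging takes place.
    """
    order = np.argsort(validation_losses)
    return order[:num_elites]

losses = np.array([0.31, 0.18, 0.55, 0.22, 0.40, 0.27, 0.19])  # ensemble of 7
print(select_elites(losses, num_elites=5))  # keep the 5 best, as in the original MBPO
```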

In our real-world testing, we encountered situations where training collapsed (similar to what you're seeing in sim) and we found that resetting helped.

I think there's a chance that Q-function overestimation might be causing this issue. One indirect way to check could be to look at the sac/critic_loss metric. A more direct (but not currently implemented) way would be to export the actual Q-values from the SAC losses. If overestimation is the cause, then I believe resetting should help. Lowering ssrl.sac_tau may also help stabilize training -- this would be the quickest thing to try; I have seen that lowering it slows training down but also makes it more stable.
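
To make the sac_tau suggestion concrete: in SAC, tau is the Polyak coefficient in the soft target-critic update, so a smaller tau means the target Q-network tracks the online critic more slowly, which tends to damp overestimation feedback at the cost of slower learning. A minimal sketch of the standard update (generic SAC, not SSRL-specific code):

```python
def soft_update(target_params, online_params, tau):
    """Polyak-average the target critic toward the online critic.

    target <- tau * online + (1 - tau) * target
    Smaller tau => slower-moving target => typically more stable but
    slower training, as noted above.
    """
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online_params, target_params)]

online = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(3):
    target = soft_update(target, online, tau=0.005)  # e.g. a lowered sac_tau
print(target)  # target weights creep slowly toward the online weights
```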

As for exploration, there are some settings to control this; the sketch below gives the general idea.
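
A rough illustration of the kind of exploration knobs involved. Apart from deterministic_in_env, which is discussed above, the names here are hypothetical stand-ins, not the actual SSRL config keys:

```python
import numpy as np

# Illustrative exploration settings only; the names below (except
# deterministic_in_env) are hypothetical stand-ins for the real config keys.
exploration_config = {
    "deterministic_in_env": True,   # discussed above: no action noise on hardware
    "action_noise_std": 0.1,        # hypothetical: scale of Gaussian exploration noise
    "noise_clip": 0.3,              # hypothetical: clip noise to limit extreme actions
}

def explore_action(mean_action, cfg, rng):
    """Apply (clipped) exploration noise according to the sketch config above."""
    if cfg["deterministic_in_env"]:
        return mean_action
    noise = np.clip(cfg["action_noise_std"] * rng.standard_normal(mean_action.shape),
                    -cfg["noise_clip"], cfg["noise_clip"])
    return mean_action + noise

rng = np.random.default_rng(0)
print(explore_action(np.zeros(2), exploration_config, rng))
```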

hdadong commented 2 days ago

Thanks for your patience and all your replies. The original MBPO implementation used an ensemble of 7 with 5 elites, but I set both the ensemble size and num_elites to 1.