> if during random exploration it finds the goal, then it will work; otherwise, it will be stuck in a local minimum
But in my case, I can tell from the rollout output that the agent did find the goal: it finishes the task, just not in an optimal way, and that suboptimal-but-working policy is the local minimum I mean.
Here is a training curve using SAC:
It found the goal at about 10M steps, where the reward was about -100. Over the remaining steps the curve kept a weak upward trend, but it never reached the optimal reward I designed.
In case the total number of timesteps was simply too small, I tried larger budgets several times (usually 3e8). The reward curves all look basically the same as below: they converge quickly and get stuck in a local minimum.
I've tried several methods:
1. using a negative step reward plus an additional negative constant to slow down training;
2. increasing the ratio between the step reward (given each step) and the sparse reward (given for finishing the task).
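For concreteness, the two shaping experiments above can be sketched like this (the names `step_reward`, `goal_reward`, and the scaling factor `beta` are my own illustration, not anything from the library):

```python
def shaped_reward(reached_goal, beta=10.0):
    """Combine a per-step penalty with a sparse terminal bonus.

    step_reward is the negative reward paid every step (method 1 adds a
    further negative constant to it); beta scales the sparse goal reward
    relative to the per-step reward, so raising beta is the
    'increase the ratio' experiment from method 2.
    """
    step_reward = -1.0
    goal_reward = beta * 1.0
    return step_reward + (goal_reward if reached_goal else 0.0)
```

With `beta = 10`, an ordinary step yields -1.0 and the goal-reaching step yields 9.0; raising `beta` makes the sparse signal dominate the dense one.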
The first method did slow down training, but seemed to have no effect on the final local minimum. The second method sometimes improved things, but the improvement is random and does not happen on every run.
Details:
- env: my customized Gym env
- gamma: 0.98 (my task usually ends in about 40 steps)
- using a vectorized env (VecEnv)
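As a sanity check on that discount choice (my own back-of-the-envelope calculation, not from the issue): with gamma = 0.98 and episodes of roughly 40 steps, a reward arriving at the end of the episode is still worth about 45% of its undiscounted value when seen from the first step, so the terminal sparse reward remains visible across the whole episode:

```python
gamma = 0.98
horizon = 40  # typical episode length in my task

# Discount weight applied to a reward that arrives `horizon` steps in the future.
terminal_weight = gamma ** horizon
print(f"{terminal_weight:.3f}")  # roughly 0.446
```

A much smaller gamma (say 0.9) would shrink that weight to about 1.5%, which could make the sparse goal reward effectively invisible.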
> by adding additional noise to the actions of the behavior policy
I'll give it a go. The difference, though, is that I can occasionally get out of the local minimum.
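A minimal sketch of that suggestion, assuming a continuous Box action space and plain NumPy (`sigma` and the bounds are placeholders for whatever the actual env uses):

```python
import numpy as np

def noisy_behavior_action(action, sigma=0.1, low=-1.0, high=1.0, rng=None):
    """Add Gaussian exploration noise to the behavior policy's action,
    then clip the result back into the action-space bounds."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.normal(0.0, sigma, size=np.shape(action))
    return np.clip(action + noise, low, high)
```

The behavior policy would call this on the action it is about to execute, while the learned policy itself stays unchanged; the extra noise only broadens data collection.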
There is another option I want to try: adding a quadratic time penalty, since my task requires minimum time. Something like:

```python
# in the reward function
time_penalty = -k * current_step**2  # k > 0, quadratic in the elapsed step count
```
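To see why I expect the quadratic term to bite harder, compare the total penalty accumulated over an episode under the linear and quadratic versions (`k` is just a placeholder coefficient):

```python
def cumulative_penalty(T, k=0.01, quadratic=True):
    """Total time penalty accumulated over an episode of T steps."""
    return sum(-k * (t ** 2 if quadratic else t) for t in range(1, T + 1))

# Doubling the episode length multiplies the quadratic total by roughly 8x
# (it grows like T^3), but the linear total by only roughly 4x (grows like T^2),
# so slow episodes are punished disproportionately.
print(cumulative_penalty(20), cumulative_penalty(40))
print(cumulative_penalty(20, quadratic=False), cumulative_penalty(40, quadratic=False))
```

The hope is that this steeper gradient with respect to episode length pushes the policy away from the slow-but-successful local minimum.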
Do you have any experience getting SAC out of local minima? Thanks.
I also tried PPO; it performed well on the same task.
### ❓ Question

I've read #76, and it's a little different.

### Checklist

- [X] I have checked that there is no similar issue in the repo