Closed: PabloVD closed this issue 1 month ago.
Hello, you should use the RL Zoo and save/load the replay buffer too.
Probably a duplicate of https://github.com/DLR-RM/stable-baselines3/issues/435 and others
But is this behavior expected in the SAC implementation of SB3? What do you mean by "save/load the replay buffer too"? Thanks!
try this :)
```python
import gymnasium as gym
from stable_baselines3 import SAC

# First run: train, then save both the model and the replay buffer.
env = gym.make("Humanoid-v4")
model = SAC("MlpPolicy", env)
model.learn(total_timesteps=100_000)
model.save("my_model")
model.save_replay_buffer("my_buffer.pkl")

# Second run: load both before resuming training.
model = SAC.load("my_model", env=env)
model.load_replay_buffer("my_buffer.pkl")
model.learn(total_timesteps=10_000)
```
@qgallouedec thanks for your answer! But even after saving and loading the buffer, the mean reward starts from the low initial value when resuming training. Is that expected?
The orange line is the second run, loading model and buffer from the first run (pink line).
> the mean reward starts from the low initial value when resuming training.
You should probably set the `learning_starts` (warmup) parameter to zero after loading.
> What do you mean by "save/load the replay buffer too"?
And you should learn more about SAC (we have good resources linked in our documentation).
🐛 Bug
I'm training a SAC policy in MuJoCo's Humanoid environment for some iterations. After training finishes, I save the model to resume training later.
However, when restarting training with the loaded model, the episode reward mean starts from a low value again (as can be seen in the three consecutive runs in the image below), instead of showing values similar to those at the end of the previous training. Does this maybe indicate that some part of the model was not properly saved?
This behavior did not occur with PPO, where restarting training with a pretrained model showed mean rewards similar to those at the end of the previous training.
To Reproduce
Relevant log output / Error message
No response
System Info
Checklist