DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

SAC model not properly saved #1916

Closed. PabloVD closed this issue 1 month ago

PabloVD commented 2 months ago

🐛 Bug

I'm training a SAC policy in the MuJoCo Humanoid environment for a number of iterations. After training finishes, I save the model so that I can resume training later.

However, when restarting training with the loaded model, the episode reward mean starts again from a low value (as can be seen in the three consecutive runs in the image below), instead of continuing near the values reached at the end of the previous training. Does this indicate that some part of the model was not properly saved?

This behavior did not occur with PPO, where restarting training with a pretrained model showed mean rewards similar to those at the end of the previous training.

[Screenshot (2024-04-30): episode reward mean over three consecutive runs, each restarting from a low value]

To Reproduce

# See more here https://stable-baselines3.readthedocs.io/en/master/guide/examples.html
import gymnasium as gym
from stable_baselines3 import PPO, SAC
import os
import warnings
from stable_baselines3.common.callbacks import CheckpointCallback

warnings.filterwarnings('ignore', category=UserWarning, message='TypedStorage is deprecated')

# Configuration
name_exp = "SAC"
name_model = "models/model_" + name_exp
total_timesteps = 1000000
device = "cuda"

checkpoint_callback = CheckpointCallback(save_freq=50000, save_path='./callbacks/', name_prefix=name_exp)

env = gym.make("Humanoid-v4", render_mode="rgb_array")

# Resume from the previously saved model if it exists, otherwise start a new one
if os.path.exists(name_model + ".zip"):
    print("Loading previous model:", name_model + ".zip")
    model = SAC.load(name_model, env=env, device=device)
else:
    model = SAC("MlpPolicy", env, verbose=1, tensorboard_log="./logs", device=device)
print(model.policy)

model.learn(total_timesteps=total_timesteps, tb_log_name=name_exp, callback=checkpoint_callback)

model.save(name_model)

Relevant log output / Error message

No response


araffin commented 2 months ago

Hello, you should use the RL Zoo and save/load the replay buffer too.

Probably a duplicate of https://github.com/DLR-RM/stable-baselines3/issues/435 and others
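
SAC is off-policy and learns from its replay buffer, so if only the model is saved, resumed training restarts with an empty buffer. For reference, a minimal sketch of what saving the buffer alongside the model could look like with the CheckpointCallback already used in the report; the save_replay_buffer flag is taken from the SB3 docs and may not exist in older versions:

from stable_baselines3.common.callbacks import CheckpointCallback

# Checkpoint the model and (assumed flag) its replay buffer every 50k steps
checkpoint_callback = CheckpointCallback(
    save_freq=50000,
    save_path="./callbacks/",
    name_prefix="SAC",
    save_replay_buffer=True,
)

The saved buffer can later be restored with model.load_replay_buffer() before calling learn() again.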

PabloVD commented 2 months ago

But is this behavior expected in the SAC implementation of SB3? What do you mean by "save/load the replay buffer too"? Thanks!

qgallouedec commented 2 months ago

try this :)

import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Humanoid-v4")

model = SAC("MlpPolicy", env)
model.learn(total_timesteps=100_000)
model.save("my_model")
model.save_replay_buffer("my_buffer.pkl")

model = SAC.load("my_model", env=env)
model.load_replay_buffer("my_buffer.pkl")
model.learn(total_timesteps=10_000)

PabloVD commented 2 months ago

@qgallouedec thanks for your answer! But even when saving and loading the buffer, the mean reward still starts from a low initial value when resuming training. Is that expected?

The orange line is the second run, which loads the model and buffer from the first run (pink line).

[Screenshot (2024-05-03): second run (orange) restarting from a low reward after loading the model and buffer from the first run (pink)]

araffin commented 2 months ago

the mean reward still starts from a low initial value when resuming training

You should probably set the learning_starts (warmup) parameter to zero after loading.
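
For context, while the timestep counter is below learning_starts the off-policy algorithms collect transitions with random actions instead of the learned policy, which also pushes the logged reward down right after a restart. A minimal sketch of this suggestion, reusing the file names from the example above:

import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Humanoid-v4")

model = SAC.load("my_model", env=env)
model.load_replay_buffer("my_buffer.pkl")
# Skip the random-action warmup phase: the loaded buffer already holds enough samples
model.learning_starts = 0
model.learn(total_timesteps=10_000)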

What do you mean by "save/load the replay buffer too"?

And you should learn more about how SAC works (we have good resources linked in our docs).