DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Bug]: DQN resets exploration rate from saved model #1629

Closed AMR-aa1405465 closed 1 year ago

AMR-aa1405465 commented 1 year ago

🐛 Bug

Hello everyone, I appreciate your work. I have a slightly embarrassing problem =|

I have recently run into an issue while trying to train, save, and reload a DQN model in a Gymnasium environment (a single, non-vectorized env). The problem lies in the knowledge transfer from the saved model to the loaded one. During the initial training phase on CartPole-v1, I successfully reached a reward of 200 after 100k timesteps.

However, when I reload the model for further training, I expect it to start with a reward of around 200 and to retain the hyperparameters set during the initial training (e.g., the exploration rate). Unfortunately, this is not what happens.

In my case, learning seems to start from scratch: I only see small rewards, and the exploration rate goes back to 1 instead of the 0.05 it had reached before.

I have tried this same workflow with PPO before and it worked fine; the problem only shows up with DQN.

To fix this issue, I tried the following, without effect (a rough sketch of these attempts is shown below):

  1. Saving/reloading the replay buffer.
  2. Manually setting the exploration rate of the loaded model to 0.05.
  3. Setting the env via model.set_env(), both with a DummyVecEnv and with a plain env.
  4. Using model.set_parameters() instead of DQN.load().
  5. Going through the other knowledge-transfer issues people have reported and applying their fixes where applicable (e.g. https://github.com/DLR-RM/stable-baselines3/issues/29, https://github.com/hill-a/stable-baselines/issues/30, https://github.com/DLR-RM/stable-baselines3/issues/70).
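For reference, here is roughly what attempts 1, 2 and 4 looked like (an untested sketch reusing the file names from the script below; none of it changed the behaviour for me):

import gymnasium as gym

from stable_baselines3 import DQN
from stable_baselines3.common.vec_env import DummyVecEnv

# (4) build a fresh model and load only the saved weights via set_parameters()
env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
model = DQN("MlpPolicy", env, verbose=1)
model.set_parameters("dqn_cartpole")

# (1) reload the replay buffer saved from the first run
model.load_replay_buffer("dqn_cartpole-buffer")

# (2) manually pin the exploration rate to its final value from the first run
model.exploration_rate = 0.05

model.learn(total_timesteps=100000, log_interval=4)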

To Reproduce

import gymnasium as gym

from stable_baselines3 import DQN
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv

env = gym.make("CartPole-v1")

model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200000, log_interval=4)
model.save("dqn_cartpole")
model.save_replay_buffer("dqn_cartpole-buffer")

del model # remove to demonstrate saving and loading
# At this point, note the rewards the agent was achieving at the end of training

model = DQN.load("dqn_cartpole")
k = Monitor(gym.make("CartPole-v1"))
model.set_env(DummyVecEnv([lambda : k]))
model.load_replay_buffer("dqn_cartpole-buffer")
model.learn(total_timesteps=100000, log_interval=4)
# Here, observe that the exploration rate resets to 1 and the rewards drop

Relevant log output / Error message

No response


araffin commented 1 year ago

Hello, you are probably missing the reset_num_timesteps=False parameter (see doc).
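For example, a minimal sketch of the continued-training step with that flag (reusing the file names from the script above):

import gymnasium as gym

from stable_baselines3 import DQN

env = gym.make("CartPole-v1")
model = DQN.load("dqn_cartpole", env=env)
model.load_replay_buffer("dqn_cartpole-buffer")

# reset_num_timesteps=False keeps the internal timestep counter, so the
# exploration schedule (and the logging) continues where the first run stopped
model.learn(total_timesteps=100000, log_interval=4, reset_num_timesteps=False)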

Also related, why changing the exploration rate alone doesn't work (you need to change the schedule): https://github.com/DLR-RM/stable-baselines3/issues/735#issuecomment-1047638011
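If you really want to pin exploration after loading instead of continuing the schedule, a rough sketch (assuming the exploration_schedule attribute of the current DQN implementation) would be:

from stable_baselines3 import DQN

model = DQN.load("dqn_cartpole")

# exploration_rate is recomputed from exploration_schedule at every step, so
# overwriting the rate alone is undone immediately; replace the schedule itself
model.exploration_schedule = lambda progress_remaining: 0.05
model.exploration_rate = 0.05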

Also related: https://github.com/DLR-RM/stable-baselines3/issues/529

araffin commented 1 year ago

Looks like a duplicate of https://github.com/DLR-RM/stable-baselines3/issues/597#issuecomment-937207471. Also related (for calling learn() multiple times): https://github.com/DLR-RM/stable-baselines3/issues/957

AMR-aa1405465 commented 1 year ago

Yup, reset_num_timesteps=False did the trick =D Thanks for the help, mate!