DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Discontinuous reward training curve #1898

Closed: JaimeParker closed this issue 2 months ago

JaimeParker commented 2 months ago

❓ Question

I'm using a vectorized environment (n_envs=100) with the SAC algorithm, and I'm seeing a discontinuous reward training curve.

[Screenshot 2024-04-18 134710]

In particular, from 60M to 80M timesteps there is a huge jump without any fluctuation. Is this normal?

[Screenshot 2024-04-18 135837]

model = SAC("MlpPolicy",
            vec_env,
            verbose=1,
            tensorboard_log="./sac_tensorboard_log",
            buffer_size=int(1e6),
            gamma=0.98)


qgallouedec commented 2 months ago

Which environment are you using? What is the max episode length?

JaimeParker commented 2 months ago

@qgallouedec I'm using a custom Gym env, and the max episode length is 200. My task usually takes about 40 steps to reach a good result, so I set gamma to 0.98.
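
As a rough sanity check on that choice of gamma (just my own back-of-the-envelope numbers, not part of the training script):

    # With gamma = 0.98, the effective discount horizon is roughly 1 / (1 - gamma)
    # steps, and a reward collected 40 steps ahead still carries weight gamma ** 40.
    gamma = 0.98
    effective_horizon = 1.0 / (1.0 - gamma)  # ~50 steps
    weight_at_40_steps = gamma ** 40         # ~0.45
    print(f"effective horizon ~ {effective_horizon:.0f} steps, "
          f"weight of a reward 40 steps ahead ~ {weight_at_40_steps:.2f}")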

Here is part of the training script:

    from stable_baselines3 import PPO, SAC
    from stable_baselines3.common.env_util import make_vec_env
    from sb3_contrib import RecurrentPPO  # only needed for the RecurrentPPO branch

    # QuadrotorStochasticEnv and get_model_path come from my own project modules.

    # Simple or Stochastic
    mode = "Stochastic"
    # PPO, RecurrentPPO or SAC
    RL_algorithm = "SAC"

    vec_env = make_vec_env(QuadrotorStochasticEnv, n_envs=100)
    vec_env.env_method("set_max_episode_length", 200)
    vec_env.env_method("set_mode", "train")

    if RL_algorithm == "PPO":
        file_name = ""
        abs_zip_path, model_name = get_model_path(filename=file_name)
        model = PPO.load(abs_zip_path, env=vec_env, verbose=1, tensorboard_log="./ppo_tensorboard_log")
    elif RL_algorithm == "SAC":
        model = SAC("MlpPolicy",
                    vec_env,
                    verbose=1,
                    tensorboard_log="./sac_tensorboard_log",
                    buffer_size=int(1e6),
                    gamma=0.98)
    else:
        model = RecurrentPPO("MlpLstmPolicy", vec_env, verbose=1)

    model.learn(total_timesteps=int(3e8), progress_bar=True)
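
For context, vec_env.env_method simply forwards the named call to every sub-environment, so set_max_episode_length and set_mode only need to exist as plain methods on the env. Roughly, the relevant part of QuadrotorStochasticEnv looks like this (a simplified sketch assuming a Gymnasium-style env, not the real implementation; the spaces and shapes here are made up):

    import gymnasium as gym
    import numpy as np

    class QuadrotorStochasticEnv(gym.Env):
        """Sketch of the interface used via env_method (step/reset omitted)."""

        def __init__(self):
            # Hypothetical spaces, just to make the sketch self-contained.
            self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(12,), dtype=np.float32)
            self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
            self.max_episode_length = 200
            self.mode = "train"

        def set_max_episode_length(self, max_len: int) -> None:
            self.max_episode_length = max_len

        def set_mode(self, mode: str) -> None:
            self.mode = mode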

And here is some of the output:

----------------------------------
| rollout/           |           |
|    ep_len_mean     | 44.7      |
|    ep_rew_mean     | -163      |
| time/              |           |
|    episodes        | 6863360   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299998000 |
| train/             |           |
|    actor_loss      | 80.8      |
|    critic_loss     | 130       |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.00772  |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999978   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 44.7      |
|    ep_rew_mean     | -167      |
| time/              |           |
|    episodes        | 6863364   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299998200 |
| train/             |           |
|    actor_loss      | 86        |
|    critic_loss     | 113       |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | 0.302     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999980   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 45.2      |
|    ep_rew_mean     | -164      |
| time/              |           |
|    episodes        | 6863368   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299998400 |
| train/             |           |
|    actor_loss      | 82.7      |
|    critic_loss     | 45.6      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.0346   |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999982   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 45.6      |
|    ep_rew_mean     | -162      |
| time/              |           |
|    episodes        | 6863372   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299998600 |
| train/             |           |
|    actor_loss      | 81.8      |
|    critic_loss     | 62.7      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | 0.766     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999984   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 45.9      |
|    ep_rew_mean     | -153      |
| time/              |           |
|    episodes        | 6863376   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299998800 |
| train/             |           |
|    actor_loss      | 89.2      |
|    critic_loss     | 41.6      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | 0.569     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999986   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 45.8      |
|    ep_rew_mean     | -156      |
| time/              |           |
|    episodes        | 6863380   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999000 |
| train/             |           |
|    actor_loss      | 80.4      |
|    critic_loss     | 73.3      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.194    |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999988   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 45.1      |
|    ep_rew_mean     | -155      |
| time/              |           |
|    episodes        | 6863384   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999200 |
| train/             |           |
|    actor_loss      | 79.2      |
|    critic_loss     | 59.3      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | 0.181     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999990   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 44.7      |
|    ep_rew_mean     | -155      |
| time/              |           |
|    episodes        | 6863388   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999300 |
| train/             |           |
|    actor_loss      | 81.1      |
|    critic_loss     | 48.5      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.04     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999991   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 44.7      |
|    ep_rew_mean     | -155      |
| time/              |           |
|    episodes        | 6863392   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999400 |
| train/             |           |
|    actor_loss      | 77.1      |
|    critic_loss     | 107       |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.659    |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999992   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 43.4      |
|    ep_rew_mean     | -169      |
| time/              |           |
|    episodes        | 6863396   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999500 |
| train/             |           |
|    actor_loss      | 84.7      |
|    critic_loss     | 47.6      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | 0.0928    |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999993   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 43.8      |
|    ep_rew_mean     | -170      |
| time/              |           |
|    episodes        | 6863400   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999600 |
| train/             |           |
|    actor_loss      | 84.2      |
|    critic_loss     | 54.3      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.148    |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999994   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 42        |
|    ep_rew_mean     | -170      |
| time/              |           |
|    episodes        | 6863404   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999800 |
| train/             |           |
|    actor_loss      | 83.1      |
|    critic_loss     | 83.4      |
|    ent_coef        | 0.129     |
|    ent_coef_loss   | 0.247     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999996   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 41        |
|    ep_rew_mean     | -172      |
| time/              |           |
|    episodes        | 6863408   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 300000000 |
| train/             |           |
|    actor_loss      | 86.5      |
|    critic_loss     | 136       |
|    ent_coef        | 0.129     |
|    ent_coef_loss   | 0.482     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999998   |
----------------------------------
 100% ━━━━━━━━━━━━━━━━━━ 300,000,000/300,00… [ 10:02:21 < 0:00:00 , 8,326 it/s ]
Training process finished.
Training duration: 10:02:22.34
Model name: 2024-04-18_06-49
before saving:  Mean reward per episode: -165.8153335 , std of reward per episode 107.81515792616958

Process finished with exit code 0

qgallouedec commented 2 months ago

Can you share the explanation?

JaimeParker commented 2 months ago

@qgallouedec Sorry about that.

I'm still not certain of the cause, but there are a few possible reasons:

  1. Using a large buffer size. The default buffer size is 1e6, and when using a vec env the per-env buffer size is buffer_size / n_envs. Although increasing the buffer size for a vec env is not necessary (see #1885), a large buffer size might cause this discontinuity in the reward curve (see the rough numbers after this list).
  2. Too many envs. This discontinuity didn't happen when I was using 50 envs, but became frequent with 100 envs.
  3. A very random environment. I was using a custom quadrotor env whose initial position, velocity, attitude, and thrust are heavily randomized. But I don't think the env is the main reason, because this discontinuity only happens occasionally, not consistently.
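
For point 1 above, these are the rough numbers I had in mind (just my own arithmetic, assuming the buffer_size // n_envs split discussed in #1885 and the ~45-step episodes shown in the logs):

    # SB3's replay buffer allocates buffer_size // n_envs transitions per env,
    # so the nominal 1e6 buffer is split across the 100 parallel envs.
    buffer_size = int(1e6)
    n_envs = 100
    mean_ep_len = 45  # approximate rollout/ep_len_mean from the logs above

    per_env_transitions = buffer_size // n_envs      # 10_000 transitions per env
    episodes_in_buffer = buffer_size // mean_ep_len  # ~22_000 episodes in total
    print(f"{per_env_transitions} transitions per env, "
          f"~{episodes_in_buffer} recent episodes kept in the buffer")

So even with a 1e6 buffer, only the most recent ~22k episodes (out of the ~6.9M episodes trained) are kept, which is why I suspect the buffer could interact with the jumps in the curve.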

But this discontinuity seems to have little influence on the outcome, so I decided to leave it for now. Thanks.