DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Discontinuous reward training curve #1898

Closed: JaimeParker closed this issue 2 months ago

JaimeParker commented 2 months ago

❓ Question

I'm using a vectorized environment (n_envs=100) with the SAC algorithm, and I'm seeing a discontinuous reward training curve.

[Screenshot 2024-04-18 134710]

In particular, from 60M to 80M timesteps there is a huge jump without any fluctuation. Is this normal?

[Screenshot 2024-04-18 135837]

model = SAC("MlpPolicy",
            vec_env,
            verbose=1,
            tensorboard_log="./sac_tensorboard_log",
            buffer_size=int(1e6),
            gamma=0.98)


qgallouedec commented 2 months ago

Which environment are you using? What is the max episode length?

JaimeParker commented 2 months ago

@qgallouedec I'm using a custom Gym env, and the max episode length is 200. My task usually takes about 40 steps to reach a good result, so I set gamma to 0.98.
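
As a rough sanity check on that choice of gamma (just my own back-of-the-envelope numbers, not part of the training script):

    # With gamma = 0.98, the effective discount horizon is roughly 1 / (1 - gamma)
    # steps, and a reward collected 40 steps ahead still carries weight gamma ** 40.
    gamma = 0.98
    effective_horizon = 1.0 / (1.0 - gamma)  # ~50 steps
    weight_at_40_steps = gamma ** 40         # ~0.45
    print(f"effective horizon ~ {effective_horizon:.0f} steps, "
          f"weight of a reward 40 steps ahead ~ {weight_at_40_steps:.2f}")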

Here is part of the training script:

    from stable_baselines3 import PPO, SAC
    from stable_baselines3.common.env_util import make_vec_env
    from sb3_contrib import RecurrentPPO  # only needed for the RecurrentPPO branch

    # QuadrotorStochasticEnv and get_model_path come from my own project modules.

    # Simple or Stochastic
    mode = "Stochastic"
    # PPO, RecurrentPPO or SAC
    RL_algorithm = "SAC"

    vec_env = make_vec_env(QuadrotorStochasticEnv, n_envs=100)
    vec_env.env_method("set_max_episode_length", 200)
    vec_env.env_method("set_mode", "train")

    if RL_algorithm == "PPO":
        file_name = ""
        abs_zip_path, model_name = get_model_path(filename=file_name)
        model = PPO.load(abs_zip_path, env=vec_env, verbose=1, tensorboard_log="./ppo_tensorboard_log")
    elif RL_algorithm == "SAC":
        model = SAC("MlpPolicy",
                    vec_env,
                    verbose=1,
                    tensorboard_log="./sac_tensorboard_log",
                    buffer_size=int(1e6),
                    gamma=0.98)
    else:
        model = RecurrentPPO("MlpLstmPolicy", vec_env, verbose=1)

    model.learn(total_timesteps=int(3e8), progress_bar=True)
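
For context, vec_env.env_method simply forwards the named call to every sub-environment, so set_max_episode_length and set_mode only need to exist as plain methods on the env. Roughly, the relevant part of QuadrotorStochasticEnv looks like this (a simplified sketch assuming a Gymnasium-style env, not the real implementation; the spaces and shapes here are made up):

    import gymnasium as gym
    import numpy as np

    class QuadrotorStochasticEnv(gym.Env):
        """Sketch of the interface used via env_method (step/reset omitted)."""

        def __init__(self):
            # Hypothetical spaces, just to make the sketch self-contained.
            self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(12,), dtype=np.float32)
            self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
            self.max_episode_length = 200
            self.mode = "train"

        def set_max_episode_length(self, max_len: int) -> None:
            self.max_episode_length = max_len

        def set_mode(self, mode: str) -> None:
            self.mode = mode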

And here is some of the output:

----------------------------------
| rollout/           |           |
|    ep_len_mean     | 44.7      |
|    ep_rew_mean     | -163      |
| time/              |           |
|    episodes        | 6863360   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299998000 |
| train/             |           |
|    actor_loss      | 80.8      |
|    critic_loss     | 130       |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.00772  |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999978   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 44.7      |
|    ep_rew_mean     | -167      |
| time/              |           |
|    episodes        | 6863364   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299998200 |
| train/             |           |
|    actor_loss      | 86        |
|    critic_loss     | 113       |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | 0.302     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999980   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 45.2      |
|    ep_rew_mean     | -164      |
| time/              |           |
|    episodes        | 6863368   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299998400 |
| train/             |           |
|    actor_loss      | 82.7      |
|    critic_loss     | 45.6      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.0346   |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999982   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 45.6      |
|    ep_rew_mean     | -162      |
| time/              |           |
|    episodes        | 6863372   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299998600 |
| train/             |           |
|    actor_loss      | 81.8      |
|    critic_loss     | 62.7      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | 0.766     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999984   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 45.9      |
|    ep_rew_mean     | -153      |
| time/              |           |
|    episodes        | 6863376   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299998800 |
| train/             |           |
|    actor_loss      | 89.2      |
|    critic_loss     | 41.6      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | 0.569     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999986   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 45.8      |
|    ep_rew_mean     | -156      |
| time/              |           |
|    episodes        | 6863380   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999000 |
| train/             |           |
|    actor_loss      | 80.4      |
|    critic_loss     | 73.3      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.194    |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999988   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 45.1      |
|    ep_rew_mean     | -155      |
| time/              |           |
|    episodes        | 6863384   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999200 |
| train/             |           |
|    actor_loss      | 79.2      |
|    critic_loss     | 59.3      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | 0.181     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999990   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 44.7      |
|    ep_rew_mean     | -155      |
| time/              |           |
|    episodes        | 6863388   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999300 |
| train/             |           |
|    actor_loss      | 81.1      |
|    critic_loss     | 48.5      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.04     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999991   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 44.7      |
|    ep_rew_mean     | -155      |
| time/              |           |
|    episodes        | 6863392   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999400 |
| train/             |           |
|    actor_loss      | 77.1      |
|    critic_loss     | 107       |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.659    |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999992   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 43.4      |
|    ep_rew_mean     | -169      |
| time/              |           |
|    episodes        | 6863396   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999500 |
| train/             |           |
|    actor_loss      | 84.7      |
|    critic_loss     | 47.6      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | 0.0928    |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999993   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 43.8      |
|    ep_rew_mean     | -170      |
| time/              |           |
|    episodes        | 6863400   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999600 |
| train/             |           |
|    actor_loss      | 84.2      |
|    critic_loss     | 54.3      |
|    ent_coef        | 0.128     |
|    ent_coef_loss   | -0.148    |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999994   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 42        |
|    ep_rew_mean     | -170      |
| time/              |           |
|    episodes        | 6863404   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 299999800 |
| train/             |           |
|    actor_loss      | 83.1      |
|    critic_loss     | 83.4      |
|    ent_coef        | 0.129     |
|    ent_coef_loss   | 0.247     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999996   |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 41        |
|    ep_rew_mean     | -172      |
| time/              |           |
|    episodes        | 6863408   |
|    fps             | 8300      |
|    time_elapsed    | 36141     |
|    total_timesteps | 300000000 |
| train/             |           |
|    actor_loss      | 86.5      |
|    critic_loss     | 136       |
|    ent_coef        | 0.129     |
|    ent_coef_loss   | 0.482     |
|    learning_rate   | 0.0003    |
|    n_updates       | 2999998   |
----------------------------------
 100% ━━━━━━━━━━━━━━━━━━ 300,000,000/300,00… [ 10:02:21 < 0:00:00 , 8,326 it/s ]
Training process finished.
Training duration: 10:02:22.34
Model name: 2024-04-18_06-49
before saving:  Mean reward per episode: -165.8153335 , std of reward per episode 107.81515792616958

Process finished with exit code 0

qgallouedec commented 2 months ago

Can you share the explanation?

JaimeParker commented 2 months ago

@qgallouedec Sorry about that.

I'm still not certain of the cause, but there are a few possible reasons:

  1. Using a large buffer size. The default buffer size is 1e6, and when using a vec env the per-env buffer size is buffer_size / n_envs. Although increasing the buffer size for a vec env is not necessary (see #1885), a large buffer size might cause this discontinuity in the reward curve (see the rough numbers after this list).
  2. Too many envs. This discontinuity didn't happen when I was using 50 envs, but became frequent with 100 envs.
  3. A very random environment. I was using a custom quadrotor env whose initial position, velocity, attitude, and thrust are heavily randomized. But I don't think the env is the main reason, because this discontinuity only happens occasionally, not consistently.
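
For point 1 above, these are the rough numbers I had in mind (just my own arithmetic, assuming the buffer_size // n_envs split discussed in #1885 and the ~45-step episodes shown in the logs):

    # SB3's replay buffer allocates buffer_size // n_envs transitions per env,
    # so the nominal 1e6 buffer is split across the 100 parallel envs.
    buffer_size = int(1e6)
    n_envs = 100
    mean_ep_len = 45  # approximate rollout/ep_len_mean from the logs above

    per_env_transitions = buffer_size // n_envs      # 10_000 transitions per env
    episodes_in_buffer = buffer_size // mean_ep_len  # ~22_000 episodes in total
    print(f"{per_env_transitions} transitions per env, "
          f"~{episodes_in_buffer} recent episodes kept in the buffer")

So even with a 1e6 buffer, only the most recent ~22k episodes (out of the ~6.9M episodes trained) are kept, which is why I suspect the buffer could interact with the jumps in the curve.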

But this discontinuity seems to have little influence on the outcome, so I decided to leave it for now. Thanks.