JaimeParker closed this issue 2 months ago.
What environment do you use? What is the max len?
@qgallouedec I'm using a customized gym env, and the max length of each episode is 200. My task usually takes about 40 steps to reach a good result, so I changed gamma to 0.98.
Here is part of the training script:
from stable_baselines3 import PPO, SAC
from stable_baselines3.common.env_util import make_vec_env
from sb3_contrib import RecurrentPPO

# QuadrotorStochasticEnv and get_model_path come from my custom code (not shown here)
# Simple or Stochastic
mode = "Stochastic"
# PPO, RecurrentPPO or SAC
RL_algorithm = "SAC"

vec_env = make_vec_env(QuadrotorStochasticEnv, n_envs=100)
vec_env.env_method("set_max_episode_length", 200)
vec_env.env_method("set_mode", "train")

if RL_algorithm == "PPO":
    file_name = ""
    abs_zip_path, model_name = get_model_path(filename=file_name)
    model = PPO.load(abs_zip_path, env=vec_env, verbose=1,
                     tensorboard_log="./ppo_tensorboard_log")
elif RL_algorithm == "SAC":
    model = SAC("MlpPolicy",
                vec_env,
                verbose=1,
                tensorboard_log="./sac_tensorboard_log",
                buffer_size=int(1e6),
                gamma=0.98)
else:
    model = RecurrentPPO("MlpLstmPolicy", vec_env, verbose=1)

model.learn(total_timesteps=int(3e8), progress_bar=True)
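As a quick back-of-the-envelope check on the gamma choice (my own numbers, not part of the original script): with a roughly 40-step task horizon, gamma = 0.98 still gives noticeable weight to rewards near the end of a typical episode.

# Discount weight of a reward 40 steps in the future
gamma = 0.98
print(gamma ** 40)  # ~0.446
print(0.99 ** 40)   # ~0.669, for comparison with SB3's default gamma = 0.99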
And here is some of the output info:
----------------------------------
| rollout/ | |
| ep_len_mean | 44.7 |
| ep_rew_mean | -163 |
| time/ | |
| episodes | 6863360 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299998000 |
| train/ | |
| actor_loss | 80.8 |
| critic_loss | 130 |
| ent_coef | 0.128 |
| ent_coef_loss | -0.00772 |
| learning_rate | 0.0003 |
| n_updates | 2999978 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 44.7 |
| ep_rew_mean | -167 |
| time/ | |
| episodes | 6863364 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299998200 |
| train/ | |
| actor_loss | 86 |
| critic_loss | 113 |
| ent_coef | 0.128 |
| ent_coef_loss | 0.302 |
| learning_rate | 0.0003 |
| n_updates | 2999980 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 45.2 |
| ep_rew_mean | -164 |
| time/ | |
| episodes | 6863368 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299998400 |
| train/ | |
| actor_loss | 82.7 |
| critic_loss | 45.6 |
| ent_coef | 0.128 |
| ent_coef_loss | -0.0346 |
| learning_rate | 0.0003 |
| n_updates | 2999982 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 45.6 |
| ep_rew_mean | -162 |
| time/ | |
| episodes | 6863372 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299998600 |
| train/ | |
| actor_loss | 81.8 |
| critic_loss | 62.7 |
| ent_coef | 0.128 |
| ent_coef_loss | 0.766 |
| learning_rate | 0.0003 |
| n_updates | 2999984 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 45.9 |
| ep_rew_mean | -153 |
| time/ | |
| episodes | 6863376 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299998800 |
| train/ | |
| actor_loss | 89.2 |
| critic_loss | 41.6 |
| ent_coef | 0.128 |
| ent_coef_loss | 0.569 |
| learning_rate | 0.0003 |
| n_updates | 2999986 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 45.8 |
| ep_rew_mean | -156 |
| time/ | |
| episodes | 6863380 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299999000 |
| train/ | |
| actor_loss | 80.4 |
| critic_loss | 73.3 |
| ent_coef | 0.128 |
| ent_coef_loss | -0.194 |
| learning_rate | 0.0003 |
| n_updates | 2999988 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 45.1 |
| ep_rew_mean | -155 |
| time/ | |
| episodes | 6863384 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299999200 |
| train/ | |
| actor_loss | 79.2 |
| critic_loss | 59.3 |
| ent_coef | 0.128 |
| ent_coef_loss | 0.181 |
| learning_rate | 0.0003 |
| n_updates | 2999990 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 44.7 |
| ep_rew_mean | -155 |
| time/ | |
| episodes | 6863388 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299999300 |
| train/ | |
| actor_loss | 81.1 |
| critic_loss | 48.5 |
| ent_coef | 0.128 |
| ent_coef_loss | -0.04 |
| learning_rate | 0.0003 |
| n_updates | 2999991 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 44.7 |
| ep_rew_mean | -155 |
| time/ | |
| episodes | 6863392 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299999400 |
| train/ | |
| actor_loss | 77.1 |
| critic_loss | 107 |
| ent_coef | 0.128 |
| ent_coef_loss | -0.659 |
| learning_rate | 0.0003 |
| n_updates | 2999992 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 43.4 |
| ep_rew_mean | -169 |
| time/ | |
| episodes | 6863396 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299999500 |
| train/ | |
| actor_loss | 84.7 |
| critic_loss | 47.6 |
| ent_coef | 0.128 |
| ent_coef_loss | 0.0928 |
| learning_rate | 0.0003 |
| n_updates | 2999993 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 43.8 |
| ep_rew_mean | -170 |
| time/ | |
| episodes | 6863400 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299999600 |
| train/ | |
| actor_loss | 84.2 |
| critic_loss | 54.3 |
| ent_coef | 0.128 |
| ent_coef_loss | -0.148 |
| learning_rate | 0.0003 |
| n_updates | 2999994 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 42 |
| ep_rew_mean | -170 |
| time/ | |
| episodes | 6863404 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 299999800 |
| train/ | |
| actor_loss | 83.1 |
| critic_loss | 83.4 |
| ent_coef | 0.129 |
| ent_coef_loss | 0.247 |
| learning_rate | 0.0003 |
| n_updates | 2999996 |
----------------------------------
----------------------------------
| rollout/ | |
| ep_len_mean | 41 |
| ep_rew_mean | -172 |
| time/ | |
| episodes | 6863408 |
| fps | 8300 |
| time_elapsed | 36141 |
| total_timesteps | 300000000 |
| train/ | |
| actor_loss | 86.5 |
| critic_loss | 136 |
| ent_coef | 0.129 |
| ent_coef_loss | 0.482 |
| learning_rate | 0.0003 |
| n_updates | 2999998 |
----------------------------------
100% ━━━━━━━━━━━━━━━━━━ 300,000,000/300,00… [ 10:02:21 < 0:00:00 , 8,326 it/s ]
Training process finished.
Training duration: 10:02:22.34
Model name: 2024-04-18_06-49
before saving: Mean reward per episode: -165.8153335 , std of reward per episode 107.81515792616958
Process finished with exit code 0
Can you share the explanation?
@qgallouedec sorry about this.
I'm still unsure about the consequences, but there are a few possible reasons:

buffer size / n_envs: although increasing the buffer size for a vec env is not necessary (see #1885), a large buffer size might cause this discontinuity in the reward (see the sketch below).

But this discontinuity seems to have little influence on the outcome, so I decided to leave it for now. Thanks.
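For context, a minimal sketch of what I believe happens inside SB3's replay buffer (based on my reading of stable_baselines3.common.buffers.ReplayBuffer; the exact code may differ between versions): the requested buffer_size is divided across the envs, so each env keeps a much shorter history than the nominal 1e6 transitions.

# Sketch of per-env replay capacity, assuming SB3 divides buffer_size by n_envs
buffer_size = int(1e6)
n_envs = 100
per_env_capacity = max(buffer_size // n_envs, 1)
print(per_env_capacity)  # 10000 transitions, i.e. ~220 episodes at ~45 steps each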
❓ Question
I'm using a vec env (n_envs=100) with the SAC algorithm, and I got a discontinuous reward training curve.
In particular, from 60M to 80M timesteps there was a huge increase without any fluctuation. Is this normal?
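One possible reading of such jumps (my understanding of SB3's logging, not a confirmed diagnosis): ep_rew_mean is a rolling mean over the last stats_window_size completed episodes (100 by default), and with 100 envs finishing episodes at nearly the same time, the whole window can turn over at once. A toy sketch:

from collections import deque

# Toy illustration of SB3-style episode logging, assuming the default
# stats_window_size=100 (a sketch, not SB3's actual implementation)
ep_info_buffer = deque(maxlen=100)  # only the 100 most recent episodes count

def log_episode(ep_reward: float) -> float:
    """Record one finished episode and return the rolling ep_rew_mean."""
    ep_info_buffer.append(ep_reward)
    return sum(ep_info_buffer) / len(ep_info_buffer)

# With n_envs=100 and similar episode lengths, all envs can finish at almost
# the same timestep, replacing the entire window in one burst and making the
# logged curve jump instead of moving smoothly.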