DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Bug]: The episode length in rollout logs is greater than the set horizon #1684

Closed AdityaPradhan12 closed 11 months ago

AdityaPradhan12 commented 11 months ago

🐛 Bug

I have noticed that whenever an evaluation run is executed, the mean episode length in the subsequent training log becomes greater than my set episode horizon. So I suspect the agent does not continue the unfinished training rollout (the one that was interrupted by the evaluation); instead, it resets and runs a new rollout episode, but it still counts the timesteps from the unfinished episode as part of the new rollout episode.

In the logs attached below, the horizon is set to 15, yet rollout/ep_len_mean rises from 15 to 15.4 after the evaluation.
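For context, here is a minimal, plain-Python sketch of the suspected accounting behavior (this is an illustration, not SB3 internals): if the step counter of the episode interrupted by evaluation is carried over instead of reset, its steps get credited to the next logged episode. With the numbers from the logs below (horizon 15, evaluation triggered 5 steps into episode 14), this reproduces the observed jump in the mean.

```python
HORIZON = 15

def logged_episode_length(counter_start=0):
    """Run one episode of HORIZON env steps; return the length that
    would be logged if the step counter started at counter_start."""
    steps = counter_start
    for _ in range(HORIZON):
        steps += 1
    return steps

# Normal episode: counter starts at 0, logged length equals the horizon.
assert logged_episode_length() == 15

# Suppose evaluation interrupts an episode after 5 steps. If the counter
# is not reset, the next episode is logged as 5 + 15 = 20 steps.
lengths = [logged_episode_length() for _ in range(13)]
lengths.append(logged_episode_length(counter_start=5))

mean_len = sum(lengths) / len(lengths)  # 215 / 14 ~= 15.36
print(round(mean_len, 1))
```

Thirteen clean episodes of 15 steps plus one inflated episode of 20 steps give a mean of about 15.4, matching the rollout/ep_len_mean change seen after the evaluation at 200 total timesteps.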

Relevant log output / Error message

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 15       |
|    ep_rew_mean     | -15.5    |
| time/              |          |
|    episodes        | 9        |
|    fps             | 12       |
|    time_elapsed    | 10       |
|    total_timesteps | 135      |
| train/             |          |
|    actor_loss      | -19.1    |
|    critic_loss     | 8.99     |
|    ent_coef        | 0.961    |
|    ent_coef_loss   | -0.998   |
|    learning_rate   | 0.0003   |
|    n_updates       | 134      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 15       |
|    ep_rew_mean     | -16.1    |
| time/              |          |
|    episodes        | 10       |
|    fps             | 12       |
|    time_elapsed    | 12       |
|    total_timesteps | 150      |
| train/             |          |
|    actor_loss      | -19.9    |
|    critic_loss     | 7.59     |
|    ent_coef        | 0.957    |
|    ent_coef_loss   | -1.1     |
|    learning_rate   | 0.0003   |
|    n_updates       | 149      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 15       |
|    ep_rew_mean     | -16.2    |
| time/              |          |
|    episodes        | 11       |
|    fps             | 12       |
|    time_elapsed    | 13       |
|    total_timesteps | 165      |
| train/             |          |
|    actor_loss      | -19.8    |
|    critic_loss     | 5.95     |
|    ent_coef        | 0.952    |
|    ent_coef_loss   | -1.22    |
|    learning_rate   | 0.0003   |
|    n_updates       | 164      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 15       |
|    ep_rew_mean     | -16      |
| time/              |          |
|    episodes        | 12       |
|    fps             | 12       |
|    time_elapsed    | 14       |
|    total_timesteps | 180      |
| train/             |          |
|    actor_loss      | -19.6    |
|    critic_loss     | 8.68     |
|    ent_coef        | 0.948    |
|    ent_coef_loss   | -1.31    |
|    learning_rate   | 0.0003   |
|    n_updates       | 179      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 15       |
|    ep_rew_mean     | -15.4    |
| time/              |          |
|    episodes        | 13       |
|    fps             | 12       |
|    time_elapsed    | 15       |
|    total_timesteps | 195      |
| train/             |          |
|    actor_loss      | -19.9    |
|    critic_loss     | 1.77     |
|    ent_coef        | 0.944    |
|    ent_coef_loss   | -1.46    |
|    learning_rate   | 0.0003   |
|    n_updates       | 194      |
---------------------------------
/media/aditya/OS/Users/Aditya/Documents/Uni_Studies/Thesis/master_thesis/1_8/robosuite/stable-baselines3/Callbacks/test.py:121: UserWarning: Evaluation environment is not wrapped with a ``Monitor`` wrapper. This may result in reporting modified episode lengths and rewards, if other wrappers happen to modify these. Consider wrapping environment first with ``Monitor`` wrapper.
  warnings.warn(
Eval num_timesteps=200, episode_reward=-21.71 +/- 0.00
Episode length: 15.00 +/- 0.00
---------------------------------
| eval/              |          |
|    %traversed      | 1/8      |
|    mean_ep_force   | 0.0639   |
|    mean_ep_length  | 15       |
|    mean_ep_x_dev   | 0.0112   |
|    mean_reward     | -21.7    |
|    std_ep_force    | 0.125    |
|    std_ep_x_dev    | 6.54e-05 |
| time/              |          |
|    total_timesteps | 200      |
| train/             |          |
|    actor_loss      | -20.2    |
|    critic_loss     | 1.74     |
|    ent_coef        | 0.942    |
|    ent_coef_loss   | -1.49    |
|    learning_rate   | 0.0003   |
|    n_updates       | 199      |
---------------------------------
New best mean reward!
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 15.4     |
|    ep_rew_mean     | -15.9    |
| time/              |          |
|    episodes        | 14       |
|    fps             | 10       |
|    time_elapsed    | 20       |
|    total_timesteps | 215      |
| train/             |          |
|    actor_loss      | -20.5    |
|    critic_loss     | 20.5     |
|    ent_coef        | 0.938    |
|    ent_coef_loss   | -1.58    |
|    learning_rate   | 0.0003   |
|    n_updates       | 214      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 15.3     |
|    ep_rew_mean     | -16.1    |
| time/              |          |
|    episodes        | 15       |
|    fps             | 10       |
|    time_elapsed    | 21       |
|    total_timesteps | 230      |
| train/             |          |
|    actor_loss      | -20.5    |
|    critic_loss     | 15.8     |
|    ent_coef        | 0.934    |
|    ent_coef_loss   | -1.66    |
|    learning_rate   | 0.0003   |
|    n_updates       | 229      |
---------------------------------

System Info

No response

Checklist

araffin commented 11 months ago

> I have provided a minimal and working example to reproduce the bug

Closing for the above reason.