DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

[Question] Custom environment: First episode in evaluation always performs poorly #289

Open RaffaeleGalliera opened 1 year ago

RaffaeleGalliera commented 1 year ago

When running evaluations during training or with enjoy.py, the first episode always performs poorly, while the subsequent episodes are more in line with the expected performance. Initially I thought it was just random behavior, but I notice it every time now. Is there any known problem that might cause this?


Additional context: I am using SAC with TimeLimit and History wrappers and negative rewards. The environment is a bit complicated to explain, as there are some external processes that are spawned at the beginning of each episode and cleaned up when reset is called.
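For context, a minimal sketch of that structure (assuming the Gymnasium API; the class name, spaces, and spawned command are hypothetical placeholders, not details of the actual environment):

```python
import subprocess

import gymnasium as gym
import numpy as np


class ExternalProcessEnv(gym.Env):
    """Hypothetical sketch: one external process per episode, cleaned up on reset."""

    def __init__(self):
        super().__init__()
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
        self.process = None

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Clean up the process left over from the previous episode
        if self.process is not None:
            self.process.terminate()
            self.process.wait()
        # Spawn the external process backing the new episode (placeholder command)
        self.process = subprocess.Popen(["my_simulator", "--headless"])
        return np.zeros(4, dtype=np.float32), {}

    def step(self, action):
        # Placeholder transition: the real env would query the external process
        obs = np.zeros(4, dtype=np.float32)
        reward = -1.0  # rewards are negative, as described above
        return obs, reward, False, False, {}
```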

RaffaeleGalliera commented 1 year ago

I did some further checks, retraining the agent several times while also changing rewards and hyperparameters. The issue actually appears only when using enjoy.py, not when evaluating during training, so my bad: when running evaluations during training everything looks fine. Investigating further, I also noticed that the agent acts very differently from how it did throughout training: the actions differ from those seen during training and achieve significantly lower rewards (on average it achieved ~-120 during training, but with enjoy.py we are lucky to get anything above -150, and the average is well below -150).

The environment is normalized and is set up as an infinite-horizon task (I do use TimeLimit). I always use the same seed for now, and I save checkpoints every 100K steps; I tried all of them out of curiosity, but every policy performs poorly (including the best one). It is worth mentioning that the environment has a certain degree of stochasticity, but I rule that out as the cause, since after 1M steps (5000 episodes) the mean reward is consistent. Could this somehow be linked to the normalization statistics? Any idea what I could be missing, or anything I could try?
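For reference, a minimal sketch of how saved VecNormalize statistics would be reloaded for a manual evaluation with Stable Baselines3 (the paths, model file, and env factory below are placeholders, not from this setup). If the statistics are not loaded, or not frozen, observations at evaluation time are scaled differently than during training, which can produce exactly this kind of gap.

```python
from stable_baselines3 import SAC
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Placeholder paths and env factory
venv = DummyVecEnv([make_my_env])  # make_my_env: hypothetical env constructor
venv = VecNormalize.load("logs/vecnormalize.pkl", venv)
# Freeze the running statistics and report the unnormalized reward at eval time
venv.training = False
venv.norm_reward = False

model = SAC.load("logs/best_model.zip", env=venv)
obs = venv.reset()
done, episode_reward = False, 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = venv.step(action)
    done, episode_reward = dones[0], episode_reward + rewards[0]
print(f"Episode reward: {episode_reward:.1f}")
```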

Also: the task I use for evaluation is different from the one used during training, but all these observations were made directly on the training task, so I can rule out that the poor performance is linked to the eval/testing task being different. Please let me know if any additional detail would be helpful. Thanks!

Below is the mean reward graph during training: [image: mean reward during training]

araffin commented 1 year ago

> Below is the mean reward graph during training:

Probably related to https://github.com/DLR-RM/stable-baselines3/issues/1063; you should try with a stochastic controller at evaluation time, i.e. pass --stochastic to the enjoy script.
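For illustration, a minimal sketch comparing deterministic and stochastic evaluation of a SAC policy (Pendulum-v1 and the Gymnasium API are used only as a stand-in for the actual setup); with the zoo scripts, the stochastic variant corresponds to passing --stochastic to enjoy.py:

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC

# Train a small SAC agent on a toy env, then evaluate both ways
env = gym.make("Pendulum-v1")
model = SAC("MlpPolicy", env, verbose=0).learn(total_timesteps=10_000)

for deterministic in (True, False):
    rewards = []
    for _ in range(10):
        obs, _ = env.reset()
        done, episode_reward = False, 0.0
        while not done:
            # deterministic=False samples actions from the policy distribution
            action, _ = model.predict(obs, deterministic=deterministic)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward
        rewards.append(episode_reward)
    print(f"deterministic={deterministic}: {np.mean(rewards):.1f} +/- {np.std(rewards):.1f}")
```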

> The environment is normalized

You mean you are using VecNormalize?

chengwei-xia commented 1 year ago

Could you please show some details of the evaluation (enjoy.py)? I am facing a very similar issue to yours.