Open RaffaeleGalliera opened 1 year ago
I did some further checks, retraining the agent several times while also changing rewards and hyperparameters. The issue actually appears only when using enjoy.py, not when evaluating during training, so my bad: evaluations run during training look fine. Investigating further, I also noticed that the agent acts quite differently from how it did throughout training: the actions are different and achieve significantly lower rewards (on average it achieved ~-120 during training, but with enjoy.py we are lucky to get anything above -150, and the average is well below that).
The environment is normalized and set up as an infinite-horizon task (I do use TimeLimit). I always use the same seed for now, and I save checkpoints every 100K steps; out of curiosity I tried all of them, but every policy performs poorly (including the best one). It is worth mentioning that the environment has a certain degree of stochasticity, but I rule that out as the cause, since after 1M steps (5000 episodes) the mean reward is consistent. Could this somehow be linked to the normalization statistics? Any idea what I could be missing, or anything I could try?
Also: the task I use for evaluation is different from the one used during training, but all these considerations were made directly on the training task, so I can exclude that the poor performance is linked to the eval/testing task being different. Please let me know if any additional detail could be helpful! Thanks!
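On the normalization-statistics question: a minimal pure-NumPy sketch (illustrative, not the actual `VecNormalize` implementation) of why evaluating with statistics that differ from the ones accumulated during training changes every observation the policy sees. The class, the distribution, and the numbers are assumptions for illustration.

```python
import numpy as np

class RunningNormalizer:
    """Minimal running mean/std tracker, in the spirit of VecNormalize
    (illustrative sketch only, not the SB3 implementation)."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, batch):
        # Parallel (Chan) combination of batch stats with running stats.
        b_mean, b_var, b_count = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        tot = self.count + b_count
        self.mean = self.mean + delta * b_count / tot
        m_a = self.var * self.count
        m_b = b_var * b_count
        self.var = (m_a + m_b + delta**2 * self.count * b_count / tot) / tot
        self.count = tot

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)

rng = np.random.default_rng(0)
train_stats = RunningNormalizer(shape=(3,))
for _ in range(100):                          # simulate 100 training batches
    train_stats.update(rng.normal(5.0, 2.0, size=(64, 3)))

fresh_stats = RunningNormalizer(shape=(3,))   # stats NOT loaded at eval time

obs = rng.normal(5.0, 2.0, size=(3,))
# Same raw observation, very different inputs to the policy:
print(train_stats.normalize(obs))   # roughly zero-centered, unit scale
print(fresh_stats.normalize(obs))   # essentially the raw values, around 5
```

If enjoy.py does not load the saved statistics (or keeps updating them with `training=True`), the policy sees observations on a different scale than it was trained on, which matches the kind of degradation described above.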
Below the mean reward graph during training:
Probably related to https://github.com/DLR-RM/stable-baselines3/issues/1063. You should try with a stochastic controller at eval time, i.e. pass --stochastic to the enjoy script.
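The --stochastic flag amounts to sampling from the action distribution instead of taking its mode. A toy sketch of that difference, with a hypothetical Gaussian policy head in pure NumPy (not SB3 code):

```python
import numpy as np

rng = np.random.default_rng(42)

def predict(obs, mean_weight, log_std, deterministic):
    """Toy Gaussian policy head: action ~ N(W @ obs, exp(log_std))."""
    mean = mean_weight @ obs
    if deterministic:
        return mean  # mode of the distribution (the default at eval time)
    # Stochastic evaluation: sample around the mean.
    return mean + np.exp(log_std) * rng.normal(size=mean.shape)

obs = np.array([0.5, -1.0])
W = np.array([[1.0, 0.2], [0.0, 1.5]])     # hypothetical learned weights
log_std = np.array([-0.5, -0.5])

det = predict(obs, W, log_std, deterministic=True)
sto = predict(obs, W, log_std, deterministic=False)
print(det)   # always the same for a given obs
print(sto)   # varies around det between calls
```

For policies trained with an entropy term (e.g. SAC), the stochastic behavior is what was actually optimized, so collapsing to the deterministic mean at eval time can perform noticeably worse.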
> The environment is normalized

You mean you are using VecNormalize?
Could you please show some details of the evaluation (enjoy.py)? I am facing a very similar issue to yours.
When running evaluations during training or with enjoy.py, the first episode always performs poorly, while the subsequent episodes are more in line with the expected performance. Initially I thought it was just random behavior, but now I notice it every time. Is there any known problem that might cause this?
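One way to confirm this is not noise is to log per-episode returns and compare the first episode against the mean of the rest. A small sketch with toy numbers (the returns here are made up to mimic the reported pattern):

```python
import statistics

def first_vs_rest(returns):
    """Compare the first episode's return against the mean of the rest."""
    first = returns[0]
    rest_mean = statistics.mean(returns[1:])
    return first, rest_mean, first - rest_mean

# Toy per-episode returns mimicking the reported pattern:
returns = [-300.0, -120.0, -115.0, -125.0, -118.0]
first, rest_mean, gap = first_vs_rest(returns)
print(first, rest_mean, gap)  # → -300.0 -119.5 -180.5
```

A consistently large gap across many independent evaluation runs points at state that is not fully cleared before the first episode (e.g. environment setup, normalization warm-up, or external resources), rather than policy quality.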
System Info
Describe the characteristics of your environment:
Additional context
I am using SAC with TimeLimit and History wrappers and negative rewards. The environment is a bit complicated to explain, as there are some external processes that are spawned at the beginning of each episode and cleaned up when reset is called.
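Given that external processes are spawned per episode, one common source of a poor first episode is a worker whose lifecycle is not fully tied to reset. A minimal sketch of the spawn-on-reset / clean-on-reset pattern; the dummy sleeping worker here is a stand-in for whatever external process the real environment launches:

```python
import subprocess
import sys

class EpisodeProcessManager:
    """Spawn one external worker per episode and make cleanup explicit."""
    def __init__(self):
        self.proc = None

    def reset(self):
        self.cleanup()  # terminate any leftover worker before starting anew
        # Stand-in worker: replace with the real external process command.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"]
        )
        return self.proc.pid

    def cleanup(self):
        if self.proc is not None and self.proc.poll() is None:
            self.proc.terminate()
            self.proc.wait(timeout=5)
        self.proc = None

mgr = EpisodeProcessManager()
pid1 = mgr.reset()   # episode 1: worker running
pid2 = mgr.reset()   # episode 2: previous worker terminated first
mgr.cleanup()        # final teardown
```

Cleaning up at the start of reset (in addition to the end of the episode) guards against a first episode that runs alongside a stale worker from a previous session.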