Open · felix-basiliskroko opened 5 days ago
Hey there!
Given that the discrepancy is so large, it does suggest there might be an issue with your custom environment or the way it's being handled during evaluation.
Duplicate of https://github.com/DLR-RM/stable-baselines3/issues/928#issuecomment-1145831061 and others
❓ Question
During training, I wrap my custom Gymnasium environment in the `EvalCallback` wrapper to record the performance of my agent when actions are decided deterministically:
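A minimal sketch of such a setup (the algorithm, environment id, paths, and evaluation frequency below are placeholders, not the original configuration):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import EvalCallback

# Placeholder custom environment id; any registered Gymnasium env works the same way.
train_env = Monitor(gym.make("MyCustomEnv-v0"))
eval_env = Monitor(gym.make("MyCustomEnv-v0"))

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./logs/best_model",
    log_path="./logs/eval",
    eval_freq=10_000,
    n_eval_episodes=10,
    deterministic=True,  # evaluation actions are chosen deterministically
)

model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=1_000_000, callback=eval_callback)
```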
During training, the `eval/mean_reward` converges to approximately -10.0, so I had a look at the `_on_step` method of `EvalCallback` to reproduce these scores and visualise what exactly the agent has learned. I have triple-checked that the model being loaded is the same one saved by `EvalCallback`, that the same `deterministic` and `return_episode_rewards` flags are set, and even that the seed for both environments is the same.
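A minimal sketch of this kind of manual re-evaluation with `evaluate_policy`, which is what `EvalCallback._on_step` calls internally (again, the algorithm, environment id, checkpoint path, and seed are placeholders):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Load the checkpoint written by EvalCallback (placeholder path).
model = PPO.load("./logs/best_model/best_model.zip")

# Re-create the evaluation environment, Monitor-wrapped so episode rewards are reported correctly.
eval_env = Monitor(gym.make("MyCustomEnv-v0"))
eval_env.reset(seed=0)  # placeholder seed, matching the one used during training evaluation

# Same flags that EvalCallback uses when it evaluates the agent.
episode_rewards, episode_lengths = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True,
    return_episode_rewards=True,
)
print("mean episode reward:", sum(episode_rewards) / len(episode_rewards))
```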
But still, the episode rewards I obtain this way are so far off the evaluated `mean_reward` from training that something must be wrong; the gap cannot simply be attributed to stochasticity in the environment and normal deviation from the mean. I have tried everything I could think of and I cannot figure out where this difference comes from. Does that indicate that something in my custom environment is causing the discrepancy, or am I missing a crucial detail?
Checklist