DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Cannot reproduce results of "EvalCallback" gathered during training. #2036

Open felix-basiliskroko opened 5 days ago

felix-basiliskroko commented 5 days ago

❓ Question

During training I pass my custom Gymnasium environment to an EvalCallback to record the performance of my agent when actions are chosen deterministically:

from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env

eval_env = make_vec_env(env_id=env_id, seed=42)
eval_callback = EvalCallback(eval_env, best_model_save_path=f"./{check_root_dir}/{run}/{mod}",
                             log_path=f"./{check_root_dir}/{run}/{mod}", eval_freq=20_000,
                             deterministic=True, render=False, n_eval_episodes=10)

...

model.learn(total_timesteps=2_000_000, callback=eval_callback)

During training, the eval/mean_reward converges to approximately -10.0, so I had a look at the _on_step method of EvalCallback to reproduce these scores and visualise what exactly the agent has learned:

import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

vec_env = make_vec_env(env_id=env_id, n_envs=1, seed=42)
model = PPO("MultiInputPolicy", env=vec_env)
model.load(model_path, deterministic=deterministic)
episode_rewards, _ = evaluate_policy(model, vec_env, n_eval_episodes=10, render=False, deterministic=True, return_episode_rewards=True)
mean_reward = np.mean(episode_rewards)

I have triple-checked that the model being loaded is the same one saved by EvalCallback, that the same deterministic and return_episode_rewards flags are set, and even that the seed for both environments is the same. But still:

print(mean_reward) -> -500.0

This is so far off the mean_reward evaluated during training that something must be wrong; the gap cannot simply be attributed to stochasticity in the environment or normal deviation from the mean.

I have tried everything I could think of and I can't figure out where this difference comes from. Does this indicate that something in my custom environment is causing the discrepancy, or am I missing a crucial detail?

amabilee commented 5 days ago

Hey there!

Given that the discrepancy is so large, it does suggest there might be an issue with your custom environment or the way it's being handled during evaluation. A few things to check:

  1. Ensure that the environment is being reset correctly before each evaluation episode. Any residual state from previous episodes could affect the evaluation.
  2. Verify that the action and observation spaces are identical between the training and evaluation environments. Any differences could lead to unexpected behavior.
  3. Double-check the reward calculation logic in your custom environment. Ensure that it's consistent and correctly implemented in both training and evaluation modes.
  4. Make sure that any randomness in your environment (e.g., initial states, stochastic transitions) is controlled or eliminated during evaluation to ensure deterministic behavior.
  5. If you're using any wrappers in your evaluation environment, ensure they are identical to those used during training. Even subtle differences can lead to significant discrepancies; see the sketch after this list.
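
For reference, here is a minimal sketch of how the model saved by EvalCallback could be re-evaluated outside of training (the paths reuse the placeholders from your snippet; EvalCallback saves the best model as best_model.zip inside best_model_save_path). One detail worth double-checking: PPO.load is a classmethod that returns a new model, so it has to be called as PPO.load(...) rather than on an already constructed instance, otherwise the instance you evaluate keeps its freshly initialised weights.

import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Build the evaluation env the same way as during training
# (make_vec_env wraps each sub-env in a Monitor by default).
eval_env = make_vec_env(env_id=env_id, n_envs=1, seed=42)

# PPO.load is a classmethod: it returns a new, fully loaded model.
model = PPO.load(f"./{check_root_dir}/{run}/{mod}/best_model.zip", env=eval_env)

episode_rewards, _ = evaluate_policy(model, eval_env, n_eval_episodes=10,
                                     deterministic=True, return_episode_rewards=True)
print(np.mean(episode_rewards))  # should be close to the eval/mean_reward logged during training

If the numbers still diverge with a setup like this, the difference is most likely coming from the environment itself (e.g. state that persists across resets, or reward logic that behaves differently outside of training).
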
araffin commented 5 days ago

Duplicate of https://github.com/DLR-RM/stable-baselines3/issues/928#issuecomment-1145831061 and others