First of all, thank you for the great work you've done! The code in this reproduction is very clear.
Here's my problem.
I added a best_model_save_path parameter to the EvalCallback call in script.py so I can get the best model after training. But when I tried to evaluate that model using evaluate_policy from stable_baselines3.common.evaluation, I got really confused. The reward I get from this evaluation is negative, which is far from the episode_reward in the logs; it's even worse than the first eval result during training. Why is this the case? I looked at the stable-baselines3 docs, and EvalCallback also uses evaluate_policy internally to compute the reward values, so the results should be close.
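For reference, here is roughly the change I made to script.py (the paths and eval_freq below are illustrative, not my exact values):

from stable_baselines3.common.callbacks import EvalCallback

# Sketch of my modification; eval_env is built the same way as in script.py
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./testing/evaluation/model",  # the parameter I added
    log_path="./testing/evaluation/logs",
    eval_freq=10_000,
)
model.learn(total_timesteps=100_000, callback=eval_callback)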
In my test code, I load the env the same way as script.py, and here's my evaluation process:
import stable_baselines3
from stable_baselines3.common.evaluation import evaluate_policy

Agent = getattr(stable_baselines3, args.agent)
model = Agent.load("./testing/evaluation/model/best_model")
print(evaluate_policy(model, env))  # returns (mean_reward, std_reward)
Actually, I discovered this because I was trying to tune the hyperparameters with Optuna: the objective value reported by Optuna is negative, while the episode reward in the logs is positive and pretty large. I'm really confused by this result.
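For context, my Optuna objective looks roughly like this (the hyperparameter range, policy, and timestep counts are illustrative, not my exact setup):

import optuna
import stable_baselines3
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial):
    # Sample one hyperparameter as an example; my real search space is larger
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    Agent = getattr(stable_baselines3, args.agent)
    model = Agent("MlpPolicy", env, learning_rate=learning_rate)
    model.learn(total_timesteps=50_000)
    mean_reward, _ = evaluate_policy(model, env)  # this is the negative value I see
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)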
Thanks again!