DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] SaveOnBestTrainingRewardCallback #634

Closed danielstankw closed 2 years ago

danielstankw commented 3 years ago

Important Note: We do not do technical support, nor consulting and don't answer personal questions per email. Please post your question on the RL Discord, Reddit or Stack Overflow in that case.

Question

I want to use the SaveOnBestTrainingRewardCallback given in the Stable Baselines examples, but with a SubprocVecEnv that has more than one env. The callback given in the example is not suitable for running multiple envs simultaneously. Has anyone modified it by any chance and would be willing to share a version that works for multiple envs?

Additional context

...


araffin commented 3 years ago

Hello,

Why are you not using an EvalCallback? (recommended way, see the doc; it is also included in the RL Zoo https://github.com/DLR-RM/rl-baselines3-zoo)

Otherwise, if you cannot have an evaluation env, you can also retrieve the training reward via the logger (the rollout/ep_rew_mean key).
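
For illustration, a minimal sketch of the EvalCallback setup suggested here, using CartPole as a stand-in for a custom env (paths, frequencies, and episode counts are arbitrary example values, not from this thread):

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.monitor import Monitor

train_env = gym.make("CartPole-v1")
# Separate env used only for evaluation; Monitor records episode statistics.
eval_env = Monitor(gym.make("CartPole-v1"))

eval_callback = EvalCallback(eval_env,
                             best_model_save_path="./logs/",  # best model saved as best_model.zip
                             log_path="./logs/",              # eval results written to evaluations.npz
                             eval_freq=5_000,
                             n_eval_episodes=5,
                             deterministic=True)

model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=50_000, callback=eval_callback)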

danielstankw commented 3 years ago

Will try it out, thx

danielstankw commented 2 years ago

I want to reopen this issue.

import torch
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.vec_env import SubprocVecEnv

env = SubprocVecEnv([make_robosuite_env(env_id, env_options, i, seed_val) for i in range(num_proc)])
eval_callback = EvalCallback(env,
                             best_model_save_path=log_dir,
                             log_path=log_dir,
                             eval_freq=3,
                             deterministic=False,
                             render=False)
policy_kwargs = dict(activation_fn=torch.nn.LeakyReLU, net_arch=[32, 32])
model = PPO('MlpPolicy', env, verbose=1, policy_kwargs=policy_kwargs, n_steps=int(n_steps / num_proc),
            tensorboard_log="./learning_log/ppo_tensorboard/", seed=4)

model.learn(total_timesteps=10000, tb_log_name="learning", callback=eval_callback, reset_num_timesteps=True)

I tried the above and got the error: assert eval_env.num_envs == 1, "You must pass only one environment for evaluation"

araffin commented 2 years ago

I tried the above and got the error: assert eval_env.num_envs == 1, "You must pass only one environment for evaluation"

Please upgrade your SB3 version (see the issue template; you need to give your config along with your issue).
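
In case upgrading is not an option, a hedged sketch of a setup that satisfies that assert in older SB3 versions: keep SubprocVecEnv for training, but give EvalCallback its own dedicated single env wrapped in DummyVecEnv. The factory names (make_robosuite_env, env_id, env_options, seed_val, num_proc, log_dir) are taken from the snippet above; the eval_freq value is illustrative:

from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv

train_env = SubprocVecEnv([make_robosuite_env(env_id, env_options, i, seed_val) for i in range(num_proc)])
eval_env = DummyVecEnv([make_robosuite_env(env_id, env_options, 0, seed_val)])  # exactly one env

# eval_freq is counted in calls to the callback (one per vectorized step), so
# with num_proc parallel envs an evaluation runs roughly every
# eval_freq * num_proc timesteps.
eval_callback = EvalCallback(eval_env,
                             best_model_save_path=log_dir,
                             log_path=log_dir,
                             eval_freq=max(10_000 // num_proc, 1),
                             deterministic=True)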

danielstankw commented 2 years ago

Ok, gotcha. Thank you very much. Do you know if the custom callback given in the examples, SaveOnBestTrainingRewardCallback, also supports multiple envs?

araffin commented 2 years ago

also supports multiple envs?

It should work, but please don't use it; it is mainly meant as a demo of what you can do with callbacks. It is better to use CheckpointCallback if you cannot evaluate at the same time as training. I think I will update the doc.
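
For reference, a minimal sketch of the CheckpointCallback alternative (the save path, frequency, and prefix are illustrative; model stands for the PPO instance from the snippet above):

from stable_baselines3.common.callbacks import CheckpointCallback

# save_freq is counted in calls to the callback (one per vectorized step), so
# with n parallel envs a checkpoint is written every save_freq * n timesteps.
checkpoint_callback = CheckpointCallback(save_freq=10_000,
                                         save_path="./checkpoints/",
                                         name_prefix="ppo_model")

model.learn(total_timesteps=1_000_000, callback=checkpoint_callback)

Checkpoints are saved as ./checkpoints/ppo_model_<num_timesteps>_steps.zip and can be reloaded later with PPO.load(...).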

danielstankw commented 2 years ago

Thanks. I would like to use the model that gave me the highest reward, so why do you say it's not good to use SaveOnBestTrainingRewardCallback? CheckpointCallback doesn't give me the functionality I want.

araffin commented 2 years ago

I would like to use model that gave me highest reward

SaveOnBestTrainingRewardCallback only gives you information about a proxy: the mean episodic return of the training agent over the last n training episodes. But the agent changes between episodes, so its true performance at time t can only be known by evaluating it on a separate env for multiple episodes (that is what you can do with EvalCallback, or as a post-processing step with CheckpointCallback).

If you are doing continuous control, the controller that you deploy in the end should be deterministic, whereas the one used for collecting data during training is stochastic.
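
A sketch of that post-processing step: score each saved checkpoint deterministically on a separate env and keep the best one. The glob pattern assumes checkpoints written by CheckpointCallback with the naming shown earlier, and Pendulum is a stand-in for the actual task env:

import glob
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

eval_env = Monitor(gym.make("Pendulum-v1"))  # stand-in for the real env

best_path, best_mean = None, float("-inf")
for path in glob.glob("./checkpoints/ppo_model_*_steps.zip"):
    model = PPO.load(path)
    # deterministic=True evaluates the deterministic controller,
    # not the stochastic one used for data collection.
    mean_reward, _ = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
    if mean_reward > best_mean:
        best_path, best_mean = path, mean_reward

print(f"Best checkpoint: {best_path} (mean reward {best_mean:.1f})")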

danielstankw commented 2 years ago

Thanks a lot for the explanation!