DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Training exceeds total_timesteps #1150

Open · timosturm opened this issue 1 year ago

timosturm commented 1 year ago

❓ Question

Consider this setup:

import stable_baselines3
import gym
from stable_baselines3 import DQN, A2C, PPO
#from sb3_contrib import ARS, TRPO

env = gym.make('MountainCar-v0')

seed = 42
verbose = 1
timesteps = 10_000

DQN("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 9600
A2C("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 10_000
PPO("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 10_240
#ARS("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 12_800
#TRPO("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 10_240

The problem is that some of the agents train for (many) more timesteps than specified. The discrepancy depends on the number of timesteps requested; e.g., DQN trains for exactly 100_000 timesteps when that value is specified. In general, DQN often seems to train for fewer steps than requested and PPO for more.
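To make the observation concrete, here is a minimal check of the actual step count using the model's num_timesteps attribute (a sketch assuming PPO's default n_steps=2048 and a single environment):

import gym
from stable_baselines3 import PPO

model = PPO("MlpPolicy", gym.make("MountainCar-v0"), seed=42)
model.learn(total_timesteps=10_000)

# num_timesteps holds the number of environment steps actually taken;
# with the default n_steps=2048 this prints 10240, not 10000
print(model.num_timesteps)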

This behavior is also problematic when using the EvalCallback, because for some algorithms we run more (time-consuming) evaluations than requested, and for DQN we miss the last evaluation(s).
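For context, a minimal sketch of the evaluation setup I mean (the eval_freq value is only illustrative):

import gym
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import EvalCallback

env = gym.make("MountainCar-v0")
eval_env = gym.make("MountainCar-v0")

# evaluate every 2_500 environment steps; the callback fires from inside
# learn(), so if training stops before the next multiple of eval_freq,
# that evaluation never runs
eval_callback = EvalCallback(eval_env, eval_freq=2_500, n_eval_episodes=5)

DQN("MlpPolicy", env, seed=42).learn(total_timesteps=10_000, callback=eval_callback)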

The question was asked before here, but no real solution was provided. Setting reset_num_timesteps=False also does not change anything (and I am not sure what it is supposed to change). I tested this with different gym environments as well, but the problem persists.

What is the reason for this behavior? Can it be changed?


araffin commented 1 year ago

Hello,

Related to https://github.com/DLR-RM/stable-baselines3/issues/1059, probably duplicate of https://github.com/DLR-RM/stable-baselines3/issues/457

It is because of how the algorithms work. In short: learn() only checks the requested budget between rollouts, and on-policy algorithms such as A2C and PPO collect n_steps * n_envs steps per rollout, so the actual total is rounded up to the next multiple of the rollout size (see the n_steps and train_freq parameters in the documentation).
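As a sketch of the arithmetic (assuming PPO's default n_steps=2048 and a single environment):

import math

total_timesteps = 10_000
n_steps = 2048  # PPO default rollout length per environment
n_envs = 1

# learn() only stops between rollouts, so the actual number of steps is
# rounded up to the next multiple of the rollout size
rollout_size = n_steps * n_envs
actual = math.ceil(total_timesteps / rollout_size) * rollout_size
print(actual)  # 10240, matching the PPO number reported above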

> for DQN we miss the last evaluation(s).

This sounds more like a bug; could you provide a minimal example to reproduce that issue?

> Also setting reset_num_timesteps=False does not change anything (and I am not sure what it is supposed to change)

This is for plotting, or for when you don't want to reset the timestep counter when calling learn() multiple times.
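A minimal sketch of that use case (continuing training while keeping the logged timestep count, reusing the setup from the question above):

import gym
from stable_baselines3 import A2C

model = A2C("MlpPolicy", gym.make("MountainCar-v0"), seed=42, verbose=1)

# first call starts the timestep counter at 0
model.learn(total_timesteps=10_000)

# with reset_num_timesteps=False the counter (and the logged x-axis) keeps
# increasing instead of starting again from 0; it does not change how many
# additional steps this call runs
model.learn(total_timesteps=10_000, reset_num_timesteps=False)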