DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Training exceeds total_timesteps #1150

Open · timosturm opened this issue 1 year ago

timosturm commented 1 year ago

❓ Question

Consider this setup:

import stable_baselines3
import gym
from stable_baselines3 import DQN, A2C, PPO
#from sb3_contrib import ARS, TRPO

env = gym.make('MountainCar-v0')

seed = 42
verbose = 1
timesteps = 10_000

DQN("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 9600
A2C("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 10_000
PPO("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 10_240
#ARS("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 12_800
#TRPO("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 10_240

The problem is that some of the agents train for (many) more timesteps than specified. The discrepancy depends on the number of timesteps requested; e.g., DQN trains for exactly 100_000 timesteps when that value is specified. In general, DQN often seems to train for fewer steps than requested and PPO for more.
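To make the observation concrete, here is a minimal check of the actual step count using the model's num_timesteps attribute (a sketch assuming PPO's default n_steps=2048 and a single environment):

import gym
from stable_baselines3 import PPO

model = PPO("MlpPolicy", gym.make("MountainCar-v0"), seed=42)
model.learn(total_timesteps=10_000)

# num_timesteps holds the number of environment steps actually taken;
# with the default n_steps=2048 this prints 10240, not 10000
print(model.num_timesteps)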

This behavior is also problematic when using the EvalCallback, because for some algorithms we run more (time-consuming) evaluations than requested, and for DQN we miss the last evaluation(s).
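For context, a minimal sketch of the evaluation setup I mean (the eval_freq value is only illustrative):

import gym
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import EvalCallback

env = gym.make("MountainCar-v0")
eval_env = gym.make("MountainCar-v0")

# evaluate every 2_500 environment steps; the callback fires from inside
# learn(), so if training stops before the next multiple of eval_freq,
# that evaluation never runs
eval_callback = EvalCallback(eval_env, eval_freq=2_500, n_eval_episodes=5)

DQN("MlpPolicy", env, seed=42).learn(total_timesteps=10_000, callback=eval_callback)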

The question was asked before here, but no real solution was provided. Setting reset_num_timesteps=False also does not change anything (and I am not sure what it is supposed to change). I tested this with different gym environments as well, but the problem persists.

What is the reason for this behavior? Can it be changed?


araffin commented 1 year ago

Hello,

Related to https://github.com/DLR-RM/stable-baselines3/issues/1059, probably duplicate of https://github.com/DLR-RM/stable-baselines3/issues/457

It is because of how the algorithms work. In short: learn() only checks the requested budget between rollouts, and on-policy algorithms such as A2C and PPO collect n_steps * n_envs steps per rollout, so the actual total is rounded up to the next multiple of the rollout size (see the n_steps and train_freq parameters in the documentation).
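As a sketch of the arithmetic (assuming PPO's default n_steps=2048 and a single environment):

import math

total_timesteps = 10_000
n_steps = 2048  # PPO default rollout length per environment
n_envs = 1

# learn() only stops between rollouts, so the actual number of steps is
# rounded up to the next multiple of the rollout size
rollout_size = n_steps * n_envs
actual = math.ceil(total_timesteps / rollout_size) * rollout_size
print(actual)  # 10240, matching the PPO number reported above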

> for DQN we miss the last evaluation(s).

This sounds more like a bug; could you provide a minimal example to reproduce that issue?

> Also setting reset_num_timesteps=False does not change anything (and I am not sure what it is supposed to change)

This is for plotting, or for when you don't want to reset the timestep counter when calling learn() multiple times.
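A minimal sketch of that use case (continuing training while keeping the logged timestep count, reusing the setup from the question above):

import gym
from stable_baselines3 import A2C

model = A2C("MlpPolicy", gym.make("MountainCar-v0"), seed=42, verbose=1)

# first call starts the timestep counter at 0
model.learn(total_timesteps=10_000)

# with reset_num_timesteps=False the counter (and the logged x-axis) keeps
# increasing instead of starting again from 0; it does not change how many
# additional steps this call runs
model.learn(total_timesteps=10_000, reset_num_timesteps=False)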