timosturm opened this issue 1 year ago
Hello,
Related to https://github.com/DLR-RM/stable-baselines3/issues/1059, probably duplicate of https://github.com/DLR-RM/stable-baselines3/issues/457
It is because of how the algorithms work. In short:

- On-policy algorithms (A2C/PPO) collect `n_steps * n_envs` steps of experience before performing an update, so if you want exactly `total_timesteps`, you will need to adjust those values.
- Off-policy algorithms (DQN/SAC/TD3) collect `train_freq * n_envs` steps before performing an update (when `train_freq` is given in steps), so if you want exactly `total_timesteps`, you will need to adjust those values (`train_freq=4` by default for DQN).
- When `train_freq` is given in episodes, the algorithm collects `n_episodes` with `n_envs` envs, so unless the number of steps per episode is fixed, it is not possible to achieve exactly `total_timesteps`.
- When using multiple environments, each `env.step()` corresponds to `n_envs` timesteps, so it is no longer possible to trigger the `EvaluationCallback` at an exact timestep.
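For example, with PPO the number of timesteps actually performed is rounded up to the next multiple of the rollout size. A minimal sketch (environment and numbers are illustrative, assuming a single env and PPO's default `n_steps=2048`):

```python
from stable_baselines3 import PPO

# Single env, default rollout size n_steps=2048
model = PPO("MlpPolicy", "CartPole-v1", n_steps=2048)

# PPO collects n_steps * n_envs = 2048 transitions per rollout, so learn()
# performs ceil(3000 / 2048) = 2 rollouts, i.e. 4096 actual timesteps.
model.learn(total_timesteps=3000)
print(model.num_timesteps)  # 4096, not 3000

# Picking n_steps so that it divides total_timesteps avoids the overshoot
# (batch_size chosen here so it divides the rollout buffer size):
model = PPO("MlpPolicy", "CartPole-v1", n_steps=1000, batch_size=50)
model.learn(total_timesteps=3000)
print(model.num_timesteps)  # 3000
```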
> for DQN we miss the last evaluation(s).
This sounds more like a bug; could you provide a minimal example to reproduce the issue?
> Also setting `reset_num_timesteps=False` does not change anything (and I am not sure what it is supposed to change)
This is for plotting, or when you don't want to perform a reset when calling `learn()` multiple times.
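For instance (a small sketch; the environment and step counts are placeholders):

```python
from stable_baselines3 import DQN

model = DQN("MlpPolicy", "CartPole-v1", learning_starts=1_000)
model.learn(total_timesteps=10_000)

# Continue training for 10_000 more steps; the internal counter (and the
# logger's x-axis) continues from 10_000 instead of restarting at 0.
model.learn(total_timesteps=10_000, reset_num_timesteps=False)
print(model.num_timesteps)  # 20_000
```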
❓ Question
Consider this setup:
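The original snippet is not preserved here; a minimal sketch of such a setup (the environment and `eval_freq` are assumptions, the `100_000` budget comes from the description below) could look like this:

```python
import gymnasium as gym  # `import gym` for stable-baselines3 < 2.0
from stable_baselines3 import DQN, PPO
from stable_baselines3.common.callbacks import EvalCallback

for algo in (DQN, PPO):
    # EvalCallback is SB3's evaluation callback; it evaluates every
    # eval_freq calls (per env) during training.
    eval_env = gym.make("CartPole-v1")
    callback = EvalCallback(eval_env, eval_freq=10_000)

    model = algo("MlpPolicy", "CartPole-v1")
    model.learn(total_timesteps=100_000, callback=callback)
    # The number of steps actually taken can differ from total_timesteps:
    print(algo.__name__, model.num_timesteps)
```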
The problem is that some of the agents train for (much) more time steps than specified. This changes depending on the number of timesteps set: e.g., DQN trains for exactly `100_000` time steps if that value is specified, but overall DQN often seems to train for fewer steps and PPO for more steps than specified.

This behavior is also bad when using the `EvaluationCallback`, because for some algorithms we do more (time-consuming) evaluations than requested, and for DQN we miss the last evaluation(s).

The question was asked before, here, but no real solution was provided. Also, setting `reset_num_timesteps=False` does not change anything (and I am not sure what it is supposed to change). I also tested this with different gym environments, but the problem persists.

What is the reason for this behavior? Can it be changed?