DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

[Bug]: Wrong calculation of warmup_steps for median pruner #344

Closed. qihuazhong closed this issue 1 year ago

qihuazhong commented 1 year ago

šŸ› Bug

The current calculation of n_warmup_steps divides n_evaluations by 3, which does not really make sense:

elif pruner_method == "median":
    pruner = MedianPruner(n_startup_trials=self.n_startup_trials, n_warmup_steps=self.n_evaluations // 3)

According to its usage/definition, n_evaluations is a very small number (the default is 1 evaluation per 100k timesteps) compared to related variables, for example learning_starts=50000 for the DQN algorithm.

Algorithms with a large learning_starts value, such as DQN, are likely affected the most. I believe this results in DQN models being pruned before learning even starts in the default setting.
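
Concretely, the arithmetic behind this concern looks roughly as follows (illustrative values only; the actual budget and defaults come from the zoo's tuned hyperparameters):

# Illustrative arithmetic only; these values are assumptions, not the zoo's defaults.
n_timesteps = 1_000_000                  # total training budget of one trial
n_evaluations = n_timesteps // 100_000   # "1 evaluation per 100k timesteps" -> 10
n_warmup_steps = n_evaluations // 3      # current pruner setting -> 3

learning_starts = 50_000                 # e.g. DQN's default in SB3
# 3 looks tiny next to 50000 if it is read in units of timesteps.
print(n_warmup_steps, learning_starts)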

The intended calculation should probably be:

elif pruner_method == "median":
    pruner = MedianPruner(n_startup_trials=self.n_startup_trials, n_warmup_steps=self.n_timesteps // 3)

I could submit a PR if the change is agreed upon.

To Reproduce

python train.py --algo dqn --env MountainCar-v0 --pruner median

Relevant log output / Error message

No response

System Info

No response

araffin commented 1 year ago

Hello, I need to double-check, but if I recall correctly, n_warmup_steps refers to the number of evaluations reported to Optuna so far, not to n_timesteps, which is internal to SB3 (and not known by Optuna).
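
As a minimal, self-contained Optuna sketch of that point (not the zoo's actual objective), the "step" the MedianPruner reasons about is whatever index is passed to trial.report(), i.e. the evaluation count, so n_warmup_steps is expressed in evaluations rather than environment timesteps:

import optuna
from optuna.pruners import MedianPruner

N_EVALUATIONS = 10     # e.g. 1 evaluation per 100k timesteps
N_STARTUP_TRIALS = 5

def objective(trial: optuna.Trial) -> float:
    mean_reward = 0.0
    for eval_idx in range(1, N_EVALUATIONS + 1):
        # ... train for n_timesteps // N_EVALUATIONS steps, then evaluate ...
        mean_reward = float(eval_idx)  # placeholder for the real evaluation result
        # The pruner compares trials at this "step", i.e. the evaluation index.
        trial.report(mean_reward, step=eval_idx)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return mean_reward

# n_warmup_steps is therefore counted in evaluations: pruning is disabled
# for the first N_EVALUATIONS // 3 reports of each trial.
study = optuna.create_study(
    direction="maximize",
    pruner=MedianPruner(n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_EVALUATIONS // 3),
)
study.optimize(objective, n_trials=20)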

qihuazhong commented 1 year ago

Hello, I need to double-check, but if I recall correctly, n_warmup_steps refers to the number of evaluations reported to Optuna so far, not to n_timesteps, which is internal to SB3 (and not known by Optuna).

Having read into the code, I now understand what you mean... So the original calculation n_warmup_steps=self.n_evaluations // 3 is fine. The culprit behind my optimization not working was not here.

On a related topic though, should we force the first evaluation to start only after learning_starts steps for algorithms like DQN and TD3?

araffin commented 1 year ago

should we force the first evaluation to start only after learning_starts steps for algorithms like DQN and TD3?

This can't be done, as learning_starts is usually one of the optimized hyperparameters. Anyway, the pruner will quickly prune trials where learning_starts is too high compared to the total budget (and those trials are cheap to run).

Closing, as the original question was answered.