DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

Evaluation runs way too many evaluation episodes #296

Closed JakobThumm closed 1 year ago

JakobThumm commented 1 year ago

**Describe the bug**
The evaluation runs for more than `n_eval_episodes` episodes (over 100 episodes, or even indefinitely).

**Code example**
For my custom env, the evaluation runs for more than 100 episodes, even though I set the number of eval episodes to 3.

I was able to reproduce the error for a common environment:

python train.py --algo sac --env BipedalWalkerHardcore-v3 --yaml-file hyperparameters/sac.yml -P --seed 42 --eval-freq 5000 --eval-episodes 3 --n-eval-envs 1

sac.yml

BipedalWalkerHardcore-v3:
  env_wrapper:
    - gym.wrappers.TimeLimit:
        max_episode_steps: 1000
  n_timesteps: !!float 1e7
  policy: 'MlpPolicy'
  learning_rate: lin_7.3e-4
  buffer_size: 1000000
  batch_size: 256
  ent_coef: 0.005
  gamma: 0.99
  tau: 0.01
  train_freq: 1
  gradient_steps: 1
  learning_starts: 10000
  policy_kwargs: "dict(net_arch=[256, 256])"

Note that this issue occurs if and only if I change the net_arch from [400, 300] to [256, 256]. This issue also does not occur on seed 0, but it does happen on seed 42.

Apparently, the evaluation does more than I expect: I would assume it simply runs for the given number of episodes and then training continues.


**Additional Info**
To debug this issue, I created a simple wrapper that prints a statement whenever a new episode begins (see the sketch below).
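
Something along these lines (a sketch, not the exact wrapper from the report; the class name is made up):

import gym

class EpisodeStartPrinter(gym.Wrapper):
    """Debug wrapper: print a line each time a new episode starts."""

    def __init__(self, env):
        super().__init__(env)
        self.episode_count = 0

    def reset(self, **kwargs):
        self.episode_count += 1
        print(f"[debug] starting episode {self.episode_count}")
        return self.env.reset(**kwargs)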

araffin commented 1 year ago

Hello, ~how do you know it is doing more than 3 evaluation episodes?~

env_wrapper:
  - gym.wrappers.TimeLimit:
      max_episode_steps: 1000

Why are you adding a time limit? If you do so, you need to add a Monitor wrapper after it so the new limit is taken into account. Otherwise the evaluation will only use the environment's original termination (see https://github.com/DLR-RM/stable-baselines3/issues/181 for why we do that).
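
To illustrate, a minimal sketch of what the wrapper order changes (not from the original thread; it assumes the gym 0.21-style four-value step API used at the time, with Pendulum-v1 standing in for a custom env because it never terminates on its own):

import gym
from gym.wrappers import TimeLimit
from stable_baselines3.common.monitor import Monitor

def last_info(env, n_steps=300):
    # Step with random actions until `done` and return the final info dict.
    env.reset()
    for _ in range(n_steps):
        _, _, done, info = env.step(env.action_space.sample())
        if done:
            return info
    return {}

# TimeLimit *outside* Monitor: Monitor never sees the truncation, so the
# final info carries no "episode" record for evaluate_policy to count.
wrong = TimeLimit(Monitor(gym.make("Pendulum-v1").unwrapped), max_episode_steps=100)
print("episode" in last_info(wrong))  # False

# Monitor *outside* TimeLimit (the fix): Monitor sees done=True and records
# the episode, so the evaluation stops after n_eval_episodes.
right = Monitor(TimeLimit(gym.make("Pendulum-v1").unwrapped, max_episode_steps=100))
print("episode" in last_info(right))  # True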

EDIT: to check the number of evaluations:

import numpy as np

# evaluations.npz is written by the EvalCallback during training;
# "ep_lengths" has shape (number of evaluations, n_eval_episodes)
evaluations = np.load("logs/sac/BipedalWalkerHardcore-v3_12/evaluations.npz")
print(evaluations["ep_lengths"].shape)

JakobThumm commented 1 year ago

Why are you adding a time limit?

In my custom environment, I would like to have a limited episode length. Isn't the TimeLimit wrapper the way to go then?

If you do so, you need to add a Monitor wrapper after it so the new limit is taken into account.

I added the basic common.monitor.Monitor wrapper, which fixed the issue. Even after reading the linked issue, I still don't fully understand why the monitor is needed. However, if simply adding a monitor fixes the issue, I'm happy :) Thank you, Antonin
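
For reference, the fixed env_wrapper entry presumably looks something like this (a sketch; the zoo applies the listed wrappers in order, so the Monitor goes after the TimeLimit):

env_wrapper:
  - gym.wrappers.TimeLimit:
      max_episode_steps: 1000
  - stable_baselines3.common.monitor.Monitor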

araffin commented 1 year ago

I still don't fully understand why we need the monitor after reading the linked issue.

Best is to take a look at the code: https://github.com/DLR-RM/stable-baselines3/blob/52c29dc497fa2eb235d0476b067bed8ac488fe64/stable_baselines3/common/evaluation.py#L103-L114
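
In short, the gist of that block (paraphrased and simplified here, not the verbatim SB3 code): for a Monitor-wrapped env, `done` alone is not trusted; an episode only counts as finished once Monitor has written an "episode" entry into `info`.

def count_finished_episodes(dones, infos, is_monitor_wrapped):
    # Simplified paraphrase of the linked evaluate_policy logic.
    episodes = 0
    for done, info in zip(dones, infos):
        if done:
            if is_monitor_wrapped:
                # Do not trust `done`: only Monitor-recorded endings count.
                if "episode" in info:
                    episodes += 1
            else:
                episodes += 1
    return episodes

# A TimeLimit placed outside Monitor yields done=True without an "episode"
# entry, so nothing is counted and the evaluation loop keeps running:
print(count_finished_episodes([True], [{}], is_monitor_wrapped=True))  # 0
print(count_finished_episodes([True], [{"episode": {"r": 0.0, "l": 1000}}], is_monitor_wrapped=True))  # 1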

JakobThumm commented 1 year ago

This clarifies the matter, thanks :+1: