DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Huge performance difference with different n_envs #766

Closed. Darklanx closed this issue 2 years ago.

Darklanx commented 2 years ago

Question

I'm running A2C with default parameters on BreakoutNoFrameskip-v4 in two training scenarios whose only difference is that one uses n_envs=16 (orange) while the other uses n_envs=40 (blue). However, the performance difference is huge. Is there a particular reason behind this behavior? I thought n_envs was more of a parallelization parameter, which shouldn't have such a huge impact on performance. [image: learning curves for the two runs]

Additional context

config.yml:

!!python/object/apply:collections.OrderedDict
- - - ent_coef
    - 0.01
  - - env_wrapper
    - - stable_baselines3.common.atari_wrappers.AtariWrapper
  - - frame_stack
    - 4
  - - n_envs
    - 40 (or 16)
  - - n_timesteps
    - 10000000.0
  - - policy
    - CnnPolicy
  - - policy_kwargs
    - dict(optimizer_class=RMSpropTFLike, optimizer_kwargs=dict(eps=1e-5))
  - - vf_coef
    - 0.25

args.yml:

!!python/object/apply:collections.OrderedDict
- - - algo
    - a2c
  - - env
    - BreakoutNoFrameskip-v4
  - - env_kwargs
    - null
  - - eval_episodes
    - 5
  - - eval_freq
    - 10000
  - - gym_packages
    - []
  - - hyperparams
    - null
  - - log_folder
    - logs
  - - log_interval
    - -1
  - - n_eval_envs
    - 1
  - - n_evaluations
    - 20
  - - n_jobs
    - 1
  - - n_startup_trials
    - 10
  - - n_timesteps
    - -1
  - - n_trials
    - 10
  - - no_optim_plots
    - false
  - - num_threads
    - -1
  - - optimization_log_path
    - null
  - - optimize_hyperparameters
    - false
  - - pruner
    - median
  - - sampler
    - tpe
  - - save_freq
    - -1
  - - save_replay_buffer
    - false
  - - seed
    - 0
  - - storage
    - null
  - - study_name
    - null
  - - tensorboard_log
    - ''
  - - trained_agent
    - ''
  - - truncate_last_trajectory
    - true
  - - uuid
    - false
  - - vec_env
    - dummy
  - - verbose
    - 1

Darklanx commented 2 years ago

Moving this to rl-baselines3-zoo

pengzhi1998 commented 2 years ago

I have faced the same problem with PPO. Did you solve the issue?

JakobThumm commented 1 year ago

Hello, @araffin pointed out this colab https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/3_multiprocessing.ipynb in another, similar issue (I cannot find it anymore, sorry). The tl;dr of that colab is that increasing n_envs while keeping n_timesteps constant leads to performance drops. [image: results from the colab]

To understand this issue, we have to understand how rollout collection and policy training work in SB3 for on-policy algorithms. Your training loop (https://github.com/DLR-RM/stable-baselines3/blob/e39bc3da00c49413b765176af1b95f2361a35098/stable_baselines3/common/on_policy_algorithm.py#L246) looks like this:

        while self.num_timesteps < total_timesteps:
            self.collect_rollouts(self.env, self.rollout_buffer, n_rollout_steps=self.n_steps)
            self.train()

As you can see, we first collect rollouts and then train the agent, repeating until training is finished. Now, let's have a look at collect_rollouts (https://github.com/DLR-RM/stable-baselines3/blob/e39bc3da00c49413b765176af1b95f2361a35098/stable_baselines3/common/on_policy_algorithm.py#L158), strongly simplified:

        # obs is carried over from the previous rollout (self._last_obs in the real code)
        for i in range(self.n_steps):
            actions = self.policy(obs)
            new_obs, rewards, dones, infos = env.step(actions)  # execute 1 step in each of the num_envs environments
            self.num_timesteps += env.num_envs
            self.rollout_buffer.add(obs, actions, rewards, dones)  # store the transition that produced `actions`
            obs = new_obs

As you can see, in each rollout collection there will be n_steps * n_envs transitions added to the rollout_buffer. Small example: if you have n_envs=8, n_steps=10, n_timesteps=800, then you will call the collect_rollouts function 10 times, and your rollout buffer has size 80 (flattened). Therefore, you will also call your train function only 10 times. So, if we increase the number of environments and keep n_steps the same, we will call the train function less often, but with a larger rollout buffer.
A2C is a very special algorithm in this regard, as it does not have a batch size! A2C always uses the full rollout buffer as a single batch. The drop in performance, therefore, comes from the fact that updating a NN n times with a batch size of m is not the same as updating it 1 time with a batch size of n*m. In fact, this is also the reason why the multi-processing in the colab is so much faster than the single-core run, despite the fact that Colab only provides at most 2 cores: the networks are simply updated less often as n_envs increases.
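To make the numbers concrete for this issue, here is a quick back-of-the-envelope sketch (assuming A2C's SB3 default of n_steps=5, which the zoo config above does not override) comparing the two runs from the original question:

    # Rough update-count arithmetic for the two runs in the original question.
    # Assumption: A2C's SB3 default n_steps=5 (not overridden in the config above).
    n_timesteps = 10_000_000
    n_steps = 5

    for n_envs in (16, 40):
        rollout_size = n_steps * n_envs          # transitions collected per collect_rollouts call
        n_updates = n_timesteps // rollout_size  # A2C does one gradient update per rollout
        print(f"n_envs={n_envs:2d}: effective batch size {rollout_size}, ~{n_updates} gradient updates")

    # n_envs=16: effective batch size 80,  ~125000 gradient updates
    # n_envs=40: effective batch size 200, ~50000 gradient updates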

Okay, cool. Now that we understand this part, let's try another algorithm with an adjustable batch size, namely PPO. We change ALGO = PPO and set the model parameters to

model = ALGO('MlpPolicy', train_env, verbose=0, n_steps=16, batch_size=16)

The n_timesteps is 5000. With 1 environment, we will collect 312 (and a half) rollouts and update the policy once after each rollout, resulting in 312 total policy updates with a batch size of 16. With 4 environments, we will collect 78 rollouts and update the policy 4 times after each rollout, again resulting in 312 total policy updates with a batch size of 16. Please note that the total number of policy updates is the same, but the updates are concentrated at less frequent points in time.
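A minimal, self-contained sketch of this comparison (assuming CartPole-v1 as a stand-in task and make_vec_env/evaluate_policy for vectorization and evaluation; the colab uses its own environment and helpers):

    # Sketch of the 1-env vs. 4-env PPO comparison described above.
    # Assumptions: CartPole-v1 as the task, default PPO hyperparameters otherwise.
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.evaluation import evaluate_policy

    for n_envs in (1, 4):
        train_env = make_vec_env("CartPole-v1", n_envs=n_envs)
        model = PPO("MlpPolicy", train_env, verbose=0, n_steps=16, batch_size=16)
        model.learn(total_timesteps=5_000)

        eval_env = make_vec_env("CartPole-v1", n_envs=1)
        mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=20)
        print(f"n_envs={n_envs}: mean reward {mean_reward:.1f} +/- {std_reward:.1f}")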

And here are the results: [image: learning curves for 1 vs. 4 environments]

As you can see, more environments in parallel do not lead to worse performance! The results even suggest the opposite, but that is a statement that certainly cannot be made in general.

@araffin I would love to hear your opinion on this one :) It would be cool if we could have a short explanation in the docs!

araffin commented 1 year ago

The drop in performance, therefore, comes from the fact that updating a NN n times with a batch size of m is not the same as updating it 1 time with a batch size of n*m.

yes, you can account for that by using either a larger learning rate (scaling it by sqrt(new_batch_size / old_batch_size)) or adjusting n_steps (but then the Monte Carlo estimate used for value approximation becomes less accurate)
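For the runs in this issue, that adjustment would look roughly like this (a sketch assuming A2C's SB3 defaults of n_steps=5 and learning_rate=7e-4, neither of which the config above overrides):

    # Scaling the learning rate by sqrt(new_batch_size / old_batch_size)
    # when going from 16 to 40 environments with A2C.
    # Assumptions: A2C defaults n_steps=5 and learning_rate=7e-4.
    import math

    old_batch_size = 16 * 5  # n_envs * n_steps
    new_batch_size = 40 * 5
    old_lr = 7e-4
    new_lr = old_lr * math.sqrt(new_batch_size / old_batch_size)
    print(f"scaled learning rate: {new_lr:.2e}")  # ~1.11e-03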

The results even suggest the opposite, but that is a statement that certainly cannot be made in general.

yes, as we say in the colab, with more environments you will also explore more, so it will usually take more time to converge, but it may converge to a better final performance.

george-adams1 commented 11 months ago

yes, you can account for that by using either a larger learning rate (scaling it by sqrt(new_batch_size / old_batch_size))

@araffin is this logic for scaling learning rate also applicable to PPO?