Moving this to rl-baselines3-zoo
I have faced the same problem with PPO. Did you solve the issue?
Hello,
@araffin pointed out this Colab https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/3_multiprocessing.ipynb in another similar issue (I cannot find it anymore, sorry.)
Tl;dr of this colab was that increasing the number of `n_envs` while keeping `n_timesteps` constant leads to performance drops.
To understand this issue, we have to understand how rollout collection and policy training work in SB3 for on-policy algorithms. Your training loop (https://github.com/DLR-RM/stable-baselines3/blob/e39bc3da00c49413b765176af1b95f2361a35098/stable_baselines3/common/on_policy_algorithm.py#L246) looks like this:
```python
while self.num_timesteps < total_timesteps:
    self.collect_rollouts(self.env, self.rollout_buffer, n_rollout_steps=self.n_steps)
    self.train()
```
As you can see, we will first collect rollouts and then train the agent until the training is finished.
Now, let's have a look at `collect_rollouts` (https://github.com/DLR-RM/stable-baselines3/blob/e39bc3da00c49413b765176af1b95f2361a35098/stable_baselines3/common/on_policy_algorithm.py#L158), strongly simplified:
```python
for i in range(self.n_steps):
    actions = self.policy(obs)
    # execute 1 step in each of the num_envs environments
    obs, r, d, infos = env.step(actions)
    self.num_timesteps += env.num_envs
    self.rollout_buffer.add(obs, r, d, infos, actions)
```
As you can see, in each rollout collection there will be `n_steps * n_envs` transitions added to the rollout buffer. Small example: if you have `n_envs=8`, `n_steps=10`, `n_timesteps=800`, then you will call the `collect_rollouts` function 10 times, and your rollout buffer has size 80 (flattened). Therefore, you will also only call your `train` function 10 times.
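To make the bookkeeping explicit, here is a tiny plain-Python sketch (not SB3 code, just the arithmetic from the example above):

```python
# Plain-Python sketch of the bookkeeping described above (not actual SB3 code).
n_envs = 8         # parallel environments
n_steps = 10       # steps collected per environment per rollout
n_timesteps = 800  # total training budget

transitions_per_rollout = n_envs * n_steps           # 80 = flattened buffer size
n_rollouts = n_timesteps // transitions_per_rollout  # 10 calls to collect_rollouts()
n_train_calls = n_rollouts                           # train() is called once per rollout

print(transitions_per_rollout, n_rollouts, n_train_calls)  # 80 10 10
```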
So, if we increase the number of environments and keep the step size the same, we will call the `train` function less often but with a larger rollout buffer.
A2C is a very special algorithm in this regard, as it does not have a batch size! A2C will always use the full rollout buffer as one batch. The drop in performance, therefore, comes from the fact that updating a NN n times with a batch size of m is not the same as updating it 1 time with a batch size of n*m. In fact, this is also the reason why the multi-processing in the colab is so much faster than the single core, despite the fact that Colab only uses at most 2 cores: the networks are simply updated less often with an increasing number of `n_envs`.
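To illustrate that last point outside of RL entirely, here is a minimal toy sketch (plain NumPy gradient descent on a linear model, nothing to do with SB3 internals) showing that several small updates generally do not land on the same parameters as one update on the concatenated batch:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(80, 3)), rng.normal(size=80)  # toy regression data
lr = 0.1

def sgd(batches, w=None):
    """One mean-squared-error gradient step per batch."""
    w = np.zeros(3) if w is None else w
    for Xb, yb in batches:
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)
        w = w - lr * grad
    return w

# 8 updates with batches of 10 ...
w_small = sgd([(X[i:i + 10], y[i:i + 10]) for i in range(0, 80, 10)])
# ... versus 1 update with a single batch of 80
w_big = sgd([(X, y)])

print(np.allclose(w_small, w_big))  # False: the two update schedules differ
```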
Okay, cool. Now that we understand this part, let's try another algorithm with an adjustable batch size, namely PPO.
We change `ALGO = PPO` and set the model parameters to `model = ALGO('MlpPolicy', train_env, verbose=0, n_steps=16, batch_size=16)`.
The `n_timesteps` is 5000.
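For reference, the comparison can be reproduced with something along these lines (a sketch only: the environment id and the use of `make_vec_env` are my assumptions, the colab has its own setup code):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

for n_envs in (1, 4):
    # "CartPole-v1" is a placeholder environment, not necessarily the one from the colab.
    train_env = make_vec_env("CartPole-v1", n_envs=n_envs)
    model = PPO("MlpPolicy", train_env, verbose=0, n_steps=16, batch_size=16)
    model.learn(total_timesteps=5000)
```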
With 1 environment, we will collect 312 (and a half?) rollouts and update the policy once after each rollout, resulting in 312 total policy updates with a batch size of 16.
With 4 environments, we will collect 78 rollouts and update the policy 4 times after each rollout, resulting in 312 total policy updates with a batch size of 16.
Please note that the total number of policy updates is the same, but the updates are concentrated at less frequent time steps.
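A quick back-of-the-envelope check of those numbers (ignoring PPO's `n_epochs`, which multiplies both settings by the same factor):

```python
def ppo_update_counts(n_timesteps, n_steps, n_envs, batch_size):
    buffer_size = n_steps * n_envs                   # transitions collected per rollout
    n_rollouts = n_timesteps // buffer_size          # calls to collect_rollouts()
    updates_per_rollout = buffer_size // batch_size  # minibatch updates per rollout (per epoch)
    return n_rollouts, n_rollouts * updates_per_rollout

print(ppo_update_counts(5000, 16, 1, 16))  # (312, 312)
print(ppo_update_counts(5000, 16, 4, 16))  # (78, 312)
```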
And here are the results:
As you can see, more environments in parallel do not lead to worse performance! The results even suggest the opposite, but that is a statement that certainly cannot be made in general.
@araffin I would love to hear your opinion on this one :) It would be cool if we could have a short explanation in the docs!
> The drop in performance, therefore, comes from the fact that updating a NN n times with a batch size of m is not the same as updating it 1 time with a batch size of n*m.
Yes, you can account for that by either using a larger learning rate (scaling it by `sqrt(new_batch_size / old_batch_size)`) or adjusting `n_steps` (but then it will make the Monte Carlo estimate less accurate for the value approximation).
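As a concrete (hypothetical) illustration of that rule, assuming you go from 1 to 4 environments with a fixed `n_steps` (the environment id is a placeholder; `7e-4` is SB3's default A2C learning rate):

```python
import math

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

old_n_envs, new_n_envs = 1, 4
old_lr = 7e-4  # SB3's default learning rate for A2C

# With n_steps fixed, the effective batch size (n_steps * n_envs) grows linearly
# with n_envs, so scale the learning rate by sqrt(new_batch_size / old_batch_size).
new_lr = old_lr * math.sqrt(new_n_envs / old_n_envs)  # 7e-4 * 2 = 1.4e-3

train_env = make_vec_env("CartPole-v1", n_envs=new_n_envs)  # placeholder env id
model = A2C("MlpPolicy", train_env, learning_rate=new_lr, verbose=0)
```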
> The results even suggest the opposite, but that is a statement that certainly cannot be made in general.
Yes, as we say in the colab, with more environments you will also explore more, so it will usually take more time to converge, but it may converge to a better final performance.
> Yes, you can account for that by either using a larger learning rate (scaling it by `sqrt(new_batch_size / old_batch_size)`)
@araffin is this logic for scaling learning rate also applicable to PPO?
Question
I'm running A2C with default parameters on BreakoutNoFrameskip-v4, with two different training scenarios, where the only difference is that one uses `n_envs=16` (orange) while the other one sets `n_envs=40` (blue). However, the performance difference is huge. Is there a particular reason behind this behavior? I thought `n_envs` is more of a parallelization parameter, which shouldn't have such a huge impact on performance.

![image](https://user-images.githubusercontent.com/26689157/153244495-a7b90c96-5f3f-47eb-90fb-1b46adac635d.png)

Additional context
config.yml:
args.yml: