DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

SB2 vs SB3 - Performance difference #1124

Closed MatPoliquin closed 1 year ago

MatPoliquin commented 2 years ago

❓ Question

EDIT: After doing some more digging, I updated the post title and added more details with a newer version of SB3 (1.6.2).

I am using the OpenAI gym-retro env to train on games and migrated from SB2 to SB3 1.6.2. I noticed the training FPS dropped by a lot, from 1300 FPS to 900 FPS.

Using Nvidia Nsight, I profiled both versions (you can find the reports at the Google Drive link below; you need Nsight to view them): https://drive.google.com/drive/folders/1Lqxf-qKXTj__Hp8WUXgNHejZaJGy8oct?usp=sharing

Here are the parameters I use for PPO with SB3 (with SB2 I just use the default parameters provided by SB):

PPO(
    policy=args.nn,
    env=env,
    verbose=1,
    n_steps=128,
    n_epochs=4,
    batch_size=256,
    learning_rate=2.5e-4,
    clip_range=0.2,
    vf_coef=0.5,
    ent_coef=0.01,
    max_grad_norm=0.5,
    clip_range_vf=None,
)

My specs:

Code I use to wrap the retro env (same for both SB2 and SB3 cases):

import os
import retro

# SB3 imports; the SB2 run uses the equivalent stable_baselines modules
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.atari_wrappers import ClipRewardEnv, WarpFrame
from stable_baselines3.common.vec_env import SubprocVecEnv, VecFrameStack, VecTransposeImage
# StochasticFrameSkip is the sticky-action frame-skip wrapper from the gym-retro examples (see the sketch below)


def make_retro(*, game, state=None, num_players, max_episode_steps=4500, **kwargs):
    if state is None:
        state = retro.State.DEFAULT
    env = retro.make(game, state, **kwargs, players=num_players)
    return env


def init_env(output_path, num_env, state, num_players, args, use_frameskip=True, use_display=False):
    seed = 0
    start_index = 0
    start_method = None
    allow_early_resets = True

    def make_env(rank):
        def _thunk():
            env = make_retro(game=args.env, use_restricted_actions=retro.Actions.FILTERED, state=state, num_players=num_players)
            env.seed(seed + rank)
            env = Monitor(env, output_path and os.path.join(output_path, str(rank)), allow_early_resets=allow_early_resets)
            if use_frameskip:
                env = StochasticFrameSkip(env, n=4, stickprob=0.25)
            env = WarpFrame(env)
            env = ClipRewardEnv(env)
            return env
        return _thunk

    # set_global_seeds(seed)  # SB2-only helper, not needed with SB3
    env = SubprocVecEnv([make_env(i + start_index) for i in range(num_env)], start_method=start_method)
    env = VecFrameStack(env, n_stack=4)
    env = VecTransposeImage(env)
    return env
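For reference, StochasticFrameSkip is not an SB3 wrapper; it is the sticky-action frame-skip wrapper used in the gym-retro examples. A rough sketch of such a wrapper (details may differ from the exact version used here), written against the old gym step API:

import gym
import numpy as np


class StochasticFrameSkip(gym.Wrapper):
    """Repeat each action for n frames; on the first repeated frame, keep the
    previous action with probability stickprob ("sticky actions")."""

    def __init__(self, env, n, stickprob):
        super().__init__(env)
        self.n = n
        self.stickprob = stickprob
        self.curac = None
        self.rng = np.random.RandomState()

    def reset(self, **kwargs):
        self.curac = None
        return self.env.reset(**kwargs)

    def step(self, ac):
        done = False
        totrew = 0.0
        for i in range(self.n):
            if self.curac is None:
                # First step of an episode: always use the new action
                self.curac = ac
            elif i == 0:
                # Possibly stick to the previous action for one frame
                if self.rng.rand() > self.stickprob:
                    self.curac = ac
            elif i == 1:
                # From the second frame on, always use the new action
                self.curac = ac
            ob, rew, done, info = self.env.step(self.curac)
            totrew += rew
            if done:
                break
        return ob, totrew, done, info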


araffin commented 2 years ago

Hello. Please provide minimal code to reproduce the issue. I guess you are not using SubprocVecEnv? (You should try it.)

I noticed the training FPS dropped by a lot, from 1300 FPS to 900 FPS.

There might be a performance drop because PyTorch uses eager evaluation, but probably not that much.

It seems VecTransposeImage has high CPU usage (as expected for a large number of envs, 24 here). Are there plans to do this operation on the GPU instead?

Are you sure the slowness is due to VecTransposeImage? One thing you can try is setting OMP_NUM_THREADS to a lower value (start with 1), see https://github.com/DLR-RM/stable-baselines3/issues/413 and https://github.com/DLR-RM/stable-baselines3/issues/283
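For reference, a minimal sketch of two ways to apply that thread limit; the value 1 is just a starting point to experiment with:

# Limit intra-op CPU threads. OMP_NUM_THREADS typically needs to be set
# before torch is imported (or exported in the shell) to take effect.
import os
os.environ["OMP_NUM_THREADS"] = "1"

import torch
torch.set_num_threads(1)  # runtime equivalent exposed by PyTorch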

MatPoliquin commented 2 years ago

Hello. Please provide minimal code to reproduce the issue. I guess you are not using SubprocVecEnv? (You should try it.)

I already use SubprocVecEnv; I edited my post to add the code that sets up the env.

I noticed the training FPS dropped by a lot, from 1300 FPS to 900 FPS.

There might be a performance drop because PyTorch uses eager evaluation, but probably not that much.

Good point, might explain at least part of it

It seems VecTransposeImage has high CPU usage (as expected for a large number of envs, 24 here). Are there plans to do this operation on the GPU instead?

Are you sure the slowness is due to VecTransposeImage? One thing you can try is setting OMP_NUM_THREADS to a lower value (start with 1), see #413 and #283

I tried setting OMP_NUM_THREADS to lower values, but it doesn't make much of a difference since I use the GPU. It only makes a difference if I force PyTorch to use the CPU, as expected.

araffin commented 2 years ago

Related: https://github.com/DLR-RM/stable-baselines3/issues/90 and https://github.com/DLR-RM/stable-baselines3/issues/122#issuecomment-1065057830 (there I provide a Colab notebook for comparison)

I noticed the training FPS dropped by a lot, from 1300 FPS to 900 FPS.

Did you double-check that the hyperparameters were equivalent? What were you using for SB2 PPO?

EDIT: the 1.4x difference seems to match the results I got with the Colab notebooks

MatPoliquin commented 2 years ago

Related: #90 and #122 (comment) (there I provide a Colab notebook for comparison)

I noticed the training FPS dropped by a lot, from 1300 FPS to 900 FPS.

Did you double-check that the hyperparameters were equivalent? What were you using for SB2 PPO?

I was using the default parameters: https://github.com/hill-a/stable-baselines/blob/45beb246833b6818e0f3fc1f44336b1c52351170/stable_baselines/ppo2/ppo2.py#L53

The only parameter I am not sure about is batch_size; I experimented with different values (256, 512, 1024) and the performance is still lower.

EDIT: the 1.4x difference seems to match the results I got with the Colab notebooks

Interesting, I did not see this, so basically my results are probably normal

araffin commented 2 years ago

The only parameter I am not sure about is batch_size, I experimented with different values (256, 512, 1024) and the performance is still lower

See conversion for batch size: https://stable-baselines3.readthedocs.io/en/master/guide/migration.html#ppo
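As an illustration of that conversion (numbers only as an example, using the 24 envs from this issue), a quick sketch:

# SB2 PPO2 derived the minibatch size implicitly from these defaults:
n_steps = 128       # rollout length per env (SB2 PPO2 default)
nminibatches = 4    # SB2 PPO2 default
n_envs = 24         # number of parallel envs used in this issue

# Roughly equivalent SB3 PPO batch_size:
batch_size = n_steps * n_envs // nminibatches
print(batch_size)   # 768, i.e. larger than the 256 used above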

Interesting, I did not see this, so basically my results are probably normal

yes...

araffin commented 2 years ago

Reading https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/, we should probably try the channels-last memory format; it is just a few lines of code to change (https://pytorch.org/blog/tensor-memory-format-matters/) and the shape of the tensors stays the same.
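For illustration (not taken from the linked posts), a quick check that channels-last only changes the memory layout, not the tensor shape:

import torch

x = torch.randn(8, 4, 84, 84)  # NCHW batch of stacked 84x84 frames
x_cl = x.to(memory_format=torch.channels_last)

print(x_cl.shape)                 # torch.Size([8, 4, 84, 84]) -- unchanged
print(x.stride(), x_cl.stride())  # strides differ: layout changed, shape did not
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True

conv = torch.nn.Conv2d(4, 32, kernel_size=8, stride=4)
conv = conv.to(memory_format=torch.channels_last)
out = conv(x_cl)                  # the conv consumes the channels-last input as usual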

araffin commented 2 years ago

@MatPoliquin all you need to do is apparently:

x = x.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)

I would be happy to receive your feedback if you give it a try ;)

MatPoliquin commented 2 years ago

@MatPoliquin all you need to do is apparently:

x = x.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)

I would be happy to receive your feedback if you give it a try ;)

So these changes should be made in on_policy_algorithm.py?

I modified the code below, but I am not quite sure if it's correct.

line 102:

def _setup_model(self) -> None:
    self._setup_lr_schedule()
    self.set_random_seed(self.seed)

    buffer_cls = DictRolloutBuffer if isinstance(self.observation_space, gym.spaces.Dict) else RolloutBuffer

    self.rollout_buffer = buffer_cls(
        self.n_steps,
        self.observation_space,
        self.action_space,
        device=self.device,
        gamma=self.gamma,
        gae_lambda=self.gae_lambda,
        n_envs=self.n_envs,
    )

    # Added: this is the part I am least sure about -- the rollout buffer stores
    # NumPy arrays rather than torch tensors, so this call is probably not right as-is
    self.rollout_buffer = self.rollout_buffer.to(memory_format=torch.channels_last)

    self.policy = self.policy_class(  # pytype:disable=not-instantiable
        self.observation_space,
        self.action_space,
        self.lr_schedule,
        use_sde=self.use_sde,
        **self.policy_kwargs  # pytype:disable=not-instantiable
    )
    # Added: move the policy to the device with channels-last memory format
    self.policy = self.policy.to(self.device, memory_format=torch.channels_last)

araffin commented 2 years ago

gym 0.26.2

You are using the experimental branch right? Otherwise, SB3 is only compatible with gym 0.21 for now.

I modified the code below but not quite sure if it's correct

Yes, and you need to modify the rollout buffer.

I did some quick tests with the RL Zoo (defaults to 8 envs); here is what I can recommend:

For instance, with the default command I get around 800 FPS: `python train.py --algo ppo --env PongNoFrameskip-v4 -P --verbose 0 --eval-freq 100000`

With subprocess envs, I get 1100 FPS: `OMP_NUM_THREADS=4 python train.py --algo ppo --env PongNoFrameskip-v4 -P --verbose 0 --eval-freq 100000 --vec-env subproc`

You could also try to add support for CNNs in the experimental SBX: https://github.com/araffin/sbx/pull/6 and https://github.com/araffin/sbx/pull/4

(SBX PPO is ~2x faster than SB3 PPO but has fewer features)

araffin commented 1 year ago

Reading https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/, we should probably try the channels-last memory format; it is just a few lines of code to change (https://pytorch.org/blog/tensor-memory-format-matters/) and the shape of the tensors stays the same.

So, I tested that but it didn't help much.

What gave me an 8% speed boost was setting copy=False when creating the tensors (see https://github.com/DLR-RM/stable-baselines3/compare/feat/non-blocking?expand=1). With that and subprocess envs, I can get ~1800 FPS using:

`python -m rl_zoo3.train --algo ppo --env PongNoFrameskip-v4 --verbose 0 -P --seed 1 -n 60000 --vec-env subproc --eval-freq -1`
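For illustration only (not the exact code of that branch), the general difference between copying and non-copying tensor creation in PyTorch looks like this, with the array shape borrowed from this issue:

import numpy as np
import torch as th

obs = np.zeros((24, 4, 84, 84), dtype=np.uint8)  # a batch of stacked frames

t_copy = th.tensor(obs)      # always copies the NumPy data
t_share = th.as_tensor(obs)  # reuses the NumPy memory when possible (no copy)

print(t_share.data_ptr() == obs.ctypes.data)  # True: same underlying buffer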
araffin commented 1 year ago

Small update on this: I now have an experimental SB3 + Jax = SBX version here: https://github.com/araffin/sbx

With the proper hyperparameters, SAC can run 20x faster than its PyTorch equivalent =): https://twitter.com/araffin2/status/1590714601754497024