Hello,
Please provide a minimal code to reproduce the issue.
I guess you are not using `SubprocVecEnv`? (You should try it.)
> I noticed the training FPS reduced by a lot from 1300fps to 900fps.

There might be a performance drop because PyTorch uses eager evaluation, but probably not that much.
> It seems `VecTransposeImage` has a high CPU usage (as expected for a large number of envs, 24 here). Are there plans to do this operation on the GPU instead?

Are you sure the slowness is due to `VecTransposeImage`?
One thing you can try is setting `OMP_NUM_THREADS` to a lower value (start with 1), see https://github.com/DLR-RM/stable-baselines3/issues/413 and https://github.com/DLR-RM/stable-baselines3/issues/283
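For reference, a quick sketch of limiting the thread count programmatically, which is roughly equivalent to launching the process with `OMP_NUM_THREADS=1` (illustrative only):

```python
import torch

# Cap PyTorch's intra-op CPU threads; call this before building the model.
torch.set_num_threads(1)
print(torch.get_num_threads())  # -> 1
```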
> Hello, Please provide a minimal code to reproduce the issue. I guess you are not using `SubprocVecEnv`? (You should try it.)

I already use `SubprocVecEnv`; I edited my post to add the code that sets up the env.
> > I noticed the training FPS reduced by a lot from 1300fps to 900fps.
>
> There might be a performance drop because PyTorch uses eager evaluation, but probably not that much.

Good point, that might explain at least part of it.
> > It seems `VecTransposeImage` has a high CPU usage (as expected for a large number of envs, 24 here). Are there plans to do this operation on the GPU instead?
>
> Are you sure the slowness is due to `VecTransposeImage`? One thing you can try is setting `OMP_NUM_THREADS` to a lower value (start with 1), see #413 and #283

I tried to set `OMP_NUM_THREADS` to lower values but it doesn't make much of a difference since I use the GPU. It only makes a difference if I force PyTorch to use the CPU, as expected.
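(For reference, forcing PyTorch onto the CPU in SB3 is just the `device` argument; a minimal illustrative sketch with a toy env, not the retro setup from this issue:)

```python
import gym
from stable_baselines3 import PPO

# Run the same algorithm on CPU to see the effect of the thread settings.
model = PPO("MlpPolicy", gym.make("CartPole-v1"), device="cpu", verbose=0)
model.learn(total_timesteps=1_000)
```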
Related: https://github.com/DLR-RM/stable-baselines3/issues/90 and https://github.com/DLR-RM/stable-baselines3/issues/122#issuecomment-1065057830 (there I provide a Colab notebook to compare)

> I noticed the training FPS reduced by a lot from 1300fps to 900fps.

Did you double-check that the hyperparameters were equivalent? What were you using for SB2 PPO?
EDIT: the 1.4x difference seems to match the results I got with the Colab notebooks
> Related: #90 and #122 (comment) (there I provide a Colab notebook to compare)
>
> > I noticed the training FPS reduced by a lot from 1300fps to 900fps.
>
> Did you double-check that the hyperparameters were equivalent? What were you using for SB2 PPO?

I was using the default parameters: https://github.com/hill-a/stable-baselines/blob/45beb246833b6818e0f3fc1f44336b1c52351170/stable_baselines/ppo2/ppo2.py#L53
The only parameter I am not sure about is `batch_size`; I experimented with different values (256, 512, 1024) and the performance is still lower.

> EDIT: the 1.4x difference seems to match the results I got with the Colab notebooks

Interesting, I did not see this, so basically my results are probably normal.
> The only parameter I am not sure about is `batch_size`; I experimented with different values (256, 512, 1024) and the performance is still lower.

See the conversion for batch size: https://stable-baselines3.readthedocs.io/en/master/guide/migration.html#ppo
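For illustration, the conversion described there roughly amounts to the following (SB2 PPO2 defaults plus the 24 envs mentioned earlier; example numbers, not a recommendation):

```python
# SB2 PPO2 splits each rollout into `nminibatches`; SB3 takes `batch_size` directly.
n_steps, n_envs, nminibatches = 128, 24, 4
batch_size = (n_steps * n_envs) // nminibatches  # = 768
```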
> Interesting, I did not see this, so basically my results are probably normal.
yes...
Reading https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/, we should probably try the channels-last memory format; it is just a few lines of code to change (https://pytorch.org/blog/tensor-memory-format-matters/) and the shape of the tensors stays the same.
@MatPoliquin all you need to do is apparently:
```python
x = x.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)
```
I would be happy to receive your feedback if you give it a try ;)
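For context, a minimal self-contained sketch of that pattern in plain PyTorch (not SB3-specific; shapes are only illustrative):

```python
import torch
import torch.nn as nn

# Small CNN converted to channels-last; the tensor shapes stay NCHW,
# only the underlying memory layout changes.
model = nn.Sequential(nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU())
model = model.to(memory_format=torch.channels_last)

x = torch.randn(8, 4, 84, 84).to(memory_format=torch.channels_last)
out = model(x)
print(out.shape)  # torch.Size([8, 32, 20, 20])
print(out.is_contiguous(memory_format=torch.channels_last))  # True
```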
> @MatPoliquin all you need to do is apparently:
>
> ```python
> x = x.to(memory_format=torch.channels_last)
> model = model.to(memory_format=torch.channels_last)
> ```
>
> I would be happy to receive your feedback if you give it a try ;)
So these changes should be made in `on_policy_algorithm.py`?
I modified the code below (starting at line 102), but I'm not quite sure if it's correct:
```python
def _setup_model(self) -> None:
    self._setup_lr_schedule()
    self.set_random_seed(self.seed)
    buffer_cls = DictRolloutBuffer if isinstance(self.observation_space, gym.spaces.Dict) else RolloutBuffer

    self.rollout_buffer = buffer_cls(
        self.n_steps,
        self.observation_space,
        self.action_space,
        device=self.device,
        gamma=self.gamma,
        gae_lambda=self.gae_lambda,
        n_envs=self.n_envs,
    )
    # added: try to put the rollout buffer in channels-last format
    self.rollout_buffer = self.rollout_buffer.to(memory_format=torch.channels_last)
    self.policy = self.policy_class(  # pytype:disable=not-instantiable
        self.observation_space,
        self.action_space,
        self.lr_schedule,
        use_sde=self.use_sde,
        **self.policy_kwargs  # pytype:disable=not-instantiable
    )
    # added: move the policy to the device using channels-last format
    self.policy = self.policy.to(self.device, memory_format=torch.channels_last)
```
> gym 0.26.2

You are using the experimental branch, right? Otherwise, SB3 is only compatible with gym 0.21 for now.
> I modified the code below but not quite sure if it's correct

Yes, and you need to modify the rollout buffer.
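To make that concrete, one hypothetical way to do it (an assumption about what modifying the rollout buffer could look like, not a confirmed fix) is to convert image batches to channels-last where the buffer turns its numpy arrays into tensors, e.g. by overriding `to_torch`:

```python
import numpy as np
import torch as th

from stable_baselines3.common.buffers import RolloutBuffer


class ChannelsLastRolloutBuffer(RolloutBuffer):
    """Hypothetical sketch: return image batches in channels-last memory format."""

    def to_torch(self, array: np.ndarray, copy: bool = True) -> th.Tensor:
        tensor = super().to_torch(array, copy)
        # Only 4-D (batch, channels, height, width) tensors have a channels-last layout.
        if tensor.dim() == 4:
            tensor = tensor.to(memory_format=th.channels_last)
        return tensor
```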
I did some quick tests with the RL Zoo (which defaults to 8 envs); here is what I can recommend.
For instance, with the default command, I get around 800 FPS:

```
python train.py --algo ppo --env PongNoFrameskip-v4 -P --verbose 0 --eval-freq 100000
```

With subprocess envs, I get 1100 FPS:

```
OMP_NUM_THREADS=4 python train.py --algo ppo --env PongNoFrameskip-v4 -P --verbose 0 --eval-freq 100000 --vec-env subproc
```
You could also try to add support for CNNs in the experimental SBX: https://github.com/araffin/sbx/pull/6 and https://github.com/araffin/sbx/pull/4
(SBX PPO is ~2x faster than SB3 PPO but it has fewer features)
> Reading https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/, we should probably try the channels-last memory format, it is just a few lines of code to change (https://pytorch.org/blog/tensor-memory-format-matters/) and the shape of the tensors stays the same.

So, I tested that but it didn't help much.
What gave me an 8% speed boost was setting `copy=False` when creating the tensors (see https://github.com/DLR-RM/stable-baselines3/compare/feat/non-blocking?expand=1).
With that and subprocess envs, I can get ~1800 FPS using:

```
python -m rl_zoo3.train --algo ppo --env PongNoFrameskip-v4 --verbose 0 -P --seed 1 -n 60000 --vec-env subproc --eval-freq -1
```
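For illustration, the `copy=False` path roughly corresponds to wrapping the buffer's numpy arrays with `torch.as_tensor` instead of `torch.tensor`, which skips an extra host-side copy (simplified sketch, not the exact branch code):

```python
import numpy as np
import torch as th

array = np.zeros((256, 4, 84, 84), dtype=np.float32)

copied = th.tensor(array)     # always allocates new memory and copies the data
shared = th.as_tensor(array)  # reuses the numpy buffer when possible (no copy)

# The shared tensor aliases the numpy array, so the array must not be
# overwritten while the tensor is still in use.
```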
Small update on that: I now have an experimental SB3 + Jax = SBX version here: https://github.com/araffin/sbx
With the proper hyperparameters, SAC can run 20x faster than its PyTorch equivalent =): https://twitter.com/araffin2/status/1590714601754497024
❓ Question
EDIT: After doing some more digging, I updated the post title and added more details with a newer version of SB3 (1.6.2).
I am using the OpenAI gym-retro env to train on games and migrated from SB2 to SB3 1.6.2. I noticed the training FPS dropped by a lot, from 1300 FPS to 900 FPS.
Using Nvidia Nsight I profiled both versions (you can find the reports in the Google Drive link below; you need Nsight to view them): https://drive.google.com/drive/folders/1Lqxf-qKXTj__Hp8WUXgNHejZaJGy8oct?usp=sharing
Here are the parameters I use for PPO with SB3 (with SB2 I just use the default parameters provided by SB):

```python
PPO(policy=args.nn, env=env, verbose=1, n_steps=128, n_epochs=4, batch_size=256,
    learning_rate=2.5e-4, clip_range=0.2, vf_coef=0.5, ent_coef=0.01,
    max_grad_norm=0.5, clip_range_vf=None)
```
My specs:
Code I use to wrap the retro env (same for both SB2 and SB3 cases):
Checklist