SB3 is in active development whereas SB2 (SB) is in maintenance mode. I use SB3 for my projects because it is more modular and less cluttered than SB2, thanks to PyTorch's dynamic computation graphs, the experience the maintainers collected implementing the algorithms, and a well thought-out design.
To add to the comment above, some of the methods are, as of writing, slower (at least without tuning, e.g. the number of threads), but we are still in the process of going over them, optimizing for speed and matching the performance of the SB2 implementations.
Hello,
I'm glad you asked ;)
As mentioned by @PartiallyTyped, SB3 is now the project actively developed by the maintainers. It does not have all the features of SB2 (yet) but is already ready for most use cases.
Did anybody compare the training speed (or other performance metrics) of SB and SB3 for the implemented algorithms (e.g., PPO)?
We have two related issues for that: #49 #48
The algorithms have been benchmarked recently in a paper for the continuous case and I have already successfully used SAC on real robots.
Because PyTorch uses dynamic graphs, you have to expect a small slowdown (we plan to use the JIT to improve speed in the future, see #57), and you may have to play with torch.set_num_threads() to get the best speed. One exception is DQN, which is significantly faster in SB3 because of the new replay buffer implementation.
Is there a reason to prefer either one for developing a new project?
The main advantage of SB3 is that it was re-built (almost) from scratch, trying not to reproduce the errors made in SB2. That means much clearer code, more test coverage and a higher quality standard (notably with the use of typing). Unless you need to use RNNs, I would highly recommend using SB3.
If you change the internals, you should expect some changes (they will be documented anyway) until v1.0 is released (see issue #1 and code review #17). If you only use the "user API" (without changing the internals), then not much should change, and I would highly recommend using the RL Zoo, which should cover most needs (and is kept up to date with the best practices for using SB3).
It is also on the roadmap to document the differences between SB2 and SB3.
Last thing, for SB3 vs other PyTorch libraries: https://github.com/DLR-RM/stable-baselines3/issues/20
Hello,
I used SB2 for training with SAC and have now switched to SB3. The SB3 implementation is currently around 2.5x slower than the SB2 one with almost the same set of (hyper)parameters. Is this something we should expect, or is something wrong in my environment and/or code?
Many thanks, Reza
Hi Reza, there are certainly some low-hanging fruits that will result in better performance, and there is some discussion on using torch's JIT. There were some changes to the continuous methods (TD3/SAC), so be sure to check those out.
Hi PartiallyTyped,
Thanks for your quick reply!
Do you have any idea where the best place is to play with torch.set_num_threads()? I would really appreciate it if you could comment on that.
BR, Reza
Do you have any idea where the best place is to play with torch.set_num_threads()? I would really appreciate it if you could comment on that.
Before creating the model, or, if you are using the RL Zoo, you can pass it as an argument to the script (--num-threads).
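For reference, a minimal sketch of where such a call can go (the environment and step count are placeholders, not the setup discussed here):

    import torch as th
    from stable_baselines3 import SAC

    # Limit PyTorch's thread count *before* the model is created;
    # for small networks, fewer threads is often faster than the default.
    th.set_num_threads(2)

    model = SAC("MlpPolicy", "Pendulum-v0", verbose=1)
    model.learn(total_timesteps=10_000)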
I used SB2 for training with SAC and have now switched to SB3. The SB3 implementation is currently around 2.5x slower than the SB2 one with almost the same set of (hyper)parameters. Is this something we should expect, or is something wrong in my environment and/or code?
Is it CPU only?
If so, you should play a bit with th.set_num_threads()
Thanks, araffin, for your reply!
Since I am using Gym, not the Zoo, I tried to use th.set_num_threads() before creating the model. I got this error message:
"MemoryError: Unable to allocate 2.12 GiB for an array with shape (1000000, 1, 568) and data type float32"
Does this show that I do not have enough memory available? I tried with different numbers, yet I always got the same error message.
Since I am using Gym, not the Zoo
Gym and the RL Zoo are two completely different things (cf. the doc). You can use the RL Zoo to train agents on Gym environments.
Does this show that I do not have enough memory available?
Yes, you don't have enough RAM. But this is off-topic.
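For context, a 1000000 x 1 x 568 float32 array is about 2.1 GiB, which matches the default off-policy replay buffer of 1e6 transitions; reducing buffer_size is the usual workaround. A hedged sketch (the env is a stand-in for the actual custom one):

    import gym
    from stable_baselines3 import SAC

    env = gym.make("Pendulum-v0")  # placeholder for the actual custom env

    # The default buffer stores 1e6 transitions; with 568-dimensional float32
    # observations that alone is ~2.1 GiB. A smaller buffer_size trades memory
    # for less replay history.
    model = SAC("MlpPolicy", env, buffer_size=100_000)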
Hello,
I used SB2 for training with SAC and have now switched to SB3. The SB3 implementation is currently around 2.5x slower than the SB2 one with almost the same set of (hyper)parameters. Is this something we should expect, or is something wrong in my environment and/or code?
Many thanks, Reza
Thinking about that again, are you sure the network is the same? The default MLP policy of SB3 for SAC is bigger, to match the original paper. All those differences will be documented in the near future (see roadmap #1).
Hi araffin,
Thanks for asking.
I am changing the default network architecture to get similar nets in SB2 and SB3. Basically, in SB3 I use net_arch=[700, 700, 250] and in SB2 I use layers=[700, 700, 250]. Does this lead to the same net, as I am assuming?
Best regards, Reza
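For reference, here is how I read that mapping; a rough sketch, not a confirmed equivalence, with a placeholder environment:

    import gym
    from stable_baselines3 import SAC

    env = gym.make("Pendulum-v0")  # placeholder, not the actual env

    # SB3: hidden layer sizes for actor and critic go through policy_kwargs
    model = SAC("MlpPolicy", env, policy_kwargs=dict(net_arch=[700, 700, 250]))

    # SB2 (stable_baselines) counterpart, for comparison:
    # from stable_baselines import SAC
    # model = SAC("MlpPolicy", env, policy_kwargs=dict(layers=[700, 700, 250]))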
I just did a comparison between SB1 and SB3. Same PC, same environment and callback. The only difference is that with SB3 I'm (finally) using my CUDA GPU (1050 Ti). Well, SB1 without GPU gives ~900 FPS while SB3 with GPU gives ~190. There should definitely be a low-hanging fruit someplace.
Just wanted to mention Sample Factory (https://venturebeat.com/2020/06/24/intels-sample-factory-speeds-up-reinforcement-learning-training-on-a-single-pc): I get ~3500 FPS on the same hardware as above (a 2-core, 6-year-old PC). Managed to get a lot more on a multi-core server.
@jarlva
Yup, SB3 is still semi-unoptimized and the first goal is to achieve the same performance as SB2. One quick trick you could try is setting the environment variable OMP_NUM_THREADS=1 (or the same via PyTorch), which in some cases drastically increases the speed.
I'd like to highlight that SB will never achieve the same speeds as Sample Factory, as that one is specifically designed for high frames-per-second and implements algorithms designed for that (i.e. IMPALA). Stable-Baselines focuses on synchronous execution.
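One way to set that variable from Python (a sketch; the actual gain depends on hardware, and the env here is only an example):

    import os

    # Must be set before torch / stable_baselines3 are imported to take effect.
    os.environ["OMP_NUM_THREADS"] = "1"

    from stable_baselines3 import PPO

    model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
    model.learn(total_timesteps=10_000)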
@jarlva
Because SB3 is built using PyTorch, there is some expected and unavoidable slowdown simply due to Python. We discussed a bit about using PyTorch's JIT in #57.
If you'd like to get your hands dirty, you could compile at least some parts, like the replay buffers, with Numba and JIT, but it isn't supported.
I also keep avoiding #93 ;)
Thanks @PartiallyTyped, just to clarify, SF is also using PyTorch. I think @Miffyli is correct.
Thanks again for everyone's response!
I was referring to relative performance between identical/same scope torch and tf implementations. @Miffyli is indeed correct.
The effect of th.set_num_threads() and #106 on a simple example (SAC on Pendulum-v0 with a small network), on CPU only:
The first group (around 100 FPS) is with num_threads=2 and the second one (around 50 FPS) is the default (I have 8 cores). There is a 2x boost.
And each time, the run with #106 is 10% faster, except when num_threads=1 (not shown here).
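A minimal way to reproduce this kind of measurement (my own timing sketch, not the exact benchmark above; the network size is an assumption for "small network"):

    import time

    import torch as th
    from stable_baselines3 import SAC

    th.set_num_threads(2)  # comment out to compare against the default

    model = SAC("MlpPolicy", "Pendulum-v0", policy_kwargs=dict(net_arch=[64, 64]))
    n_steps = 20_000
    start = time.time()
    model.learn(total_timesteps=n_steps)
    print(f"FPS: {n_steps / (time.time() - start):.0f}")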
On a related note, I am getting some rather weird performance from DQN; it seems to reach 0 FPS (it was with num_threads=1 and the old polyak update). When using an ensemble of 10 estimators I got much better performance, and I can't pinpoint the issue.
What do you call n_estimators?
In the policy, instead of having a single QNetwork, I have n_estimators identical QNetworks and their estimates are averaged.
Note, this was running on GPU and the environment was LunarLander.
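To make that setup concrete, here is how I would sketch the averaging in plain PyTorch (my interpretation only, not the actual code used):

    import torch as th
    import torch.nn as nn

    class AveragedQNetworks(nn.Module):
        """n_estimators identical Q-networks whose estimates are averaged."""

        def __init__(self, obs_dim: int, n_actions: int, n_estimators: int = 10):
            super().__init__()
            self.q_nets = nn.ModuleList(
                nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
                for _ in range(n_estimators)
            )

        def forward(self, obs: th.Tensor) -> th.Tensor:
            # Average the Q-value estimates over the ensemble
            return th.stack([q(obs) for q in self.q_nets]).mean(dim=0)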
ah ok, please move this discussion to #49 then.
As mentioned here https://github.com/DLR-RM/stable-baselines3/issues/122#issuecomment-666521802 you should consider upgrading PyTorch ;) There was a huge gain (20% faster) in the latest release. The gap is filled when setting the number of threads manually.
EDIT: apparently on CPU only
On a related note, I migrated from SB2 to SB3 and the training is taking 24 times longer (same custom environment + PPO + default hyperparameters + 100000 time steps + 8 parallel environments)... I did play with the --num-threads argument in the train.py script from the RL Zoo and found the most efficient number to be 6, but it only reduced the training time by 3%.
Any suggestions would be welcome, otherwise I might just switch back to SB2 until I find a better solution.
I'm using PyTorch with CUDA support.
Please read our migration guide (if you have not already): the default hyperparameters are not the same (tuned for Atari in SB2 vs. tuned for continuous actions in SB3). I'm surprised by the slowdown... I would appreciate it if you could provide a minimal example to reproduce it.
EDIT: I did two quick tests using the Zoo (SB2 and SB3) with 8 envs and two environments (CartPole-v1, Breakout); SB3 was ~2x slower on CartPole but 1.2x faster on Breakout. This was CPU only.
Thanks for the suggestions. I couldn't reproduce the 24x slowdown, but I prepared a minimal example where the training takes 4x longer on my custom environment (and 2.6x longer on CartPole-v1). The instructions are in the README, but let me know if you can't reproduce it. This is not too bad of a slowdown; I must have done something wrong previously.
I couldn't reproduce the 24x slowdown, but I prepared a minimal example where the training takes 4x longer on my custom environment (and 2.6x longer on CartPole-v1)
Thanks for setting that up =) After a quick check, it seems that you are using the default hyperparameters, which are different between SB2 PPO2 and SB3 PPO (cf. the migration guide https://stable-baselines3.readthedocs.io/en/master/guide/migration.html#ppo). If you want to have the same hyperparameters in SB3, you would need to do:

widowx_reacher-v1:
  n_timesteps: 100000
  normalize: true
  policy: 'MlpPolicy'
  n_envs: 8
  n_steps: 128
  n_epochs: 4
  batch_size: 256
  n_timesteps: !!float 1e7
  learning_rate: !!float 2.5e-4
  clip_range: 0.2
  vf_coef: 0.5
  ent_coef: 0.01
I would also advise you to deactivate the value-function clipping in SB2.
Note that SB2's n_minibatches leads to a batch size that depends on the number of envs, which is not the case anymore.
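To make that difference concrete, a hedged sketch of the two calls (the env is a placeholder for widowx_reacher-v1; values follow the config above):

    import gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv

    # 8 parallel copies of a placeholder env
    env = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])

    # SB3 PPO: the minibatch size is explicit and does not depend on the number of envs
    model = PPO("MlpPolicy", env, n_steps=128, batch_size=256, n_epochs=4,
                learning_rate=2.5e-4, clip_range=0.2, vf_coef=0.5, ent_coef=0.01)

    # SB2 PPO2 counterpart, for comparison: the minibatch size is implicit,
    #   n_envs * n_steps / nminibatches = 8 * 128 / 4 = 256
    # from stable_baselines import PPO2
    # model = PPO2("MlpPolicy", env, n_steps=128, nminibatches=4, noptepochs=4)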
EDIT: @PierreExeter I ran your env with the same hyperparams and got 39s (SB3) vs 39s (SB2), so the same time (CPU only, with 1 thread).
You're right, it was an issue with the hyperparameters. I also got a training time of 36s when using the SB2 default hyperparameters. I optimised the hyperparameters with Optuna and this gave me a training time of 18 minutes... I didn't realise that the hyperparameters could have such a strong effect on the training time. Thanks a lot for your useful inputs.
For latest comparison, please take a look at https://github.com/DLR-RM/stable-baselines3/issues/122#issuecomment-1065057830