Roadmap to Stable-Baselines3 V1.0

araffin commented 4 years ago

This issue is meant to be updated as the list of changes is not exhaustive

Dear all,

Stable-Baselines3 beta is now out :tada: ! This issue is meant to reference what is implemented and what is missing before a first major version.

As mentioned in the README, before v1.0, breaking changes may occur. I would like to encourage contributors (especially the maintainers) to make comments on how to improve the library before v1.0 (and maybe make some internal changes).

I will try to review the features mentioned in https://github.com/hill-a/stable-baselines/issues/576 (and https://github.com/hill-a/stable-baselines/issues/733) and I will create issues soon to reference what is missing.

What is implemented?

[x] basic features (training/saving/loading/predict)
[x] basic set of algorithms (A2C/PPO/SAC/TD3)
[x] basic pre-processing (Box and Discrete observation/action spaces are handled)
[x] callback support
[x] complete benchmark for the continuous action case
[x] basic rl zoo for training/evaluating plotting (https://github.com/DLR-RM/rl-baselines3-zoo)
[x] consistent api
[x] basic tests and most type hints
[x] continuous integration (I'm in discussion with the organization admins for that)
[x] handle more observation/action spaces #4 and #5 (thanks @rolandgvc)
[x] tensorboard integration #9 (thanks @rolandgvc)
[x] basic documentation and notebooks
[x] automatic build of the documentation
[x] Vanilla DQN #6 (thanks @Artemis-Skade)
[x] Refactor off-policy critics to reduce code duplication #3 (see #78 )
[x] DDPG #3
[x] do a complete benchmark for the discrete case #49 (thanks @Miffyli !)
[x] performance check for continuous actions #48 (even better than gSDE paper)
[x] get/set parameters for the base class (#138 )
[x] clean up type-hints in docs #10 (cumbersome to read)
[x] documenting the migration between SB and SB3 #11
[x] finish typing some methods #175
[x] HER #8 (thanks @megan-klaiber)
[x] finishing to update and clean the doc #166 (help is wanted)
[x] finishing to update the notebooks and the tutorial #7 (I will do that, only HER notebook missing)

What are the new features?

[x] much cleaner base code (and no more warnings =D )
[x] independent saving/loading/predict for policies
[x] State-Dependent Exploration (SDE) for using RL directly on real robots (this is a unique feature, it was the starting point of SB3, I published a paper on that: https://arxiv.org/abs/2005.05719)
[x] proper evaluation (using separate env) is included in the base class (using EvalCallback)
[x] all environments are VecEnv
[x] better saving/loading (now can include the replay buffer and the optimizers)
[x] any number of critics are allowed for SAC/TD3
[x] custom actor/critic net arch for off-policy algos (#113 )
[x] QR-DQN in SB3-Contrib
[x] Truncated Quantile Critics (TQC) (see #83 ) in SB3-Contrib
@Miffyli suggested a "contrib" repo for experimental features (it is here)

What is missing?

[x] syncing some files with Stable-Baselines to remain consistent (we may be good now, but need to be checked)
[x] finish code-review of exisiting code #17

Checklist for v1.0 release

[x] Update Readme
[x] Prepare blog post
[x] Update doc: add links to the stable-baselines3 contrib
[x] Update docker image to use newer Ubuntu version
[x] Populate RL zoo

What is next? (for V1.1+)

basic dict/tuple support for observations (#243 )
simple recurrent policies? (https://github.com/DLR-RM/stable-baselines3/issues/18)
DQN extensions (double, PER, IQN) (https://github.com/DLR-RM/stable-baselines3/issues/622)
Implement TRPO (https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/pull/40)
multi-worker training for all algorithms (#179 )
n-step returns for off-policy algorithms #47 (@PartiallyTyped )
SAC discrete #157 (need to be discussed, benefit vs DQN+extensions?)
Energy Based Prioritisation? (@RyanRizzo96)
implement action_proba in the base class?
test the doc snippets #14 (help is welcomed)
noisy networks (https://arxiv.org/abs/1706.10295) @PartiallyTyped ? exploration in parameter space? (https://github.com/DLR-RM/stable-baselines3/issues/622)
Munchausen Reinforcement Learning (MDQN) (probably in the contrib first, e.g. https://github.com/pfnet/pfrl/pull/74)

side note: should we change the default start_method to fork? (now that we don't have tf anymore)

m-rph commented 4 years ago

Maybe N-step returns for TD3 (and DDPG) and DQN (and friends) ? If it's implemented in the experience replay, then it is likely plug and play for TD3 and DQN, an implementation for SAC probably requires extra effort.

Perhaps at a later time, e.g. V1.1+ retrace, tree backup, Q(lambda), Importance Sampling for n-step returns? If retrace and friends are planned for later, then it should be taken into consideration when implementing n-steps.

Miffyli commented 4 years ago

@PartiallyTyped

Yup, that would be v1.1 thing but indeed planned. Should probably go over original SB issues to gather all these suggestions at some point.

m-rph commented 4 years ago

Perhaps a discrete version of SAC for v1.1+? https://arxiv.org/abs/1910.07207

Edit: I can implement this, and add types to the remaining methods after my finals (early June).

rolandgvc commented 4 years ago

I will start working on the additional observation/action spaces this weekend 👍

jkterry1 commented 4 years ago

Will the stable baselines 3 repo/package replace the existing stable baselines one, or will all this eventually be merged into the normal stable baselines repo?

Miffyli commented 4 years ago

@justinkterry

There are no plans to merge/combine the two repositories. Stable-baselines will continue to exist, and continue to receive bug-fixes and the like for some time before it is archived.

jkterry1 commented 4 years ago

@Miffyli Thank you. Will it remain as "pip3 install stable-baselines," or become something like "pip3 install stable-baselines3"?

Miffyli commented 4 years ago

@justinkterry

You can already install sb3 with pip3 install stable-baselines3. The original repo will stay as pip3 install stable-baselines.

AdamGleave commented 4 years ago

Minor point but I wonder if we should rename BaseRLModel to BaseRLAlgorithm and BasePolicy to BaseModel, given that BasePolicy is more than just a policy?

araffin commented 4 years ago

Minor point but I wonder if we should rename BaseRLModel to BaseRLAlgorithm and BasePolicy to BaseModel, given that BasePolicy is more than just a policy?

Good point, BaseModel and BaseRLAlgorithm are definitely better names ;)

jdily commented 4 years ago

for visualization, probably using something like weights & biases (https://www.wandb.com/) is an option? So that no need for tensorflow dependency. I can help to add functions to do that.

araffin commented 4 years ago

for visualization, probably using something like weights & biases (https://www.wandb.com/) is an option?

correct me if I'm wrong but W&B does not work offline, no? This is really important as you don't want your results to be published when you do private work.

This could be also implemented either as a callback (cf doc) or a new output for the logger. But sounds more like a "contrib" module to me.

m-rph commented 4 years ago

Perhaps an official shorthand for stable-baselines and stable-baselines3 e.g. sb and sb3?


import stable_baselines3 as sb3

ManifoldFR commented 4 years ago

Is it necessary to continue to provide the interface for vectorized environments inside of this codebase? They were contributed upstream back to gym in this PR. After that PR was merged, packages such as PyBullet (pybullet_envs) started providing vectorized variants of their own environments using the interface from gym which should be the same as the one here (for now)

Miffyli commented 4 years ago

@ManifoldFR Somehow that has eluded my attention. Looks like a good suggestion! Less repeated code is better, as long it fits in stable-baselines functions too.

@araffin thoughts (you have most experience doing the eval/wrap functions)? I imagine the hardest part is to update all wrappers that work on vectorized environments.

ManifoldFR commented 4 years ago

I happened onto it by chance because it's not documented anywhere inside of gym's docs, the openai people ported it from their own baselines repo with barely any notification of the change to end users.

araffin commented 4 years ago

I was aware of this (wrote some comments at that time https://github.com/openai/gym/pull/1513/files#r293899941) but I would argue against for different reasons:

we rely on some specific features (set_attr, get_attr)
openai version is undocumented and we don't know if they gonna break that feature (which is central in SB3) in a future release (I don't want to write new monkey patch like https://github.com/hill-a/stable-baselines/commit/678f803698de1a3735ded16f4a0c8d47e329465f)
we can directly tweak that feature to fit our needs (and don't wait for a review and release by OpenAI)

So, in short, I would be in favor only if OpenAI way of maintaining Gym was more reliable.

PS: thanks @ManifoldFR for bringing that up ;)

ManifoldFR commented 4 years ago

I see, it figures that OpenAI changing things unilaterally without documentation would be a problem.

I guess ensuring stable-baselines3's code doesn't break when running vectorized envs derived from Gym's implementation would be safer and easier instead...

m-rph commented 4 years ago

How about vectorized/stacked action noise for TD3/SAC? I am referring to https://github.com/hill-a/stable-baselines/issues/805 . This will be a useful stepping stone for multiprocessing in TD3/SAC.

The motivation behind this addition is that for OU or other noise processes we need the state to exist until the end of the trajectory, thus we can't use a single process for multiple parallel episodes as they have different lengths. Thus, stacked/vectorized processes allow us to keep the processes for as long as the particular episode goes.

Also the code is done and has types, I'd just change the *lst in the reset function to a single argument over a variadic argument.

araffin commented 4 years ago

How about vectorized/stacked action noise for TD3/SAC?

sounds good. If you have test included, you can do the PR. It won't be used for a while though... a bit like what we did for dict support inside VecEnv (was not used until HER was re-implemented).

m-rph commented 4 years ago

For v1.1+ Network Randomization appears to be useful and simple enough to implement. I will implement it for a project I am working on (Obstacle Tower) so if you believe it will be useful to have, I can open an issue specifically for it and discuss it there.

araffin commented 4 years ago

Network Randomization appears to be useful and simple enough to implement.

I would rather favor what we did for the pre-training (issue #27 ), create an example code/notebook and link to it in the documentation.

m-rph commented 4 years ago

What do you think about Noisy Linear layers for v1.0?

araffin commented 4 years ago

Noisy Linear layers

What is that?

m-rph commented 4 years ago

It's from this paper. The tl:dr is that they are linear layers that have parameter noise. This results in better exploration over e-greedy/ entropy bonus. The layers learn a mu and a logstd, then on the forward pass, they sample from the distribution.

araffin commented 4 years ago

The tl:dr is that they are linear layers that have parameter noise. This results in better exploration over e-greedy/ entropy bonus. The layers learn a mu and a logstd, then on the forward pass, they sample from the distribution.

I considered that more as an extension to DQN (so v1.1+) but yes, why not (I added it to the roadmap). V1.0 is mainly meant to be the basic algorithms, fully tested and without extension.

In fact, from what I remember, this is very close to exploration in parameter space. For continuous actions, gSDE is already implemented (see https://arxiv.org/abs/2005.05719), however, such exploration is missing for discrete actions.

m-rph commented 4 years ago

I agree on 1.1, it will be much easier to incorporate once the code is stable. I have it in internal code for PPO and I will be happy to make a PR when the time comes.

AlessandroZavoli commented 4 years ago

Do you think that algorithms like IMPALA or APEX are worth implementing?

Miffyli commented 4 years ago

Those could be planned for v1.2+ and onwards, as discussed here. The thing is that the structure does not really fit well with the asynch nature of those algorithms.

AlessandroZavoli commented 4 years ago

multi-agent could be an very interesting step forward. I don't know if it is worth to develop a brand-new code if the structure may not fit this paradigm (the multi-agent, not one specific algorithm or another) that could be appealing for the community, given the fact that there are also few competitors.

jkterry1 commented 4 years ago

Hey, so I've been managing the development of a library called PettingZoo since January. It's basically a multi-agent version of Gym, with a very Gym like API. It's involved people from UMD, Technical University Berlin, MILA, UT Austin, Google Research, and DeepMind. We even made changes to the ALE to allow playing multi-player Atari games with RL for the first time.

https://github.com/PettingZoo-Team/PettingZoo

Right now it's believed to mostly be in fully working order and is soft released, and we're aiming for a full release in the next month or so. This is something worth consideration in the planning of multi-agent support in stable baselines.

Additionally, me and several of the people at UMD have done a bunch with multi-agent RL, and have been wanting to add support for it to SB3 once it was more production ready. If there's interest in doing official multi-agent support with SB3, we should open a separate thread for the discussion of what that should look like.

araffin commented 4 years ago

Hey, so I've been managing the development of a library called PettingZoo since January. It's basically a multi-agent version of Gym, with a very Gym like API. It's involved people from UMD, Technical University Berlin, MILA, UT Austin, Google Research, and DeepMind. We even made changes to the ALE to allow playing multi-player Atari games with RL for the first time.

https://github.com/PettingZoo-Team/PettingZoo

Right now it's believed to mostly be in fully working order and is soft released, and we're aiming for a full release in the next month or so. This is something worth consideration in the planning of multi-agent support in stable baselines.

Additionally, me and several of the people at UMD have done a bunch with multi-agent RL, and have been wanting to add support for it to SB3 once it was more production ready. If there's interest in doing official multi-agent support with SB3, we should open a separate thread for the discussion of what that should look like.

@justinkterry what you are doing is almost spam... You have been posting your project link already in 3 different issues. We appreciate the effort of making a MA version of SB3 but this will be delayed to a separate issue. In order to keep this thread clean and focus on the 1.0 roadmap, i will delete your and my comment, feel free to open a new issue for discussion about MA support.

araffin commented 4 years ago

I move the discussion of MA/Distributed agents support to https://github.com/DLR-RM/stable-baselines3/issues/69 I deleted/hide some comments to re-focus on 1.0 roadmap.

araffin commented 4 years ago

I have some good news: performances of all algorithms are matching the ones from Stable-Baselines (SB2) both in discrete and continuous environments :tada: ! See #48 and #49 (with some nice detective work from @Miffyli in #110 ) I think I will add a note in the README to show that SB3 is now trustworthy ;)

I could even reproduce results (in fact have even better ones after the bug fixes) from the gSDE paper

ManifoldFR commented 4 years ago

One interesting feature would be to supply different inputs to the policy and value networks, when these are separate. This is the principle behind Asymmetric Actor Critic, which is used at OpenAI to make the critic omniscient and get a somewhat better advantage estimate during policy optimization

araffin commented 4 years ago

One interesting feature would be to supply different inputs to the policy and value networks, when these are separate. This is the principle behind Asymmetric Actor Critic, which is used at OpenAI to make the critic omniscient and get a somewhat better advantage estimate during policy optimization

The scope of this method is unfortunately very narrow (only applies when you want to do sim2real and have access to the simulator) compared to the amount of changes needed.

cosmir17 commented 3 years ago

@PartiallyTyped I think you are already aware but I am also mentioning here. I found a source code example for the original SAC Discrete Implementation paper (the one you found).

The author also publicised his code. https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/master/agents/actor_critic_agents/SAC_Discrete.py

Hope these help, Sean

araffin commented 3 years ago

I found the original example for the SAC Discrete Implementation plan. Can the following paper be considered? https://arxiv.org/abs/1910.07207

We already have an issue for that #157

cosmir17 commented 3 years ago

@araffin I wasn't asking if we can implement it. Given that it has been decided as shown on the load-map and PartialTyped volunteered here, I was giving him or the team a resource. https://github.com/DLR-RM/stable-baselines3/issues/1#issuecomment-625938738 Actually, I missed the footnote need to be discussed, benefit vs DQN+extensions? I am posting my suggestion in the page now.

araffin commented 3 years ago

First release candidate is out: https://github.com/DLR-RM/stable-baselines3/releases/tag/v1.0rc0 100+ trained rl models will be published soon: https://github.com/DLR-RM/rl-baselines3-zoo/pull/69

ziegenbalg commented 3 years ago

Excellent work guys! So is there a full example of A2C (or other) games using a Dict observation space? I tried modifying the test_dict_env.py but end up with a

Traceback (most recent call last): File "/home/eziegenbalg/Documents/goldengoose_research/stable-baselines3/tests/test_dict_env.py", line 125, in <module> test_dict_spaces(A2C, True) File "/home/eziegenbalg/Documents/goldengoose_research/stable-baselines3/tests/test_dict_env.py", line 119, in test_dict_spaces model = model_class("MlpPolicy", env, gamma=0.5, seed=1, **kwargs) File "/usr/local/lib/python3.9/site-packages/stable_baselines3/a2c/a2c.py", line 80, in __init__ super(A2C, self).__init__( File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 76, in __init__ super(OnPolicyAlgorithm, self).__init__( File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/base_class.py", line 156, in __init__ env = self._wrap_env(env, self.verbose, monitor_wrapper) File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/base_class.py", line 209, in _wrap_env env = ObsDictWrapper(env) File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/vec_env/obs_dict_wrapper.py", line 45, in __init__ dimensions = [venv.observation_space.spaces["observation"].n, venv.observation_space.spaces["desired_goal"].n] KeyError: 'observation'

Error.

Miffyli commented 3 years ago

If you want a more practical example, see this comment that has code using ViZDoom in it. Other than that there is the documentation that has examples on using and modifying things.

ziegenbalg commented 3 years ago

@Miffyli, thank you for your quick reply. Am I understanding this correctly.. You are wrapping your Doom Dict Env in a DummyVecEnv? Why not just pass the Doom Env straight to PPO()?

Miffyli commented 3 years ago

That is used to gather rollout samples from multiple environments at the same time (leads to stabler training with PPO). SB3 also wraps all environments, even if you only have a single one, into a VecEnv under the hood.

ziegenbalg commented 3 years ago

Ok first off... are you a bot? So there was a specific reason you wrapped it in DummyVecEnv and not DummyDictEnv? I'll take a look at the difference between those DummyEnv now. I will try out the effects of wrapping with and without DummyVecEnv/DummyDictEnv with A2C and report back here for future truth seekers.

Miffyli commented 3 years ago

Ok first off... are you a bot?

Nah, just happen to be around when you are sending messages :)

So there was a specific reason you wrapped it in DummyVecEnv and not DummyDictEnv?

Yes, I see now the issue. These two things are very different: DummyDictEnv is just a testing environment used to test the support for dict-observations. DummyVecEnv is a VecEnv implementation that does not really parallelize anything. There is no connection between the two. You should not use DummyDictEnv (it is only meant for testing), you should use DummyVecEnv (or SubProcVecEnv) when you give environments to SB3 code.

DLR-RM / stable-baselines3