Maybe N-step returns for TD3 (and DDPG) and DQN (and friends)? If it's implemented in the experience replay buffer, then it is likely plug-and-play for TD3 and DQN; an implementation for SAC probably requires extra effort.
Perhaps at a later time (e.g. v1.1+): Retrace, tree backup, Q(lambda), importance sampling for n-step returns? If Retrace and friends are planned for later, they should be taken into consideration when implementing n-step returns.
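To make the "in the replay buffer" idea concrete, here is a rough sketch of folding the last n transitions into a single stored transition (make_n_step_transition is a hypothetical helper, not existing SB3 code):

```python
from typing import List, Tuple

# Hypothetical helper (not SB3 code): fold the oldest transition and the
# transitions that follow it into a single n-step transition
# (obs, action, n-step return, bootstrap obs, done) before storing it.
def make_n_step_transition(
    transitions: List[Tuple],  # [(obs, action, reward, next_obs, done), ...]
    gamma: float = 0.99,
):
    obs, action = transitions[0][0], transitions[0][1]
    n_step_return, next_obs, done = 0.0, None, False
    for i, (_, _, reward, step_next_obs, step_done) in enumerate(transitions):
        n_step_return += (gamma ** i) * reward
        next_obs, done = step_next_obs, step_done
        if step_done:  # do not bootstrap past the end of an episode
            break
    return obs, action, n_step_return, next_obs, done
```

Since the buffer still stores ordinary (obs, action, reward, next_obs, done) tuples, TD3/DQN could consume them almost unchanged; the TD target just has to bootstrap with gamma raised to the number of accumulated steps.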
@PartiallyTyped
Yup, that would be a v1.1 thing, but it is indeed planned. We should probably go over the original SB issues to gather all these suggestions at some point.
Perhaps a discrete version of SAC for v1.1+? https://arxiv.org/abs/1910.07207
Edit: I can implement this, and add types to the remaining methods after my finals (early June).
I will start working on the additional observation/action spaces this weekend 👍
Will the stable baselines 3 repo/package replace the existing stable baselines one, or will all this eventually be merged into the normal stable baselines repo?
@justinkterry
There are no plans to merge/combine the two repositories. Stable-baselines will continue to exist, and continue to receive bug-fixes and the like for some time before it is archived.
@Miffyli Thank you. Will it remain as "pip3 install stable-baselines," or become something like "pip3 install stable-baselines3"?
@justinkterry
You can already install sb3 with pip3 install stable-baselines3. The original repo will stay as pip3 install stable-baselines.
Minor point, but I wonder if we should rename BaseRLModel to BaseRLAlgorithm and BasePolicy to BaseModel, given that BasePolicy is more than just a policy?
Minor point but I wonder if we should rename BaseRLModel to BaseRLAlgorithm and BasePolicy to BaseModel, given that BasePolicy is more than just a policy?
Good point, BaseModel and BaseRLAlgorithm are definitely better names ;)
For visualization, something like Weights & Biases (https://www.wandb.com/) is probably an option? That way there is no need for a TensorFlow dependency. I can help add functions to do that.
For visualization, something like Weights & Biases (https://www.wandb.com/) is probably an option?
Correct me if I'm wrong, but W&B does not work offline, does it? This is really important, as you don't want your results to be published when you are doing private work.
This could also be implemented either as a callback (cf. doc) or as a new output for the logger. But it sounds more like a "contrib" module to me.
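As a rough illustration of the callback route (WandbCallback is a hypothetical name; this assumes wandb is installed and wandb.init() has been called beforehand):

```python
import wandb
from stable_baselines3.common.callbacks import BaseCallback

class WandbCallback(BaseCallback):
    """Hypothetical callback that forwards a few training stats to W&B."""

    def _on_step(self) -> bool:
        # Log something cheap every step; richer stats could be pulled
        # from self.model or self.locals depending on the algorithm.
        wandb.log({"num_timesteps": self.num_timesteps})
        return True  # returning False would stop training early
```

It would then be passed to training as model.learn(total_timesteps=..., callback=WandbCallback()).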
Perhaps an official shorthand for stable-baselines and stable-baselines3, e.g. sb and sb3?
import stable_baselines3 as sb3
Is it necessary to continue to provide the interface for vectorized environments inside of this codebase?
They were contributed upstream back to gym in this PR. After that PR was merged, packages such as PyBullet (pybullet_envs) started providing vectorized variants of their own environments using the interface from gym, which should be the same as the one here (for now).
@ManifoldFR Somehow that has eluded my attention. Looks like a good suggestion! Less repeated code is better, as long as it fits the stable-baselines functions too.
@araffin thoughts (you have the most experience with the eval/wrap functions)? I imagine the hardest part is to update all wrappers that work on vectorized environments.
I happened upon it by chance because it's not documented anywhere in Gym's docs; the OpenAI people ported it from their own baselines repo with barely any notification of the change to end users.
I was aware of this (I wrote some comments at the time: https://github.com/openai/gym/pull/1513/files#r293899941), but I would argue against it for different reasons:
So, in short, I would be in favor only if OpenAI's way of maintaining Gym were more reliable.
PS: thanks @ManifoldFR for bringing that up ;)
I see, it figures that OpenAI changing things unilaterally without documentation would be a problem.
I guess ensuring stable-baselines3's code doesn't break when running vectorized envs derived from Gym's implementation would be safer and easier instead...
How about vectorized/stacked action noise for TD3/SAC? I am referring to https://github.com/hill-a/stable-baselines/issues/805. This will be a useful stepping stone for multiprocessing in TD3/SAC.
The motivation behind this addition is that for OU (or other stateful) noise processes we need the state to persist until the end of the trajectory, so we can't use a single process for multiple parallel episodes, as they have different lengths. Stacked/vectorized processes allow us to keep each process alive for as long as its particular episode runs.
Also, the code is done and has type hints; I'd just change the *lst in the reset function from a variadic argument to a single argument.
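Roughly, the idea looks like this (StackedActionNoise and the per-index reset are illustrative, not the final API):

```python
import numpy as np
from stable_baselines3.common.noise import ActionNoise

class StackedActionNoise(ActionNoise):
    """Illustrative sketch: one noise process per parallel environment."""

    def __init__(self, noise_processes):
        super().__init__()
        self.noise_processes = noise_processes

    def __call__(self) -> np.ndarray:
        # One noise sample per environment, stacked along the batch dimension.
        return np.stack([noise() for noise in self.noise_processes])

    def reset(self, indices=None) -> None:
        # Reset only the processes whose episodes just ended.
        indices = range(len(self.noise_processes)) if indices is None else indices
        for i in indices:
            self.noise_processes[i].reset()
```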
How about vectorized/stacked action noise for TD3/SAC?
Sounds good. If you have tests included, you can open the PR.
It won't be used for a while though... a bit like what we did for dict support inside VecEnv (it was not used until HER was re-implemented).
For v1.1+, Network Randomization appears to be useful and simple enough to implement. I will implement it for a project I am working on (Obstacle Tower), so if you believe it would be useful to have, I can open an issue specifically for it and discuss it there.
Network Randomization appears to be useful and simple enough to implement.
I would rather favor what we did for the pre-training (issue #27): create an example code/notebook and link to it in the documentation.
What do you think about Noisy Linear layers for v1.0?
Noisy Linear layers
What is that?
It's from this paper. The tl;dr is that they are linear layers with parameter noise, which results in better exploration than epsilon-greedy / entropy bonuses. The layers learn a mu and a log-std; on the forward pass, they sample the weights from that distribution.
The tl;dr is that they are linear layers with parameter noise, which results in better exploration than epsilon-greedy / entropy bonuses. The layers learn a mu and a log-std; on the forward pass, they sample the weights from that distribution.
I considered that more as an extension to DQN (so v1.1+), but yes, why not (I added it to the roadmap). V1.0 is mainly meant to be the basic algorithms, fully tested and without extensions.
In fact, from what I remember, this is very close to exploration in parameter space. For continuous actions, gSDE is already implemented (see https://arxiv.org/abs/2005.05719); however, such exploration is missing for discrete actions.
I agree on v1.1; it will be much easier to incorporate once the code is stable. I have it in internal code for PPO and will be happy to make a PR when the time comes.
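In the meantime, a minimal sketch of such a layer (simplified: independent Gaussian noise with sigma learned directly, rather than the paper's factorised-noise variant):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Sketch of a noisy linear layer: independent Gaussian noise on weights and biases."""

    def __init__(self, in_features: int, out_features: int, sigma_init: float = 0.017):
        super().__init__()
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fresh noise is sampled on every forward pass; at evaluation time one
        # could instead use only the mean parameters (weight_mu / bias_mu).
        weight = self.weight_mu + self.weight_sigma * torch.randn_like(self.weight_sigma)
        bias = self.bias_mu + self.bias_sigma * torch.randn_like(self.bias_sigma)
        return F.linear(x, weight, bias)
```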
Do you think that algorithms like IMPALA or APEX are worth implementing?
Those could be planned for v1.2+ and onwards, as discussed here. The thing is that the current structure does not really fit well with the asynchronous nature of those algorithms.
Multi-agent support could be a very interesting step forward. I don't know whether it is worth developing brand-new code if the structure may not fit this paradigm (multi-agent in general, not one specific algorithm or another), but it could be appealing for the community, given that there are also few competitors.
Hey, so I've been managing the development of a library called PettingZoo since January. It's basically a multi-agent version of Gym, with a very Gym-like API. It has involved people from UMD, Technical University Berlin, MILA, UT Austin, Google Research, and DeepMind. We even made changes to the ALE to allow playing multi-player Atari games with RL for the first time.
https://github.com/PettingZoo-Team/PettingZoo
Right now it's believed to be mostly in full working order and is soft-released, and we're aiming for a full release in the next month or so. This is something worth considering in the planning of multi-agent support in stable baselines.
Additionally, several of the people at UMD and I have done a lot of work with multi-agent RL, and have been wanting to add support for it to SB3 once it is more production-ready. If there's interest in doing official multi-agent support with SB3, we should open a separate thread to discuss what that should look like.
@justinkterry What you are doing is almost spam... You have been posting your project link in 3 different issues already. We appreciate the effort of making an MA version of SB3, but this will be delayed to a separate issue. In order to keep this thread clean and focused on the 1.0 roadmap, I will delete your and my comments; feel free to open a new issue for discussion about MA support.
I moved the discussion of MA/distributed agents support to https://github.com/DLR-RM/stable-baselines3/issues/69 and deleted/hid some comments to re-focus on the 1.0 roadmap.
I have some good news: the performance of all algorithms matches Stable-Baselines (SB2) in both discrete and continuous environments :tada: ! See #48 and #49 (with some nice detective work from @Miffyli in #110). I think I will add a note in the README to show that SB3 is now trustworthy ;)
I could even reproduce the results from the gSDE paper (in fact, I obtained even better ones after the bug fixes).
One interesting feature would be to supply different inputs to the policy and value networks, when these are separate. This is the principle behind Asymmetric Actor Critic, which is used at OpenAI to make the critic omniscient and get a somewhat better advantage estimate during policy optimization
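A conceptual sketch of the idea (not SB3 API; the privileged input would typically only be available in simulation):

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """Conceptual sketch: the actor sees the real observation, the critic also sees privileged state."""

    def __init__(self, obs_dim: int, priv_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, action_dim)
        )
        self.critic = nn.Sequential(
            nn.Linear(obs_dim + priv_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, obs: torch.Tensor, priv_state: torch.Tensor):
        # The critic's value estimate uses both the observation and the privileged state.
        value = self.critic(torch.cat([obs, priv_state], dim=-1))
        return self.actor(obs), value
```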
One interesting feature would be to supply different inputs to the policy and value networks, when these are separate.
The scope of this method is unfortunately very narrow (it only applies when you want to do sim2real and have access to the simulator) compared to the amount of changes needed.
@PartiallyTyped I think you are already aware, but I am also mentioning it here: I found a source code example for the SAC-Discrete paper (the one you found).
The author also publicised his code: https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/master/agents/actor_critic_agents/SAC_Discrete.py
Hope these help, Sean
I found the original example for the SAC-Discrete implementation. Can the following paper be considered? https://arxiv.org/abs/1910.07207
We already have an issue for that #157
@araffin I wasn't asking if we can implement it. Given that it has already been decided, as shown on the roadmap, and @PartiallyTyped volunteered here, I was giving him or the team a resource. https://github.com/DLR-RM/stable-baselines3/issues/1#issuecomment-625938738
Actually, I missed the footnote:
"need to be discussed, benefit vs DQN+extensions?"
I am posting my suggestion on the page now.
The first release candidate is out: https://github.com/DLR-RM/stable-baselines3/releases/tag/v1.0rc0 100+ trained RL models will be published soon: https://github.com/DLR-RM/rl-baselines3-zoo/pull/69
Excellent work guys! So is there a full example of A2C (or other algorithms) using a Dict observation space? I tried modifying test_dict_env.py but ended up with the following error:
Traceback (most recent call last):
  File "/home/eziegenbalg/Documents/goldengoose_research/stable-baselines3/tests/test_dict_env.py", line 125, in <module>
    test_dict_spaces(A2C, True)
  File "/home/eziegenbalg/Documents/goldengoose_research/stable-baselines3/tests/test_dict_env.py", line 119, in test_dict_spaces
    model = model_class("MlpPolicy", env, gamma=0.5, seed=1, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/a2c/a2c.py", line 80, in __init__
    super(A2C, self).__init__(
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 76, in __init__
    super(OnPolicyAlgorithm, self).__init__(
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/base_class.py", line 156, in __init__
    env = self._wrap_env(env, self.verbose, monitor_wrapper)
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/base_class.py", line 209, in _wrap_env
    env = ObsDictWrapper(env)
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/vec_env/obs_dict_wrapper.py", line 45, in __init__
    dimensions = [venv.observation_space.spaces["observation"].n, venv.observation_space.spaces["desired_goal"].n]
KeyError: 'observation'
If you want a more practical example, see this comment, which has code using ViZDoom in it. Other than that, the documentation has examples on using and modifying things.
@Miffyli, thank you for your quick reply. Am I understanding this correctly: you are wrapping your Doom Dict env in a DummyVecEnv? Why not just pass the Doom env straight to PPO()?
That is used to gather rollout samples from multiple environments at the same time (which leads to more stable training with PPO). SB3 also wraps all environments into a VecEnv under the hood, even if you only have a single one.
Ok first off... are you a bot? So was there a specific reason you wrapped it in DummyVecEnv and not DummyDictEnv? I'll take a look at the difference between those dummy envs now. I will try out the effects of wrapping with and without DummyVecEnv/DummyDictEnv with A2C and report back here for future truth seekers.
Ok first off... are you a bot?
Nah, just happen to be around when you are sending messages :)
So was there a specific reason you wrapped it in DummyVecEnv and not DummyDictEnv?
Yes, I see the issue now. These two things are very different: DummyDictEnv is just a testing environment used to test support for dict observations, while DummyVecEnv is a VecEnv implementation that does not actually parallelize anything. There is no connection between the two. You should not use DummyDictEnv (it is only meant for testing); you should use DummyVecEnv (or SubprocVecEnv) when you give environments to SB3 code.
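For example, a minimal usage sketch (MyCustomEnv is a placeholder for your own gym.Env; passing a bare env to PPO also works, since SB3 wraps it in a DummyVecEnv internally):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# MyCustomEnv is your own gym.Env subclass (placeholder here).
# Four copies of the environment, stepped sequentially in the same process.
env = DummyVecEnv([lambda: MyCustomEnv() for _ in range(4)])
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10_000)
```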
This issue is meant to be updated, as the list of changes is not exhaustive.
Dear all,
Stable-Baselines3 beta is now out :tada: ! This issue is meant to reference what is implemented and what is missing before a first major version.
As mentioned in the README, before v1.0, breaking changes may occur. I would like to encourage contributors (especially the maintainers) to make comments on how to improve the library before v1.0 (and maybe make some internal changes).
I will try to review the features mentioned in https://github.com/hill-a/stable-baselines/issues/576 (and https://github.com/hill-a/stable-baselines/issues/733) and I will create issues soon to reference what is missing.
What is implemented?
What are the new features?
- ... (e.g. EvalCallback)
- VecEnv ...
What is missing?
Checklist for v1.0 release
What is next? (for V1.1+)
- action_proba in the base class?
- Side note: should we change the default start_method to fork? (now that we don't have tf anymore)
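For reference, the start_method in question is the one passed to SubprocVecEnv; a minimal sketch (make_env is a placeholder returning a fresh env instance):

```python
from stable_baselines3.common.vec_env import SubprocVecEnv

# make_env is a placeholder for a function that returns a fresh gym.Env.
# "fork" starts workers faster but is only available on Unix platforms.
env = SubprocVecEnv([make_env for _ in range(4)], start_method="fork")
```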