DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Feature Request] Multi-Agent (MA) Support / Distributed algorithms (IMPALA/APEX) #69

Closed: araffin closed this issue 1 year ago

araffin commented 4 years ago

Here is an issue to discuss multi-agent and distributed agent support.

My personal view is that this should be done outside SB3 (even though it could use SB3 as a base), and in any case not before v1.2+.

Related issues:

Related projects: "Slime Volley Ball" (self-play) and "Adversarial Policies" in https://stable-baselines.readthedocs.io/en/master/misc/projects.html

This may interest: @eugenevinitsky @justinkterry @Ujwal2910 @AlessandroZavoli anyone else?

Ujwal2910 commented 4 years ago

Would be happy to contribute

Miffyli commented 4 years ago

My personal view is that this should be done outside SB3 (even though it could use SB3 as a base), and in any case not before v1.2+.

A thought: maybe it could be part of the planned "contrib" repo, if possible? Albeit, depending on the level of "multi-agentism" we want to support (how much the agents communicate, etc.), this may require a big rework of the code.

AlessandroZavoli commented 4 years ago

One possibility (I hope it simplifies the problem statement) is that M agents interact with one environment by sending an [M x N_action] tensor(?) of actions, and they receive back observation, reward, and done signals of shape [M x N_obs], [M x 1], and [M x 1].
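As an illustration only (this is not an SB3 or Gym API, and every name in it is hypothetical), the interface described above could look roughly like this for a fixed number of homogeneous agents:

```python
import numpy as np

class SimpleParallelMultiAgentEnv:
    """Illustrative sketch: M homogeneous agents, fixed for the whole episode."""

    def __init__(self, n_agents=4, n_obs=8, n_action=2):
        self.n_agents = n_agents
        self.n_obs = n_obs
        self.n_action = n_action

    def reset(self):
        # [M x N_obs] observation matrix, one row per agent
        return np.zeros((self.n_agents, self.n_obs), dtype=np.float32)

    def step(self, actions):
        # actions: [M x N_action] array, one row per agent
        assert actions.shape == (self.n_agents, self.n_action)
        obs = np.random.randn(self.n_agents, self.n_obs).astype(np.float32)
        rewards = np.zeros((self.n_agents, 1), dtype=np.float32)  # [M x 1]
        dones = np.zeros((self.n_agents, 1), dtype=bool)          # [M x 1]
        return obs, rewards, dones, {}
```

This bakes in the assumption that all agents act simultaneously with identical observation and action shapes, which is exactly the limitation discussed in the next comment.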

jkterry1 commented 4 years ago

@Miffyli If you take a look at the multi-agent DRL methods that are currently widely used in the literature for cooperative scenarios, virtually everything you see is just single-agent learning happening in parallel, or more commonly that with some or all network parameters shared. The most commonly used MARL methods for cooperative games are centralized critic methods, like MADDPG or COMA. This can be done with SB3 with very minor modifications.

Learning in competitive scenarios is all based around things like self-play, which can also be done with minor modification or wrapping. The problem is that to get good performance you typically have to employ things like leagues, as AlphaStar did, which to me seems beyond the scope of even a multi-agent version of Stable Baselines.

@AlessandroZavoli So, many other people have tried to create clean multi-agent RL APIs before. Notable examples include RLlib and OpenSpiel. If you take a look at RLlib's API, it handles this in a way that's a bit similar in philosophy to yours, but with dictionaries instead of packing things into a tensor. The problem with that approach, and with what you described, is that they assume all agents act and observe simultaneously. It turns out that that's a really problematic assumption, because APIs based around it can't cleanly handle strictly turn-based games (like chess or Go). It also turns out that, in practice, there are a ton of games that seem like they're fully parallel but actually aren't, which has caused a bunch of almost impossible to track down bugs in various major MARL environments. I actually have a paper under review at NeurIPS that's in part about a bug in an environment from Mykel Kochenderfer's group caused by this exact problem.

Other problems specific to what you propose arise from the fact that in many multi-agent environments you have to support agents dying, or the number of agents changing in general. Also, different types of agents often need differently sized action and observation spaces, which packing everything into a single tensor as you proposed doesn't allow for. It also turns out that you can't quite do the naive thing of just iterating through all agents, because how you have to handle reward in those scenarios often gets incredibly weird.

It turns out that, having tried to do it myself many times, creating a unified API that makes sense in all typical cases and isn't really ugly and difficult to work with is really hard. OpenSpiel was the first major library to achieve reasonable support for every type of environment you might want, but their API has a lot of undesirable properties. Accuse me of spamming if you must, but I'd really encourage you to at least take a look at the PettingZoo API. It's the result of a lot of people who primarily study multi-agent RL spending a lot of time thinking about this problem. It also includes every popular MARL environment from the literature under one API, and that API is very similar to Gym, both of which bring various other practical benefits.
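For reference, here is a sketch of the agent-iteration style being advocated, which handles turn-based and parallel-looking games with the same loop. Exact version suffixes and return signatures drift between PettingZoo releases; this follows the older four-value form of env.last(), and newer releases split done into termination/truncation:

```python
import numpy as np
from pettingzoo.classic import chess_v5  # version suffix changes across releases

def choose_action(observation):
    # Placeholder policy: pick uniformly among the legal moves exposed by the
    # action mask (classic-game observations are dicts with an "action_mask" key).
    legal = np.flatnonzero(observation["action_mask"])
    return int(np.random.choice(legal))

env = chess_v5.env()
env.reset()
# Agents are yielded in whatever order the game dictates, so strictly turn-based
# games and "parallel-looking" games use the exact same loop, and agents that
# are done simply stop appearing.
for agent in env.agent_iter():
    observation, reward, done, info = env.last()
    action = choose_action(observation) if not done else None
    env.step(action)
env.close()
```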

AlessandroZavoli commented 4 years ago

@justinkterry I don't have your experience, so take this as a naive point of view.

We could start from a simple problem and then move to more complex ones, without attempting to solve all existing MARL types with one API. If you think it is worthwhile, we could start with MADDPG and a fixed number of agents acting in parallel, if that simplifies the problem statement.

With respect to using an existing API for such a core building block, I think the decision should be up to the maintainers, after weighing the pros and cons.

jkterry1 commented 4 years ago

Different types of MARL methods bootstrap different single-agent methods, so adding support for a class of MARL methods (e.g. centralized critic) on top of multiple different single-agent RL methods is almost trivial. MADDPG, for instance, is centralized-critic DDPG. I've been involved with a handful of MADDPG implementations, and for a handful of reasons they always lend themselves to severe bugs that are almost impossible to sort out. The best place to start would be fully independent learning and/or full parameter sharing for cooperative environments. That's what I've been planning to do when SB3 gets to an adequate state.
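To make the "centralized critic" idea concrete, here is a minimal PyTorch sketch (not SB3 code; all names are illustrative) of an MADDPG-style critic that conditions on every agent's observation and action, while each actor still only sees its own observation at execution time:

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Q(o_1..o_M, a_1..a_M): one value for the joint observation-action."""

    def __init__(self, n_agents, obs_dim, act_dim, hidden=256):
        super().__init__()
        joint_dim = n_agents * (obs_dim + act_dim)
        self.q_net = nn.Sequential(
            nn.Linear(joint_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_actions):
        # all_obs: [batch, n_agents, obs_dim], all_actions: [batch, n_agents, act_dim]
        joint = torch.cat([all_obs.flatten(1), all_actions.flatten(1)], dim=1)
        return self.q_net(joint)

critic = CentralizedCritic(n_agents=3, obs_dim=8, act_dim=2)
q_values = critic(torch.zeros(16, 3, 8), torch.zeros(16, 3, 2))  # -> [16, 1]
```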

I'd also argue that, independent of PettingZoo, it's a dramatically better choice to pick an API that exists than to make your own.

benblack769 commented 4 years ago

I agree that keeping the algorithm (A2C, TD3, etc.) separate from the framework (APEX, parameter sharing, etc.) is a powerful way of supporting a wide variety of use cases easily.

However, this requires a stable API that supports all the features the frameworks require. So I've tried to see what needs to be done to implement these, working backwards:

Distributed replay framework (offline policies only)

Seems to only require:

Full parameter sharing

For the very simplest cases, which closely follow the partially observable Markov game model, these games can actually be modeled as vector environments. So supporting them seems to only require:

For traditional games like chess, or even slightly complicated cases (for example, certain Atari games in https://github.com/PettingZoo-Team/Multi-Agent-ALE), this doesn't work, as agents may die before the environment is done or may not take turns every step. So instead, more customization is needed. Some options are:

  1. Worst case, the collect_rollouts method needs to be rewritten to handle the multi-agent environment API. Unfortunately, this method is a little bit intimidating and is not consistent across algorithms, making it horrible to customize.
  2. So instead, the collect_rollouts method could be refactored to make it easier to override, perhaps separating out the part where actions are computed from the part that interacts with the environment (see the sketch below). Also, the notion of an "adder" may be a powerful refactor.
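A rough sketch of what option 2 could look like (hypothetical names, not actual SB3 code): the policy-side half and the environment-side half become separate overridable methods, so a multi-agent variant only has to replace the interaction part.

```python
class RolloutCollectorSketch:
    """Hypothetical refactor of collect_rollouts; names are illustrative only."""

    def compute_actions(self, obs):
        """Policy-side half: turn observations into actions (shared by all variants)."""
        raise NotImplementedError

    def interact(self, actions):
        """Environment-side half: the piece a multi-agent API would override."""
        raise NotImplementedError

    def collect_rollouts(self, n_steps, initial_obs):
        obs = initial_obs
        transitions = []
        for _ in range(n_steps):
            actions = self.compute_actions(obs)
            next_obs, rewards, dones = self.interact(actions)
            # an "adder" object could encapsulate how transitions reach the buffer
            transitions.append((obs, actions, rewards, next_obs, dones))
            obs = next_obs
        return transitions
```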

Summary

For APEX:

For parameter sharing:

Of course, I may be missing some subtle detail and there may be more that needs to be done, but I think this is a good start.

jkterry1 commented 3 years ago

So we added basic support for multi-agent environments to Stable Baselines with a third-party wrapper. I wrote a small tutorial on how to use it, targeted at beginners in RL:

https://towardsdatascience.com/multi-agent-deep-reinforcement-learning-in-15-lines-of-code-using-pettingzoo-e0b963c0820b
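The core of the tutorial's approach is full parameter sharing via SuperSuit's vectorization wrappers: each agent becomes one slot of a vectorized env, so a single SB3 policy trains on every agent's experience. A condensed sketch (exact version suffixes like v0/v1/v6 drift between SuperSuit and PettingZoo releases, so treat them as placeholders):

```python
import supersuit as ss
from stable_baselines3 import PPO
from pettingzoo.butterfly import pistonball_v6  # version suffix may differ

env = pistonball_v6.parallel_env()
env = ss.color_reduction_v0(env, mode="B")     # grayscale to shrink observations
env = ss.resize_v1(env, x_size=84, y_size=84)
env = ss.frame_stack_v1(env, 3)
# Each agent becomes one "environment" in the vectorized env (parameter sharing).
env = ss.pettingzoo_env_to_vec_env_v1(env)
env = ss.concat_vec_envs_v1(env, 8, num_cpus=4, base_class="stable_baselines3")

model = PPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=2_000_000)
model.save("pistonball_shared_policy")
```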

araffin commented 3 years ago

So we added basic support for multi-agent environments to Stable Baselines with a third-party wrapper. I wrote a small tutorial on how to use it, targeted at beginners in RL:

Thanks for sharing. Two quick questions:

jkterry1 commented 3 years ago

Sorry for the delayed reply. SB2 was used because a recent change you made in the 1.0 release of SB3 broke SuperSuit; that has since been fixed, and the tutorial has been updated to use SB3. Using a stochastic policy for testing was an accident, thank you for pointing that out.

Also, out of curiosity, is SuperSuit of any interest to the Stable Baselines sphere? Simple single-line preprocessing wrappers for Gym environments seem to fit fairly well with your ethos.

araffin commented 3 years ago

Is SuperSuit of any interest to the Stable Baselines sphere? Simple single-line preprocessing wrappers for Gym environments seem to fit fairly well with your ethos.

This could probably be included in our docs, under the "projects" section. What do you think @Miffyli? (And as it is a separate package, I don't think it really fits SB3 contrib in that case.)

Miffyli commented 3 years ago

I agree that a link under the "projects" section is more suitable, as it is a separate package, unless the idea is to bring all the wrappers etc. into contrib (i.e. to merge the two repos).

EloyAnguiano commented 3 years ago

Do you know if there are any plans to include model-based algorithms such as PlaNet or Dreamer? Sorry if this issue is not the place to post this, but I don't see any issue related to model-based algorithms, and I see that new inclusions are being discussed here.

Miffyli commented 3 years ago

Do you know if there are any plans to include model-based algorithms such as PlaNet or Dreamer? Sorry if this issue is not the place to post this, but I don't see any issue related to model-based algorithms, and I see that new inclusions are being discussed here.

There are no specific plans to add these algorithms; however, implementations of anything that fits the SB3 format are welcome in the contrib repository! We as maintainers do not have time to implement all possible algos out there ^^

araffin commented 3 years ago

Do you know if there are any plans to include model-based algorithms such as PlaNet or Dreamer? Sorry if this issue is not the place to post this, but I don't see any issue related to model-based algorithms, and I see that new inclusions are being discussed here.

There are no specific plans to add these algorithms; however, implementations of anything that fits the SB3 format are welcome in the contrib repository! We as maintainers do not have time to implement all possible algos out there ^^

As mentioned in our blog post, "to keep SB3 simple to use and maintain, we focus on model-free, single-agent RL algorithms", so we won't implement model-based algorithms in SB3. As @Miffyli wrote, the contrib repo may be the place for them, but in fact I would prefer that to be an external project, as we do for imitation or offline RL.

PS: the correct place to ask would have been a separate "question" issue, but I think here is fine as long as we keep the discussion short.

araffin commented 1 year ago

Closing as outside the scope of SB3.