DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Multi Output Policy Support? #527

Open H-Park opened 3 years ago

H-Park commented 3 years ago

Question

Are multi-output policies supported yet? I see that dictionary observations are supported per the docs, however I do not see anything about multi-output policies...

Additional context

I want to make a wrapper around PySC2 now that dictionary observations are supported; however, multi-output policy support is still required.
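For context, a rough sketch of the kind of multi-output (Dict action space) interface this would need, loosely modelled on PySC2; the env, space names, and shapes below are purely illustrative, not actual PySC2 or SB3 code:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class FakePySC2Env(gym.Env):
    """Toy stand-in for a PySC2 wrapper, only meant to show the spaces involved."""

    def __init__(self):
        super().__init__()
        # Dict observations are already supported by SB3 (MultiInputPolicy).
        self.observation_space = spaces.Dict({
            "screen": spaces.Box(0, 255, shape=(64, 64, 3), dtype=np.uint8),
            "available_actions": spaces.MultiBinary(10),
        })
        # The missing piece: a Dict *action* space, i.e. a "multi output" policy.
        self.action_space = spaces.Dict({
            "function_id": spaces.Discrete(10),
            "target_xy": spaces.Box(0.0, 1.0, shape=(2,), dtype=np.float32),
        })

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        # Dummy dynamics: random observation, zero reward, never terminates.
        return self.observation_space.sample(), 0.0, False, False, {}
```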


Miffyli commented 3 years ago

This is a feature I think would nicely complement dictionary observations. In the past we talked with @araffin about this, and the biggest issues are 1) what the correct implementation of it is and 2) what to do about support for off-policy algorithms (a very different implementation). I think A2C and PPO could support multiple, independent action spaces, and this should work well.

@araffin Comments? Should this be a contrib thing if the DQN/SAC/TD3 implementation is not trivial or doable? At least on the A2C/PPO side, independent action spaces are a common way to approach this.
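To sketch what I mean by independent action spaces (illustration only, not SB3 API; the sub-action names and shapes are made up): each sub-action gets its own distribution, and the joint log-probability and entropy are just sums over the sub-actions, so the A2C/PPO losses stay unchanged.

```python
import torch as th
from torch.distributions import Categorical, Normal

# Hypothetical per-sub-action distributions produced by separate policy heads
dists = {
    "function_id": Categorical(logits=th.zeros(1, 10)),
    "target_xy": Normal(th.zeros(1, 2), th.ones(1, 2)),
}
actions = {
    "function_id": th.tensor([3]),
    "target_xy": th.zeros(1, 2),
}

# Independence assumption: the joint log prob / entropy factorize into sums,
# so the PPO/A2C objectives can be computed exactly as in the single-space case.
log_prob = (
    dists["function_id"].log_prob(actions["function_id"])
    + dists["target_xy"].log_prob(actions["target_xy"]).sum(dim=-1)
)
entropy = (
    dists["function_id"].entropy()
    + dists["target_xy"].entropy().sum(dim=-1)
)
```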

araffin commented 3 years ago

> I want to make a wrapper around PySC2 now that dictionary observations are supported; however, multi-output policy support is still required.

what type of multi output policy is required? (discrete/continuous or other?)

> @araffin Comments? Should this be a contrib thing if the DQN/SAC/TD3 implementation is not trivial or doable? At least on the A2C/PPO side, independent action spaces are a common way to approach this.

I don't have many more comments than in https://github.com/DLR-RM/stable-baselines3/issues/349#issuecomment-800198204

> At least on the A2C/PPO side, independent action spaces are a common way to approach this.

ah, do you have some reference for that?

Miffyli commented 3 years ago

> ah, do you have some reference for that?

Not a solid one right now, but at least this paper suggests starting with independent spaces before investigating whether adding dependencies would help. The latter would be very task-specific and hard to support in SB3, while independent spaces would be comparatively easy.
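For concreteness, a toy contrast between independent and dependent (autoregressive) sub-action heads, with made-up layer sizes; none of this is SB3 code:

```python
import torch as th
import torch.nn as nn
from torch.distributions import Categorical

latent = th.zeros(1, 64)              # shared features from the policy trunk
head_a1 = nn.Linear(64, 10)           # logits for the first sub-action
head_a2_indep = nn.Linear(64, 5)      # independent second head: p(a2 | s)
head_a2_cond = nn.Linear(64 + 10, 5)  # dependent second head: p(a2 | s, a1)

a1 = Categorical(logits=head_a1(latent)).sample()
a1_onehot = nn.functional.one_hot(a1, num_classes=10).float()

# Independent: a2 is sampled without looking at a1 (easy to support generically).
a2_indep = Categorical(logits=head_a2_indep(latent)).sample()

# Dependent: a2's distribution is conditioned on the sampled a1
# (task-specific, which is why it is hard to support in a general library).
a2_cond = Categorical(logits=head_a2_cond(th.cat([latent, a1_onehot], dim=-1))).sample()
```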

H-Park commented 3 years ago

> what type of multi output policy is required? (discrete/continuous or other?)

The PySC2 docs say it's a discrete space and a box (for the x, y of a move).

Now that I think about it, this could be done with a MultiDiscrete action space with PPO.

But this feature would be really awesome!
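For reference, a minimal sketch of what that MultiDiscrete workaround could look like (the env, bin count, and reward below are placeholders rather than real PySC2 code); PPO already supports MultiDiscrete action spaces:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

N_FUNCTIONS = 10
N_BINS = 64  # resolution of the discretized (x, y) screen coordinates

class DiscretizedEnv(gym.Env):
    """Toy env: the whole action fits in a single MultiDiscrete space."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(0, 255, shape=(64, 64, 3), dtype=np.uint8)
        # [function_id, x_bin, y_bin]
        self.action_space = spaces.MultiDiscrete([N_FUNCTIONS, N_BINS, N_BINS])

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        function_id, x_bin, y_bin = action
        # Map the bins back to continuous coordinates in [0, 1]
        x, y = x_bin / (N_BINS - 1), y_bin / (N_BINS - 1)
        return self.observation_space.sample(), 0.0, False, False, {}

model = PPO("CnnPolicy", DiscretizedEnv(), verbose=0)
# model.learn(total_timesteps=10_000)
```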

araffin commented 1 year ago

It seems that @adysonmaia implemented PPO with dict action space support here: https://github.com/adysonmaia/sb3-plus/blob/main/sb3_plus/mimo_ppo/ppo.py#L24

adysonmaia commented 1 year ago

> It seems that @adysonmaia implemented PPO with dict action space support here: https://github.com/adysonmaia/sb3-plus/blob/main/sb3_plus/mimo_ppo/ppo.py#L24

Hi, I just started an implementation of PPO supporting a dict action space for independent actions. At the moment, there is no documentation and no validation tests yet. However, "official" support for this feature in either the SB3 or SB3-Contrib projects would be really interesting.

EloyAnguiano commented 1 year ago

@adysonmaia are you planning on adding this feature to sb3-contrib or publishing sb3-plus so it can be installed with pip? I am very interested in this, so please tell me whether it could be soon or not. Thanks in advance.

adysonmaia commented 1 year ago

Hi @EloyAnguiano, I intend to publish the sb3-plus project to a pip repository when its code is more stable and tested. For now, it's possible to install it via pip using the GitHub URL, for example: `pip install git+https://github.com/adysonmaia/sb3-plus#egg=sb3-plus`