Stable-Baselines-Team / stable-baselines3-contrib

Contrib package for Stable-Baselines3 - Experimental reinforcement learning (RL) code
https://sb3-contrib.readthedocs.io
MIT License

A bug with MultiDiscrete action space when actions are not the same size #80

Open vahidqo opened 2 years ago

vahidqo commented 2 years ago

When the environment has an action space in which each action has a different size, like this: self.action_space = MultiDiscrete([3, 2])

and the action mask, for example, looks like this: a = [[True, False, True], [False, True]]

the following error happens, since the rows of a are not the same size:

File ~\AppData\Roaming\Python\Python39\site-packages\sb3_contrib\common\maskable\distributions.py:228, in MaskableMultiCategoricalDistribution.apply_masking(self, masks)
    226 split_masks = [None] * len(self.distributions)
    227 if masks is not None:
--> 228     masks = th.as_tensor(masks)
    230     # Restructure shape to align with logits
    231     masks = masks.view(-1, sum(self.action_dims))

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
araffin commented 2 years ago

Hello, you should take a look at https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/issues/74 and https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/75b2de139927da26d5871aef9fd839632f73b296/sb3_contrib/common/envs/invalid_actions_env.py#L39, I think the action mask is supposed to be flattened (@kronion if it's not in the doc, we should update it).
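For instance (illustration only, not code from the linked test), for MultiDiscrete([3, 2]) the mask from the first post would be passed as one flat boolean array of length 3 + 2 = 5 instead of a ragged nested list:

import numpy as np

# One boolean per sub-action value, concatenated in dimension order:
# 3 entries for the first dimension, then 2 for the second.
mask_dim_0 = np.array([True, False, True])
mask_dim_1 = np.array([False, True])
action_masks = np.concatenate([mask_dim_0, mask_dim_1])  # shape (5,), dtype bool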

vahidqo commented 2 years ago

I think they are the same problem and it might be a bug. However, does flattening make it a discrete space? If so, what is the point of using MultiDiscrete?

vahidqo commented 2 years ago

Another problem, I think, is that with MultiDiscrete action masking, conditional masking is impossible. For example, take an action space self.action_space = MultiDiscrete([3, 2]) where the masking of the second action is based on the first one. When the action masking for the first action is a = [[True, False, True], [---, ---]], the second one should be a = [[---, ---, ---], [False, True]], so the total masking is a = [[True, False, True], [False, True]]. But if the first action masking is a = [[False, True, False], [---, ---]], the second action masking should be a = [[---, ---, ---], [True, False]], so the total masking is a = [[False, True, False], [True, False]]. I think this is not supported by MaskablePPO?


araffin commented 2 years ago

does flattening make it a discrete space?

Why? A 1D array is just a way of presenting the data, as we reshape it afterward.

The main difference between Discrete and MultiDiscrete is that you take multiple discrete actions at every step.
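For reference, this is roughly what happens internally (a simplified sketch based on the traceback above, not the exact sb3_contrib.common.maskable.distributions code): the flat mask is reshaped to align with the logits and then split per sub-action dimension.

import torch as th

action_dims = [3, 2]
# Flat mask of length sum(action_dims) = 5
flat_mask = th.tensor([True, False, True, False, True])
# Restructure shape to align with logits, as in distributions.py
masks = flat_mask.view(-1, sum(action_dims))       # shape (1, 5)
# One mask per sub-distribution: shapes (1, 3) and (1, 2)
split_masks = th.split(masks, action_dims, dim=1)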

with MultiDiscrete action masking, conditional masking is impossible [...] the masking of the second action is based on the first one

The env defines and computes the masking, so I'm not sure I see why it is not possible to give the agent the correct masking at a given step.

vahidqo commented 2 years ago

The env defines and computes the masking, so I'm not sure I see why it is not possible to give the agent the correct masking at a given step.

Because the masking is defined to depend only on the current state; however, the masking could depend on both the current state and the masking of other actions. For example, in a MultiDiscrete action space with two actions, the masking of the second action depends on the current state and the masking of the first action.

araffin commented 2 years ago

the masking could depend on both the current state and the masking of other actions.

I'm still not sure I get it. The masking of the other actions would depend on the state, right?

So even if the masking of the second action depends on the current state and the masking of the first action, because the masking of the first action depends on the state, the masking of the second action can be computed from the state alone.

Or are you talking about taking two actions sequentially and not two at the same time? (In that case, this is not a MultiDiscrete space.)

And a 1D vs. 2D array mask does not seem to make a difference.

vahidqo commented 2 years ago

Or are you talking about taking two actions sequentially and not two at the same time? (In that case, this is not a MultiDiscrete space.)

Maybe this is clearer; however, it's not exactly sequential. It's better to say that the masking of the second action depends on which first action has been chosen.

araffin commented 2 years ago

Then your problem can be defined with a Discrete space, not a MultiDiscrete one.
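One way to encode that (my own sketch, not an sb3-contrib API): map the pair of sub-actions of MultiDiscrete([3, 2]) to a single Discrete(6) action with index a1 * 2 + a2. A pair is valid iff the first sub-action is valid and the second is valid given the first, so the joint mask can express the dependency:

import numpy as np

def joint_mask(mask_first, mask_second_given_first):
    # mask_first: shape (3,), validity of each first sub-action.
    # mask_second_given_first: shape (3, 2), validity of the second
    # sub-action for each possible first sub-action.
    mask_first = np.asarray(mask_first)
    mask_second_given_first = np.asarray(mask_second_given_first)
    # Pair (a1, a2) maps to index a1 * 2 + a2 in Discrete(6).
    return (mask_first[:, None] & mask_second_given_first).flatten()

mask = joint_mask([True, False, True],
                  [[False, True], [True, False], [True, False]])
valid = np.flatnonzero(mask)       # e.g. array([1, 4])
a1, a2 = divmod(int(valid[0]), 2)  # decode index 1 -> (a1=0, a2=1)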

Yangxiaojun1230 commented 2 years ago

@araffin Hi araffin, I installed sb3-contrib with "pip install sb3-contrib"; why do I still encounter the problem "ImportError: cannot import name 'register_policy' from 'stable_baselines3.common.policies'"? I think you already use policy aliases instead of register_policy.

araffin commented 2 years ago

If you install the SB3 master version, you must do the same for SB3-contrib. PS: please don't post off-topic comments.

Yangxiaojun1230 commented 1 year ago

Or are you talking about taking two actions sequentially and not two at the same time? (In that case, this is not a MultiDiscrete space.)

Maybe this is clearer; however, it's not exactly sequential. It's better to say that the masking of the second action depends on which first action has been chosen.

Hi, did you find any solution to handle your scenario? I have the same problem as yours.

Yangxiaojun1230 commented 1 year ago

the masking could depend on both the current state and the masking of other actions.

I'm still not sure I get it. The masking of the other actions would depend on the state, right?

So even if the masking of the second action depends on the current state and the masking of the first action, because the masking of the first action depends on the state, the masking of the second action can be computed from the state alone.

Or are you talking about taking two actions sequentially and not two at the same time? (In that case, this is not a MultiDiscrete space.)

And a 1D vs. 2D array mask does not seem to make a difference.

Hi araffin, for the scenario you mentioned, "two actions but not taken at the same time", do you have any advice on how to handle such a case? I don't think this is a multi-agent case, since the second action mask depends on the first action's result.

vahidqo commented 1 year ago

Or are you talking about taking two actions sequentially and not two at the same time? (In that case, this is not a MultiDiscrete space.)

Maybe this is clearer; however, it's not exactly sequential. It's better to say that the masking of the second action depends on which first action has been chosen.

Hi, did you find any solution to handle your scenario? I have the same problem as yours.

Hi, not yet.

Yangxiaojun1230 commented 1 year ago

Or are you talking about taking two actions sequentially and not two at the same time? (In that case, this is not a MultiDiscrete space.)

Maybe this is clearer; however, it's not exactly sequential. It's better to say that the masking of the second action depends on which first action has been chosen.

Hi, did you find any solution to handle your scenario? I have the same problem as yours.

Hi, not yet.

Maybe you can try VecEnv; this class can execute environments in sequence. I found that it can set the values of variables in other environments: get the first action's result, modify the corresponding variables, then call the second env's action, so the second env's action is generated based on the results of the first env.

vahidqo commented 1 year ago

Or are you talking about taking two actions sequentially and not two at the same time? (In that case, this is not a MultiDiscrete space.)

Maybe this is clearer; however, it's not exactly sequential. It's better to say that the masking of the second action depends on which first action has been chosen.

Hi, did you find any solution to handle your scenario? I have the same problem as yours.

Hi, not yet.

Maybe you can try VecEnv; this class can execute environments in sequence. I found that it can set the values of variables in other environments: get the first action's result, modify the corresponding variables, then call the second env's action, so the second env's action is generated based on the results of the first env.

Could you provide a simple code example?

Yangxiaojun1230 commented 1 year ago

You could refer to https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/vec_env/dummy_vec_env.py. In DummyVecEnv, modify the step_wait() function according to your own needs.
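A minimal sketch of that idea (untested, simplified from DummyVecEnv.step_wait; shared_state is a hypothetical attribute your own envs would define): step the envs one by one and hand information from env i to env i + 1 before it steps.

import numpy as np
from copy import deepcopy
from stable_baselines3.common.vec_env import DummyVecEnv

class SequentialVecEnv(DummyVecEnv):
    def step_wait(self):
        for env_idx in range(self.num_envs):
            if env_idx > 0:
                # Hypothetical hand-off: expose the previous env's outcome
                # to this env before it steps.
                self.envs[env_idx].unwrapped.shared_state = self.envs[env_idx - 1].unwrapped.shared_state
            obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[
                env_idx
            ].step(self.actions[env_idx])
            self.buf_dones[env_idx] = terminated or truncated
            if self.buf_dones[env_idx]:
                # Same auto-reset convention as DummyVecEnv.
                self.buf_infos[env_idx]["terminal_observation"] = obs
                obs, _ = self.envs[env_idx].reset()
            self._save_obs(env_idx, obs)
        return self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones), deepcopy(self.buf_infos)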

CppMaster commented 1 year ago

Maybe this is clearer; however, it's not exactly sequential. It's better to say that the masking of the second action depends on which first action has been chosen.

Conditional action masking that depends on a chosen action is not implemented in the repo. It could be done, however.

Let's say you have a MultiDiscrete space of [2, 3], where the mask for the 2nd part depends on which action you take in the 1st part: when you take 0 in the 1st part, the masking for the 2nd would be [True, False, True], and if you take 1, it would be [True, True, False]. The current action mask is a flat list of 5 elements (2 + 3), so it is not sufficient. Instead you could pass the action mask in the form ([True, True], [[True, False, True], [True, True, False]]), where the first list holds the masking for the 1st part and the second holds the maskings for taking 0 and taking 1, stacked. Then you would also need to override functions in MaskableCategorical so that it applies this conditional masking correctly.
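As data, the proposal could look like this (my own illustration of the structure described above, not an existing sb3-contrib feature):

import numpy as np

# MultiDiscrete([2, 3]) with a conditional mask for the 2nd part.
first_mask = np.array([True, True])          # mask for the 1st sub-action
second_mask_per_first = np.array([
    [True, False, True],   # 2nd-part mask if the 1st action is 0
    [True, True, False],   # 2nd-part mask if the 1st action is 1
])

def resolve_second_mask(first_action: int) -> np.ndarray:
    # Select the 2nd-part mask once the 1st sub-action is known.
    return second_mask_per_first[first_action]

print(resolve_second_mask(0))  # [ True False  True]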

darkopetrovic commented 5 months ago

I have the following custom environment with MultiDiscrete actions of the same size:

import gymnasium as gym
import numpy as np
from sb3_contrib.ppo_mask import MaskablePPO
from sb3_contrib.common.maskable.utils import get_action_masks

class MyCustomEnv(gym.Env):
  def __init__(self):
    self.observation_space = gym.spaces.Box(low=0, high=1, shape=(8 * 8,), dtype=np.uint8)
    self.action_space = gym.spaces.MultiDiscrete([8, 8], dtype=np.uint8)

  def reset(self, seed=None, options=None):
    super().reset(seed=seed)
    return self._obs(), {}

  def _obs(self):
    return np.zeros(shape=(8 * 8, ), dtype=np.uint8)

  def step(self, action):
    return self._obs(), 1.0, False, False, {}

  def action_masks(self):
    return np.zeros(shape=(8,), dtype=np.int8), np.ones(shape=(8,), dtype=np.int8)

env = MyCustomEnv()

When I sample actions from this environment and run MaskablePPO, everything works:

env.action_space.sample(get_action_masks(env))
> array([0, 7], dtype=uint8)

model = MaskablePPO("MlpPolicy", env)
model.learn(total_timesteps=10)

However, when the MultiDiscrete actions are not of the same size, like this:

class MyCustomEnv(gym.Env):
  def __init__(self):
    self.observation_space = gym.spaces.Box(low=0, high=1, shape=(8 * 8,), dtype=np.uint8)
    self.action_space = gym.spaces.MultiDiscrete([8, 4], dtype=np.uint8)

  def reset(self, seed=None, options=None):
    super().reset(seed=seed)
    return self._obs(), {}

  def _obs(self):
    return np.zeros(shape=(8 * 8, ), dtype=np.uint8)

  def step(self, action):
    return self._obs(), 1.0, False, False, {}

  def action_masks(self):
    return np.zeros(shape=(8,), dtype=np.int8), np.ones(shape=(4,), dtype=np.int8)

env = MyCustomEnv()

Sampling actions still works:

env.action_space.sample(get_action_masks(env))
> array([0, 2], dtype=uint8)

but running MaskablePPO produces the following error:

model = MaskablePPO("MlpPolicy", env)
model.learn(total_timesteps=10)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

It is still not clear to me whether this is a bug or whether we should handle MultiDiscrete actions of different sizes differently.

How should I rewrite the above code to make it work?

araffin commented 5 months ago

How should I rewrite the above code to make it work?

Have you looked at https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/issues/80#issuecomment-1159013599 ? You should flatten the mask. It is tested here in SB3-contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/75b2de139927da26d5871aef9fd839632f73b296/tests/test_invalid_actions.py#L95-L103

Your first example only works by luck.

darkopetrovic commented 5 months ago

Ok thanks.

I'm flattening as below, and running MaskablePPO now works.

def action_masks(self):
    return np.concatenate([np.zeros(shape=(8,), dtype=np.int8), np.ones(shape=(4,), dtype=np.int8)])

I was initially misled when sampling the actions:

env.action_space.sample(get_action_masks(env))
-> AssertionError: Expects the mask to be a tuple for sub_nvec ([8 4]), actual type: <class 'numpy.ndarray'>

But I get it: this is because of the unflattened MultiDiscrete action space of the environment.

I guess we should unflatten the mask manually before sampling:

env.action_space.sample(tuple(np.array_split(get_action_masks(env), [8, ])))
araffin commented 5 months ago

env.action_space.sample is from gym, not SB3; the requirements are different (I think we had the SB3 API before gym supported masks).
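To recap the two conventions as used in the snippets above (my own summary): MaskablePPO expects the single flat concatenated array, while gymnasium's MultiDiscrete.sample expects a tuple with one array per sub-space, so the same env ends up needing both forms.

import numpy as np

# Flat form, consumed by MaskablePPO via action_masks():
flat = np.concatenate([np.zeros(8, dtype=np.int8), np.ones(4, dtype=np.int8)])
# Tuple form, consumed by env.action_space.sample(mask):
as_tuple = tuple(np.array_split(flat, [8]))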

mwalidcharrwi commented 1 month ago

Hello, you should take a look at #74 and

https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/75b2de139927da26d5871aef9fd839632f73b296/sb3_contrib/common/envs/invalid_actions_env.py#L39

, I think the action mask is supposed to be flattened (@kronion if it's not in the doc, we should update it).

I checked the shared code segment; will this be applied via ActionMasker(env, mask_fn)? I checked the documentation and there is nothing about InvalidActionEnvMultiDiscrete.

araffin commented 1 week ago

I checked the shared code segment; will this be applied via ActionMasker(env, mask_fn)? I checked the documentation and there is nothing about InvalidActionEnvMultiDiscrete.

https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/75b2de139927da26d5871aef9fd839632f73b296/sb3_contrib/common/envs/invalid_actions_env.py#L75-L76

https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/75b2de139927da26d5871aef9fd839632f73b296/sb3_contrib/common/envs/invalid_actions_env.py#L65-L72