Farama-Foundation / Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
https://gymnasium.farama.org
MIT License

[Question] In Gym + TF-Agents context, how to make actions space as a function of the current state masking impossible actions #434

Closed: fede72bari closed this issue 1 year ago

fede72bari commented 1 year ago

Question

I saw other, older posts circling around this topic, but the general answer given was along the lines of "don't worry, punish the agent and let the NN learn that it cannot take certain actions in that specific state!". Well, I don't like that, for several reasons:

many publications write the action space not as A, but as A(s), so it is normal to consider the action space a function of the current state s.

in reality, if you have a wall on your left, it is not just a matter of hurting yourself trying to pass through it; you simply do not have that option. I cannot understand why my RL agent, after training, should still have any chance, however improbable, of going left.

why should we accept the extra learning effort needed for the RL agent to learn something that is already known?

These are just some of the reasons. I saw that the Discrete space defined in the Gymnasium library accepts a masking array to declare which actions are available, but as far as I can tell it is used only in the random sampling function:

```python
def sample(self, mask: Optional[np.ndarray] = None) -> int:
    """Generates a single random sample from this space.

    A sample will be chosen uniformly at random with the mask if provided.
    """
```

Rather than this, I think that implementing a "dynamic" action space as a function of the current state should somehow affect the `agent.collect_policy` during the training process. I am struggling to find complete, working examples of how to implement such a simple capability. In the end it is not so simple for me, and I would like to understand whether there are already-developed, elegant solutions (not well documented, as regretfully many other things are not) in the TF-Agents / TensorFlow context.

pseudo-rnd-thoughts commented 1 year ago

Gym v0.25+ and Gymnasium support masking the action space to disable certain actions, which does what you want. We recommend returning the action mask for each observation in the `info` dict of `env.step`.

Then you will need to update your policy to sample using the action mask.
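Not from this thread, just a minimal sketch of the suggestion above, assuming a toy environment (the `GridEnv` class, its wall logic, and the `"action_mask"` info key are all made up for illustration). The only Gymnasium API relied on is `Discrete.sample(mask=...)`, which expects an `np.int8` array of 0s and 1s:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class GridEnv(gym.Env):
    """Hypothetical toy env: 4 discrete actions, some invalid depending on state."""

    def __init__(self):
        self.observation_space = spaces.Discrete(16)
        self.action_space = spaces.Discrete(4)
        self._state = 0

    def _action_mask(self):
        # 1 = legal, 0 = illegal; dtype must be np.int8 for Discrete.sample(mask=...)
        mask = np.ones(self.action_space.n, dtype=np.int8)
        if self._state % 4 == 0:   # e.g. a wall on the left column (made-up rule)
            mask[0] = 0
        return mask

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._state = 0
        return self._state, {"action_mask": self._action_mask()}

    def step(self, action):
        self._state = (self._state + 1) % 16   # dummy transition
        return self._state, 0.0, False, False, {"action_mask": self._action_mask()}


env = GridEnv()
obs, info = env.reset()
# Exploration respects the mask: disallowed actions are never sampled.
action = env.action_space.sample(mask=info["action_mask"])
obs, reward, terminated, truncated, info = env.step(action)
```

A trained policy would additionally need to consume the same mask (e.g. by setting the logits of illegal actions to a large negative value before sampling); that part is library-specific and not shown here.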

fede72bari commented 1 year ago

Since I am still learning, could you provide a reference to documentation with a full example? The critical parts for me are creating the mask and using it during the training process when the policy is called (even more so, integrating it with TF-Agents). Thank you very much for your support.

pseudo-rnd-thoughts commented 1 year ago

I believe that CleanRL has an implementation that uses action masking; otherwise, it is not that common to my knowledge.

rademacher-p commented 1 year ago

@pseudo-rnd-thoughts: My feeling is that a proper implementation belongs in spaces. As the OP suggests, MDP formalisms often treat the state-dependency explicitly as $\mathcal{A}(s)$. For agents taking exploratory actions, it is desirable to have action_space.sample() work without modification, IMO.

A while back I implemented a subclass of Discrete that I called DiscreteMasked, which used numpy.ma and a mask attribute that the Env is responsible for changing throughout an episode.
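The original `DiscreteMasked` implementation is not shown in this thread; the following is only a rough sketch of the idea described above (a `Discrete` subclass carrying its own mask via `numpy.ma`, which the env updates during an episode), under those assumptions:

```python
import numpy as np
import numpy.ma as ma
from gymnasium import spaces


class DiscreteMasked(spaces.Discrete):
    """Illustrative sketch: a Discrete space that owns a state-dependent mask."""

    def __init__(self, n, seed=None):
        super().__init__(n, seed=seed)
        # True means the action is currently *disallowed*; the Env mutates this.
        self.mask = np.zeros(n, dtype=bool)

    @property
    def valid_actions(self):
        # numpy.ma drops masked entries, leaving only the allowed action indices
        return ma.masked_array(np.arange(self.n), mask=self.mask).compressed()

    def sample(self, mask=None):
        # Ignore the per-call mask and use the space's own mask, so that plain
        # action_space.sample() keeps working in exploration code.
        return int(self.np_random.choice(self.valid_actions))

    def contains(self, x):
        return super().contains(x) and not self.mask[int(x)]
```

With this, the env would simply toggle `self.action_space.mask` in `reset`/`step`, and unmodified `action_space.sample()` calls respect the current state, which is what the comment above asks for.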

pseudo-rnd-thoughts commented 1 year ago

@rademacher-p That sounds interesting; we would be interested in your implementation if it adds something new. This makes me think it would be nice to support a probability distribution over the actions rather than a purely binary option.
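To illustrate the distinction being suggested (not an existing Gymnasium feature at the time of this thread), a binary mask only switches actions on or off, whereas a probability mask weights them:

```python
import numpy as np

rng = np.random.default_rng()
n_actions = 4

# Binary mask: actions are either allowed (1) or disallowed (0).
binary_mask = np.array([1, 0, 1, 1], dtype=np.int8)

# Probability mask: a full distribution over actions (action 1 impossible,
# action 3 preferred); the entries must sum to 1.
prob_mask = np.array([0.25, 0.0, 0.25, 0.5])

action = rng.choice(n_actions, p=prob_mask)
```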

fede72bari commented 1 year ago

I am a beginner in ML and RL, so I certainly cannot contribute meaningfully to developing a solution for illegal-action masking. But I looked around and found some interesting things. Given that the masking must somehow act on the NN architecture during the training process, the problem is not defining the mask in the Gym environment (it can be passed through the `info` structure); the issue is using it when training the agent.

  1. this paper formalizes what I had grasped only by intuition, showing the benefit of illegal-action masking during the training process: https://arxiv.org/pdf/2006.14171.pdf
  2. Stable Baselines3 (via sb3-contrib) has a PPO variant that handles illegal actions; maybe this could inspire a general solution in other RL libraries as well (a minimal usage sketch follows this list): https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html
  3. this post, this one and this led me to a not-well-documented part of the TF-Agents library that can maybe be used for illegal-action masking: https://www.tensorflow.org/agents/api_docs/python/tf_agents/networks/mask_splitter_network/MaskSplitterNetwork
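As referenced in item 2, here is a minimal sketch of how MaskablePPO from sb3-contrib is wired up, following its documentation; the environment id `"YourEnv-v0"` and the `action_masks()` method on the env are placeholders for whatever the user's environment provides:

```python
import numpy as np
import gymnasium as gym
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker


def mask_fn(env: gym.Env) -> np.ndarray:
    # Hypothetical helper: ask the underlying env which actions are currently legal.
    return env.unwrapped.action_masks()


env = gym.make("YourEnv-v0")      # placeholder environment id
env = ActionMasker(env, mask_fn)  # the wrapper exposes the mask to the algorithm

model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```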