Gym v0.25+ and Gymnasium support masking of the action space to disable certain actions, which does what you want.
We recommend returning the action mask for each observation in the `info` dict returned by `env.step`.
Then you will need to update your policy to sample using the action mask.
Since I am still learning, could you provide a reference to documentation with a full example? The critical parts for me are creating the mask and using it during the training process when the policy is called (and, even better, integrating it with TF-Agents). Thank you very much for your support.
I believe that cleanrl has an implementation that uses action masking; otherwise, it is not that common to my knowledge.
@pseudo-rnd-thoughts: My feeling is that a proper implementation belongs in `spaces`. As the OP suggests, MDP formalisms often treat the state dependency explicitly as $\mathcal{A}(s)$. For agents taking exploratory actions, it is desirable to have `action_space.sample()` work without modification, IMO.
A while back I implemented a subclass of `Discrete` that I called `DiscreteMasked`, which used `numpy.ma` and a `mask` attribute that the `Env` is responsible for updating throughout an episode.
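This is not the implementation referenced above, but a rough sketch of what such a subclass could look like, under assumed semantics: a boolean `mask` attribute where `True` means allowed, with `numpy.ma` used to hide the disallowed indices.

```python
import numpy as np
import numpy.ma as ma
from gymnasium.spaces import Discrete


class DiscreteMasked(Discrete):
    """Discrete space with a mutable mask of allowed actions (illustrative sketch)."""

    def __init__(self, n, seed=None):
        super().__init__(n, seed)
        # All actions allowed initially; the Env mutates this attribute each step.
        self.mask = np.ones(n, dtype=bool)

    @property
    def valid_actions(self):
        # numpy.ma hides entries whose mask value is True, so invert:
        # True in self.mask means "allowed", which must stay visible.
        return ma.masked_array(np.arange(self.n), mask=~self.mask).compressed()

    def sample(self, mask=None):
        # Sample uniformly over the currently allowed actions.
        return int(self.np_random.choice(self.valid_actions))

    def contains(self, x):
        return super().contains(x) and bool(self.mask[int(x)])
```

Usage would then be something like `space.mask[:] = new_mask` inside the env's `step`, after which `space.sample()` needs no extra arguments.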
@rademacher-p That sounds interesting; we would be interested in your implementation if it adds something new. This makes me think it would be nice to support a probability distribution over the actions rather than a purely binary option.
I am a beginner in ML and RL, so I certainly cannot make a meaningful contribution to developing a solution for masking illegal actions. But I looked around and found some interesting things. Given that the mask must somehow act on the NN architecture during training, the problem is not defining the mask in the Gym environment (it can be passed through the "info" structure); the issue is using it to train the agent.
Question
I saw other old posts circling around this topic, but the general answers were along the lines of "don't worry, let the NN learn that in that specific state it cannot take some actions by punishing it!". Well, I don't like that, for several reasons:
- many publications describe the action space not as $\mathcal{A}$, but as $\mathcal{A}(s)$, so it is normal to consider the action space as a function of the current state $s$;
- in reality, if you have a wall on your left, it is not just a matter of hurting yourself trying to pass through it: you simply do not have that option. I cannot understand why my RL agent should still have a chance, however improbable, of going left after training;
- why should we accept extra learning effort to make the RL agent learn something that is already known?

just to mention some of the reasons. I saw that the `Discrete` space defined in the Gymnasium library has a mask array to declare which actions are available, but I can see it used only in the random sampling function:
> A sample will be chosen uniformly at random with the mask if provided

Rather than this, I think that implementing a "dynamic" action space as a function of the current state should somehow affect `agent.collect_policy` during training. I am struggling to find complete, working examples of how to implement such a simple capability. In the end it is not so simple for me, and I would like to understand whether there are already developed, elegant (even if, regretfully, not well documented, like many other things) solutions in the TF-Agents / TensorFlow context.
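To make the training-side part concrete, here is a framework-agnostic sketch (plain NumPy; the function name and example numbers are made up) of the usual trick: set the logits of disallowed actions to minus infinity before the softmax, so the policy assigns them zero probability. Implementations such as cleanrl's invalid-action-masking examples follow essentially this idea; and, if I remember correctly, TF-Agents' DQN agent exposes an `observation_and_action_constraint_splitter` argument for wiring a mask into the policy, though the current docs should be checked.

```python
import numpy as np


def masked_policy_probs(logits, action_mask):
    """Turn raw policy logits into action probabilities with illegal actions zeroed out."""
    # Disallowed actions get a logit of -inf, so softmax gives them exactly 0 probability.
    masked_logits = np.where(action_mask.astype(bool), logits, -np.inf)
    # Numerically stable softmax over the remaining (legal) actions.
    z = masked_logits - masked_logits.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()


# Example: 4 actions, action 0 is illegal in the current state.
logits = np.array([2.0, 0.5, -1.0, 0.3])
mask = np.array([0, 1, 1, 1], dtype=np.int8)
probs = masked_policy_probs(logits, mask)          # probs[0] == 0.0
action = np.random.default_rng().choice(len(probs), p=probs)
```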