alex-petrenko / sample-factory

High throughput synchronous and asynchronous reinforcement learning
https://samplefactory.dev
MIT License
773 stars 106 forks

Support for more complex action distributions #253

Open theOGognf opened 1 year ago

theOGognf commented 1 year ago

The current action distribution model has some restrictions that inhibit richer families of action distributions for complex environments. As far as I can tell, only single-space distributions or tuple-space distributions are supported. Many custom environments make use of action masking and autoregressive distributions for handling complex action spaces. It'd be nice if there were an interface for registering custom action distributions, much like registering other components.

alex-petrenko commented 1 year ago

This is a very reasonable inquiry. This would be a great feature to have in a future release. BTW, contributions are welcome, and I'd be happy to review code/provide suggestions if you decide to take it on!

@theOGognf what are the specific environments that you have in mind? Having concrete examples might help!

For now I would recommend forking the code and implementing the action distribution in a manner similar to how Tuple or other action distributions are implemented.

theOGognf commented 1 year ago

Thanks for the quick response, Alex. I'd be happy to take a stab at it.

I can't share my environments, but there are a couple of examples from RLlib that get the point across. For action masking, an action mask is part of the observation and used to mask logits going into a model. Autoregressive distributions are usually specific to environments, but the whole point is building a model that can condition action heads on one another.
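The logit-masking trick mentioned above can be sketched in a few lines (a minimal sketch; `mask_logits` is a hypothetical helper, not part of sample-factory or RLlib):

```python
import torch

def mask_logits(logits: torch.Tensor, action_mask: torch.Tensor) -> torch.Tensor:
    """Replace logits of invalid actions with a large negative value so their
    post-softmax probability is effectively zero."""
    return torch.where(action_mask.bool(), logits, torch.full_like(logits, -1e9))
```

The mask comes from the environment as part of the observation; the model applies it just before building the categorical distribution, so invalid actions are never sampled.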

Here's RLlib's corresponding thread on supporting autoregressive distributions, for reference.

I think it'd be easy to support if TensorDicts were passed between components rather than flattened observation vectors, but I imagine that'd be a bit of a breaking change. Would a change like that be okay?

alex-petrenko commented 1 year ago

SF actually supports dictionaries of observations out of the box, so passing action masks along with observations should not be a problem. Just define an env with a dictionary observation space, and SF should correctly handle any number of key-value observations.

We're also already using TensorDict to pass these observations around, so this should not be an issue.

There is one design limitation motivated by performance considerations: all tensors (observations, sampled actions, etc.) should have a fixed, predetermined size. In the case of masked actions, this shouldn't be a problem, but autoregressive actions might have varied sizes (I think?). In that case I recommend allocating tensors for the maximum action length.
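The pad-to-maximum-length idea can be sketched like this (a hypothetical helper, assuming discrete sub-actions; a validity mask records which slots hold real actions):

```python
import torch

def pad_actions(actions: list[int], max_len: int, pad_value: int = -1):
    """Pad a variable-length autoregressive action sequence to a fixed
    max_len so buffers keep a predetermined shape; also return a validity
    mask marking which slots hold real actions."""
    padded = torch.full((max_len,), pad_value, dtype=torch.long)
    padded[: len(actions)] = torch.tensor(actions, dtype=torch.long)
    valid = torch.zeros(max_len, dtype=torch.bool)
    valid[: len(actions)] = True
    return padded, valid
```

Downstream code (logprob and entropy calculations) would then reduce only over the valid slots.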

There are currently two abstractions related to action distributions: ActionParameterizations (this is the part of the policy that outputs the parameters of the action distribution) and action distributions themselves.

I think to implement this properly we need facilities to define both custom parameterizations and custom ActionDistribution classes, which should have a well-defined interface. E.g. action distributions should support sampling, entropy calculation, KL-divergence calculation (or at least some proxy of it), and calculating the logprob of a sampled action.
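A masked categorical distribution satisfying that interface could be sketched as follows (a hypothetical class; SF's actual ActionDistribution base class may define a different contract):

```python
import torch
from torch.distributions import Categorical
from torch.distributions.kl import kl_divergence

class MaskedCategorical:
    """Sketch of the interface described above: sampling, logprob,
    entropy, and KL-divergence against another distribution of the
    same type. Invalid actions get a large negative logit."""

    def __init__(self, logits: torch.Tensor, mask: torch.Tensor):
        masked = torch.where(mask.bool(), logits, torch.full_like(logits, -1e9))
        self.dist = Categorical(logits=masked)

    def sample(self) -> torch.Tensor:
        return self.dist.sample()

    def log_prob(self, actions: torch.Tensor) -> torch.Tensor:
        return self.dist.log_prob(actions)

    def entropy(self) -> torch.Tensor:
        return self.dist.entropy()

    def kl_divergence(self, other: "MaskedCategorical") -> torch.Tensor:
        return kl_divergence(self.dist, other.dist)
```

Note that this object is constructed per batch from the current logits and masks, which is exactly the statefulness mentioned below.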

In the case of masked actions, this action distribution object will be stateful (i.e. holding a valid action mask). In the case of autoregressive distributions, we need some custom logic in the sample() function.
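That custom sample() logic might look roughly like this (a hypothetical two-head sketch, not SF code: the second head's logits are conditioned on an embedding of the first sampled action):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class AutoregressiveHead(nn.Module):
    """Hypothetical two-part action head: the second action's logits are
    conditioned on the embedding of the first sampled action."""

    def __init__(self, core_dim: int = 16, n1: int = 3, n2: int = 4):
        super().__init__()
        self.head1 = nn.Linear(core_dim, n1)
        self.embed1 = nn.Embedding(n1, core_dim)
        self.head2 = nn.Linear(core_dim, n2)

    def sample(self, core_out: torch.Tensor):
        d1 = Categorical(logits=self.head1(core_out))
        a1 = d1.sample()
        # condition the second head on the first sampled action
        d2 = Categorical(logits=self.head2(core_out + self.embed1(a1)))
        a2 = d2.sample()
        log_prob = d1.log_prob(a1) + d2.log_prob(a2)
        return torch.stack([a1, a2], dim=-1), log_prob
```

The joint logprob is just the sum of the per-head logprobs, so PPO-style losses work unchanged once sampling is handled.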

Overall, this seems doable! I'm excited to see this feature, and I'd be happy to help.