entity-neural-network / entity-gym

Standard interface for entity based reinforcement learning environments.

[api] Observation space feature types #1

Open cswinter opened 2 years ago

cswinter commented 2 years ago

Currently, the observation space of each Entity is defined as a single flat list of features, which are all assumed to be scalars:

https://github.com/entity-neural-network/incubator/blob/be33cae8355d0f79374718e2293983bfc5827779/entity_gym/entity_gym/environment.py#L86-L88
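
For reference, the current definition presumably looks roughly like this sketch (reconstructed from how Entity is used later in this thread; see the linked source for the actual code):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Entity:
        # Each named feature is assumed to be a single scalar float.
        features: List[str]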

There are other feature shapes that we might want to support:

As @jeremysalwen noted, explicitly modeling the structure of the input space in this way has at least two advantages:

Another consideration is the allowed type of elements. Currently, we only support floats, but we should at least also support int/categorical features that can be one-hot encoded. (tracked in entity-neural-network/entity-gym#2)

dtch1997 commented 2 years ago

I'd like to start working on this. For a quick start we could simply refactor Entity.features to be a Dict[str, np.ndarray] and then rework other interfaces that rely on Entity.

If we want to be more proper about this, we could introduce a new type Tensor that has a value property and a get_shape() method (and other convenience methods) and let Entity.features be a Dict[str, Tensor]. Tensor would simply be a wrapper around a numpy or torch tensor, but it prevents us from being tightly coupled to a single framework.
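
A minimal sketch of what such a wrapper might look like, assuming a numpy backend (the class and method names follow the proposal above):

    import numpy as np
    from typing import Tuple

    class Tensor:
        """Framework-agnostic wrapper around a backing array (numpy here)."""

        def __init__(self, value: np.ndarray) -> None:
            self._value = value

        @property
        def value(self) -> np.ndarray:
            return self._value

        def get_shape(self) -> Tuple[int, ...]:
            return self._value.shape

    # Usage: Tensor(np.zeros((3, 2))).get_shape() == (3, 2)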

@cswinter and others, curious to hear your thoughts on this approach?

cswinter commented 2 years ago

Some initial thoughts:

dtch1997 commented 2 years ago

Thanks for the response, @cswinter! Perhaps there are some considerations I'm not seeing, but I think this and entity-neural-network/entity-gym#2 can be solved together using the approach I outlined above, with some modifications: a Tensor could be defined as a Union[Continuous, Discrete]. For example, currently the observation space of MultiSnake looks like this:

    ObsSpace(
        {
            "SnakeHead": Entity(["x", "y", "color"]),
            "SnakeBody": Entity(["x", "y", "color"]),
            "Food": Entity(["x", "y", "color"]),
        }
    )

Instead of "Food": Entity(["x", "y", "color"]), we could have something like:

"Food": Entity({
        "x":         Continuous(shape=(,)),    # scalar shape is an empty tuple
        "y":         Continuous(shape=(,)), 
         "color":  Discrete(num_values=4) # categorical variable 
    }),
cswinter commented 2 years ago

Yep, that's pretty much what I had in mind!

cswinter commented 2 years ago

Small suggestion: I think it would be good for the class names to be short, since they will be written a lot. Maybe Float or Real instead of Continuous? Can't think of anything shorter than Discrete.

cswinter commented 2 years ago

Some random thoughts:

dtch1997 commented 2 years ago

Hey @cswinter I started implementing the interface we discussed: https://github.com/dtch1997/incubator/tree/feature/add_feature_types

A quick question: how set are you on having the internal representation for an Entity instance be a single flat np.ndarray? It seems like that creates a few more problems, because 1) the different feature values could have different data types, and 2) we'd need functions for converting the internal representation to the correct data type/shape of each feature.

To me the most elegant solution would be to let an Entity instance (let's call it EntityValue) be a Dict[str, np.ndarray] instead of a flat np.ndarray, which avoids having to do any conversion and also makes it much simpler to filter the features to form an ObservationSpace.

I was also thinking that we could allow Entity to be a hierarchical construct, i.e. Entity.features could be a Dict[str, Union['Entity', Variable]], so that we can compose smaller entities to form larger ones. Let me know your thoughts.
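
A rough illustration of that idea (purely hypothetical; names and structure are only for illustration):

    from dataclasses import dataclass
    from typing import Dict, Union

    @dataclass
    class Variable:
        """Leaf feature, e.g. a Continuous or Discrete value."""

    @dataclass
    class Entity:
        # A feature is either a leaf variable or a nested sub-entity.
        features: Dict[str, Union["Entity", Variable]]

    # Example: compose a snake head out of a reusable position sub-entity.
    position = Entity({"x": Variable(), "y": Variable()})
    snake_head = Entity({"position": position, "color": Variable()})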

cswinter commented 2 years ago

To me the most elegant solution would be to let an Entity instance (let's call it EntityValue) be a Dict[str, np.ndarray] instead of a flat np.ndarray, which avoids having to do any conversion and also makes it much simpler to filter the features to form an ObservationSpace.

Yeah, I think something like the Dict[str, np.ndarray] might be what we want the API to ultimately look like; it's just going to be a good amount of work to keep it performant, and we'll have to be careful about how exactly it's set up. There are a couple of places that could become a bottleneck for environments that have more than a small number of features:

  1. For an environment with many features, even just creating a Python object/numpy array for each feature and populating the dictionary could be a bottleneck.
  2. Concatenating observations from multiple environments into a batch will be slow if we have to iterate over a large Python dict. This can be avoided by environments implementing the VecEnv interface directly, so it might be OK.
  3. Pushing a (batched) observation onto the sample buffer, and shuffling the sample buffer, will be too slow if we have to iterate over a large Python dict.
  4. On each forward pass, we still need to turn all features into a contiguous tensor (or maybe one tensor per data type). Again, iterating over a large Python dict probably won't cut it.

I'm fairly sure this could all still be done efficiently in some way, maybe with a version of the RaggedBuffer type that supports multiple features and handles all the iteration over features internally. We probably still want to convert to the more condensed representation used by the network architecture as soon as possible, so we only need to perform the conversion once: probably as soon as we receive the observation from the environment, and before feeding it to the network and pushing it onto the sample buffer.

The approach I would take is to first figure out what the efficient encoded representation of everything should be, since this is what we want the network architecture and PPO code to use (and also makes that code a lot simpler). Right now, our network architecture doesn't support anything other than a flat tensor of floats and it's slightly unclear how more complex things are going to work, so I think it makes sense to still stick to that representation at least internally. We can then add a conversion layer that enables a more ergonomic API for environments, while still allowing them to directly supply the more efficient representation.
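
As a rough sketch of what such a conversion layer might do for purely continuous features (a hypothetical helper, assuming every feature array shares a leading entity dimension):

    from typing import Dict
    import numpy as np

    def flatten_features(features: Dict[str, np.ndarray]) -> np.ndarray:
        """Convert a per-entity feature dict into the flat float32 array the
        network currently consumes. Dict insertion order fixes the column order."""
        n = len(next(iter(features.values())))
        columns = [np.asarray(v, dtype=np.float32).reshape(n, -1) for v in features.values()]
        return np.concatenate(columns, axis=1)

    # {"x": shape (n,), "y": shape (n,), "color": shape (n,)} -> shape (n, 3)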

cswinter commented 2 years ago

I was also thinking that we could allow Entity to be a hierarchical construct, i.e. Entity.features could be a Dict[str, Union['Entity', Variable]], so that we can compose smaller entities to form larger ones. Let me know your thoughts.

We could already compose entities by just merging the feature dicts (and merging in multiple instances of the same entity by prefixing the features). I suppose modeling this at the level of the API could allow for things like joint feature normalization across the sub-entities. My sense is that this would complicate a lot of code and wouldn't be worth the trouble at this time. But I also don't think I fully understand the use case for this yet; did you have a particular example in mind?

dtch1997 commented 2 years ago

Yeah I think something like the Dict[str, np.ndarray] might be what we want the API to ultimately look like, it's just going to be a good amount of work to still allow it to be performant and we'll have to be careful about how exactly it's set up.

I see, okay. Concretely, I was thinking of having both an object-oriented version and a flattened version of Observation. The object-oriented version can be used internally by Environment, such as in Environment._compile_feature_filter, and then flattened once it is passed to the neural network. I haven't encountered any cases where you would need to do the reverse operation (going from the flattened representation to the object-oriented one). IMO, splitting it up like this would make implementing new Environments a lot simpler and avoid most of the performance issues you described.

FWIW, in my work with OpenAI gym this is mostly how I manage complex observation spaces too. The gym.Env can have a dictionary observation space and it gets flattened down to an array just before it gets passed to the policy network.
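
For illustration, this is roughly how that looks with gym's built-in helpers (using gym.spaces.utils.flatten; the spaces and values here are made up):

    import numpy as np
    from gym import spaces
    from gym.spaces.utils import flatten

    obs_space = spaces.Dict({
        "position": spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32),
        "color": spaces.Discrete(4),
    })
    obs = {"position": np.array([0.5, -0.3], dtype=np.float32), "color": 2}
    flat = flatten(obs_space, obs)  # Discrete becomes one-hot, so flat.shape == (6,)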

We could already compose entities by just merging the feature dicts (and merging in multiple instances of the same entity by prefixing the features). I suppose modeling this at the level of the API could allow for things like joint feature normalization across the sub-entities. My sense is that this would complicate a lot of code and wouldn't be worth the trouble at this time. But I also don't think I fully understand the use case for this yet; did you have a particular example in mind?

I think there isn't a solid need for this yet in the currently implemented environments, but it might become a useful abstraction for more complex environments which have a natural hierarchy / structure to them. It's more of a forward-thinking design decision, we definitely don't have to worry about it for now.

cswinter commented 2 years ago

I think the main use case for going from the flattened to the object-oriented version would be things like debugging, logging metrics of (flattened) feature statistics, and turning recorded sample traces back into features.