Farama-Foundation / Minari

A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities
https://minari.farama.org

[Question] Save only subset of obs space #57

Closed Howuhh closed 1 year ago

Howuhh commented 1 year ago

Question

I am trying to build a dataset for the MiniHack environment. The environment has a fairly large dictionary observation space with many keys but only one level of nesting. I don't want to save all the keys, because doing so bloats the dataset size considerably. I thought this could be done with StepDataCallback, like this:

```python
import minari

class TTYStepDataCallback(minari.StepDataCallback):
    def __call__(self, env, obs, info, action=None, rew=None, terminated=None, truncated=None):
        # Keep only the tty_* keys before building the step data.
        tty_filter_keys = ["tty_chars", "tty_colors", "tty_cursor"]
        obs = {k: v for k, v in obs.items() if k in tty_filter_keys}

        step_data = super().__call__(env, obs, info, action, rew, terminated, truncated)
        return step_data
```

However, this raises an error because some keys from the original observation space are missing. Why can't I save only the keys I want right away? I am logging data during distributed PPO training, which uses the other keys too, so they are needed for online training :(

Howuhh commented 1 year ago

For example, if some observation has dtype int64 while the needed tty_* keys are int8, everything will be saved flattened with int64 dtype. For a dataset with 500k samples this is 45GB in my case; for 10M samples, ~900GB. With int8 it would be a reasonable size.
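The promotion is easy to reproduce with Gymnasium's `flatten` (a quick sketch; the shapes and dtypes are illustrative, not MiniHack's exact space). Since int64 takes 8 bytes per element versus 1 for int8, the flattened dataset ends up roughly 8x larger than it needs to be:

```python
import numpy as np
import gymnasium as gym
from gymnasium.spaces.utils import flatten

# Illustrative mixed-dtype Dict space, not MiniHack's exact one.
space = gym.spaces.Dict({
    "tty_chars": gym.spaces.Box(low=0, high=255, shape=(24, 80), dtype=np.uint8),
    "blstats": gym.spaces.Box(low=0, high=2**31, shape=(27,), dtype=np.int64),
})
obs = space.sample()

flat = flatten(space, obs)
print(flat.dtype)  # int64 -- np.concatenate promotes the uint8 part to int64
```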

rodrigodelazcano commented 1 year ago

> For example, if some observation has dtype int64 while the needed tty_* keys are int8, everything will be saved flattened with int64 dtype. For a dataset with 500k samples this is 45GB in my case; for 10M samples, ~900GB. With int8 it would be a reasonable size.

Is this because the obs space is of dtype int64 or because h5py is storing them with this type by default?

This is something we should look into. The problem is that the obs/actions are automatically flattened, and for that we need to know the spaces: https://gymnasium.farama.org/api/spaces/utils/#gymnasium.spaces.utils.flatten. We could make the flattening optional: the observations would be saved as a collection of separate datasets, and we would need to modify the sampling function to return dictionaries.

What I don't like is the dataset having spaces different from the environment's, for reproducibility reasons. A solution could be giving the dataset its own action/observation space and serializing them into JSON, like we do for the environment.

Howuhh commented 1 year ago

@rodrigodelazcano I think this is because some observations are int64, so when flattening, NumPy casts all the other dtypes to int64. In that case it doesn't make sense to store them in another type in h5py, because the precision of the observations that actually need int64 would be lost.

> A solution could be giving the dataset its own action/observation space and serializing them into JSON, like we do for the environment

That's how I imagined it already worked. I think it is important, because all the really complex environments (which benefit most from having a dataset) have non-trivial spaces, and not all of the keys always need to be saved.

rodrigodelazcano commented 1 year ago

I agree; we just need to figure out how to specify and store the obs/action spaces of the dataset and be able to retrieve them when loading it. I'll make a PR with a possible implementation; let me know if you have any ideas.

Also, if we are doing this, I like your idea of storing dictionary spaces without flattening: https://github.com/Farama-Foundation/Minari/pull/55#discussion_r1161003283. This is straightforward, since Minari stores nested dictionaries as dataset collections. We can flatten the observations when sampling if required.
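Roughly, the unflattened on-disk layout could look like this (a sketch of the idea only, not Minari's actual HDF5 schema):

```python
import h5py
import numpy as np

# One HDF5 dataset per observation key, each keeping its native dtype,
# instead of a single flattened array upcast to a common dtype.
with h5py.File("episode_0.hdf5", "w") as f:
    obs_group = f.create_group("observations")
    obs_group.create_dataset("tty_chars", data=np.zeros((500, 24, 80), dtype=np.uint8))
    obs_group.create_dataset("tty_colors", data=np.zeros((500, 24, 80), dtype=np.int8))
    obs_group.create_dataset("tty_cursor", data=np.zeros((500, 2), dtype=np.uint8))
```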

Howuhh commented 1 year ago

@rodrigodelazcano Saving without flattening is also a good feature in my opinion, as unflattening huge observation spaces on the fly adds overhead to replay-buffer sampling time (not much per sample, but it adds up).

younik commented 1 year ago

The problem with saving without flattening is that you cannot use a NumPy buffer. This causes a much bigger performance drop than reshaping (which just changes the metadata of the array).
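For context, a minimal demonstration of why reshaping is essentially free:

```python
import numpy as np

a = np.zeros((1000, 24 * 80), dtype=np.uint8)
b = a.reshape(1000, 24, 80)  # a view: no data is copied
assert b.base is a           # same underlying buffer, only the metadata changed
```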

To address the question of this issue, I think you can do it by specifying your observation_space differently.

Nevertheless, I see saving without flattening as helpful, or even necessary, for Sequence spaces.

Howuhh commented 1 year ago

@rodrigodelazcano Unfortunately, as I said above, I can't specify a different observation space, because the method that collects data during training uses some keys while I save others, and they don't overlap. So the environment's space must remain as it is.

> The problem with saving without flattening is that you cannot use a NumPy buffer.

I don't quite understand why this is the case. If you save some custom space, the user can always build their own buffer with a separate NumPy array for each modality and then sample as usual. In the general case, the buffer can work with dictionaries of NumPy arrays and be agnostic to the space structure, as in the sketch below. There is almost no overhead here: it is still the same sampling, with at most a small loop of ~3-10 steps over the keys (and there are usually not that many).
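Something like this minimal sketch (a hypothetical buffer class, just to illustrate the idea):

```python
import numpy as np

class DictReplayBuffer:
    """Space-agnostic buffer: one contiguous array per observation key."""

    def __init__(self, obs_arrays):
        # obs_arrays: dict mapping key -> array of shape (num_steps, ...)
        self.obs = obs_arrays
        self.size = next(iter(obs_arrays.values())).shape[0]

    def sample(self, batch_size, rng):
        idx = rng.integers(0, self.size, size=batch_size)
        # One fancy-indexing op per key: the small ~3-10 step loop mentioned above.
        return {k: v[idx] for k, v in self.obs.items()}

buffer = DictReplayBuffer({
    "tty_chars": np.zeros((1000, 24, 80), dtype=np.uint8),
    "tty_colors": np.zeros((1000, 24, 80), dtype=np.int8),
})
batch = buffer.sample(256, np.random.default_rng(0))  # arrays of shape (256, 24, 80)
```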

What bothers me much more (though maybe I'm missing something), and why I have now decided not to use Minari for my project (and to write my own simplified wrapper instead), is that when flattening, all data is cast to the single largest dtype. This can turn a merely big dataset into an absolutely massive one, and it makes a real difference.

rodrigodelazcano commented 1 year ago

I understand; this makes sense to me, and I didn't think about this issue when I first implemented it. The beta release will be out soon; afterwards we will add this to our roadmap.

Thanks a lot for the suggestions @Howuhh

younik commented 1 year ago

> the buffer can work with dictionaries of NumPy arrays

You are right; it wasn't clear before, thank you for the idea!

balisujohn commented 1 year ago

@Howuhh We now support the pattern you need for MiniHack. Please check this test https://github.com/Farama-Foundation/Minari/blob/main/tests/data_collector/callbacks/test_step_data_callback.py for an example of the pattern (sketched below), and let us know if it is sufficiently expressive for your use cases.
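Roughly, the pattern is to filter keys inside the callback and pass the dataset's own observation space to the collector. A sketch modeled on that test (the MiniHack shapes/dtypes are illustrative, and the exact StepData keys and DataCollectorV0 arguments should be checked against the linked test):

```python
import gymnasium as gym
import numpy as np
from minari import DataCollectorV0, StepDataCallback


class TTYSubsetStepDataCallback(StepDataCallback):
    def __call__(self, env, **kwargs):
        step_data = super().__call__(env, **kwargs)
        # Drop everything except the tty_* keys after the step data is built.
        step_data["observations"] = {
            k: v
            for k, v in step_data["observations"].items()
            if k in ("tty_chars", "tty_colors", "tty_cursor")
        }
        return step_data


# The dataset's observation space describes only the saved keys, so it can
# differ from env.observation_space.
dataset_obs_space = gym.spaces.Dict({
    "tty_chars": gym.spaces.Box(0, 255, (24, 80), dtype=np.uint8),
    "tty_colors": gym.spaces.Box(0, 31, (24, 80), dtype=np.int8),
    "tty_cursor": gym.spaces.Box(0, 255, (2,), dtype=np.uint8),
})

env = DataCollectorV0(
    gym.make("MiniHack-River-v0"),
    step_data_callback=TTYSubsetStepDataCallback,
    observation_space=dataset_obs_space,
)
```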

Moreover, we currently officially support Discrete, Box, Dict, and Tuple spaces with arbitrary nesting, and we plan to support Text spaces.

balisujohn commented 1 year ago

We will issue a minor release soon, once we have fixed a few of our first-party datasets to comply with the new standard and slightly improved test coverage.

Howuhh commented 1 year ago

@balisujohn Cool! I think you need to fix the documentation for the new format, though.

balisujohn commented 1 year ago

We'll have that fixed shortly; the doc build failed.

balisujohn commented 1 year ago

@Howuhh https://minari.farama.org/main/content/dataset_standards/ The doc built from main now reflects the new dataset format.

younik commented 1 year ago

This issue should be solved now, so I'm closing it. @Howuhh feel free to reopen if I'm missing something.