HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

Indexing a TrajectoryDatasetSequence instance when observations are images is extremely slow. #732

Open Bpoole908 opened 1 year ago

Bpoole908 commented 1 year ago

Problem

If the observations for a given task are images and are stored using the TrajectoryDatasetSequence class, indexing is extremely slow. For instance, indexing a single trajectory can take upwards of ~60 seconds when the trajectory contains 2000 transitions. As a consequence, something like flatten_trajectories(), which algorithms like BC rely on, now takes a very long time to execute.

I'm interested in knowing whether I'm doing something wrong (optimization-wise) or whether Huggingface's Dataset class simply cannot store and retrieve NumPy arrays efficiently. I'm not particularly familiar with how the Dataset class works beyond the fact that it stores data using Arrow files, so any insights would be greatly appreciated. I'm not sure whether this is an actual bug, so I have submitted it as an enhancement for now.

I have a notebook for replicating this problem here.
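
Roughly, the benchmark in the notebook boils down to something like this (a minimal sketch using synthetic image observations rather than real rollouts; I'm assuming serialize.save/load round-trip through the HuggingFace-backed TrajectoryDatasetSequence, as they do in the version I'm testing):

```python
import time

import numpy as np
from imitation.data import rollout, serialize
from imitation.data.types import Trajectory

# One synthetic trajectory with 2000 transitions of 84x84 RGB image observations.
n_steps = 2000
obs = np.random.randint(0, 256, size=(n_steps + 1, 84, 84, 3), dtype=np.uint8)
acts = np.zeros((n_steps,), dtype=np.int64)
traj = Trajectory(obs=obs, acts=acts, infos=None, terminal=True)

serialize.save("image_trajs", [traj])   # written in the HuggingFace dataset format
loaded = serialize.load("image_trajs")  # comes back as a TrajectoryDatasetSequence

start = time.time()
_ = loaded[0]  # indexing a single trajectory is where the slowdown shows up
print(f"index one trajectory: {time.time() - start:.1f}s")

start = time.time()
_ = rollout.flatten_trajectories(list(loaded))  # what BC's data pipeline needs
print(f"flatten_trajectories: {time.time() - start:.1f}s")
```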

Solution

I'm unsure what the solution is. The biggest issue is loading stored trajectories, since that returns an instance of the TrajectoryDatasetSequence class. The old .npz format seemed significantly faster, as it bypasses the need for this class and loads image observations directly into memory. The downside, of course, is that all images end up in memory, making it hard to handle large expert datasets. This is the solution I'm currently going with, as it has already been implemented and my current use cases only require small datasets (although this will change in the future).
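
Concretely, the workaround just reads and writes the arrays with NumPy directly. The helper names below are my own, not the library's old save/load functions, and the sketch assumes discrete actions and that every trajectory is terminal:

```python
import numpy as np
from imitation.data.types import Trajectory


def save_trajs_npz(path, trajectories):
    # Concatenate all trajectories into flat arrays and remember each length.
    np.savez_compressed(
        path,
        obs=np.concatenate([t.obs for t in trajectories]),
        acts=np.concatenate([t.acts for t in trajectories]),
        lengths=np.array([len(t.acts) for t in trajectories]),
    )


def load_trajs_npz(path):
    # Everything is materialized in memory here -- fine for small expert datasets.
    data = np.load(path)
    trajs, obs_i, act_i = [], 0, 0
    for length in data["lengths"]:
        trajs.append(
            Trajectory(
                obs=data["obs"][obs_i : obs_i + length + 1],
                acts=data["acts"][act_i : act_i + length],
                infos=None,
                terminal=True,  # simplification: assumes every trajectory terminated
            )
        )
        obs_i += length + 1
        act_i += length
    return trajs
```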

ernestum commented 1 year ago

Hi @Bpoole908. Thanks for reporting this issue. Would you mind providing me with the sampled data in the two formats:

- the old .npz format
- the new HuggingFace Dataset format

and ensuring that they both contain the same trajectories? That would be really helpful for investigating this issue.

Bpoole908 commented 1 year ago

@ernestum Yeah, the notebook I linked kind of does that. I have updated the notebook to make sure it does exactly what you asked for, though. It will now generate and save both a .npz file and a HuggingFace Dataset using the same trajectories. Additionally, it tests the indexing, saving, and loading times for both dataset types. I have also attached the data it generates, though you can run the notebook yourself to generate more data.

You can also easily change the env_name variable in the notebook to MountainCar to see the difference in speeds.

dataset-test-and-data.zip

nil123532 commented 8 months ago

Did you get a solution to this problem?

Bpoole908 commented 8 months ago

Did you get a solution to this problem?

I haven't heard anything about a solution. For now, I just went back to using the old NumPy saving structure. I use the latest version of imitation, but I added back some of the NumPy-specific saving and loading functions.

nil123532 commented 8 months ago

Ok! Thank you. Did you try using any other implementation of imitation learning?

Bpoole908 commented 8 months ago

Ok! Thank you. Did you try using any other implementation of imitation learning?

Yeah, I used some versions before 1.0 and they seemed to work fine. I believe most of the code for loading is backwards compatible. If I recall correctly, they just removed the saving functions that saved using NumPy, so you have to go back and reuse those old saving functions instead of the new ones.

nil123532 commented 8 months ago

Thanks for that. Do you remember which version it was?

ernestum commented 8 months ago

@Bpoole908 @nil123532 If you could provide your problematic datasets, that would be awesome. Then we can replicate and resolve the issue. By the way, v1.0 has some performance improvements in the loading code.

nil123532 commented 8 months ago

I'd be happy to share my code with you to help pinpoint the issue. Alternatively, if it's more convenient, I can put together a brief video walkthrough of where the problem occurs. My problem isn't specifically tied to my dataset; it's related to how images are stored as observations. Storing images in .npz files causes a spike in RAM usage. The Stable Baselines docs (https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html) recommend saving image paths instead of including the images themselves in the .npz to mitigate this. Also, just to give you a complete picture, I'm currently working on training a DAgger agent.
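
To make the idea concrete, here's the kind of wrapper I have in mind (my own sketch, not an imitation or Stable Baselines API): keep only file paths in memory and read each frame on demand.

```python
from pathlib import Path

import numpy as np
from torch.utils.data import Dataset


class LazyImageTransitions(Dataset):
    """Keeps only frame paths in memory; each image is read from disk on access."""

    def __init__(self, frame_paths, acts):
        # frame_paths: e.g. ["frames/000000.npy", "frames/000001.npy", ...]
        self.frame_paths = [Path(p) for p in frame_paths]
        self.acts = np.asarray(acts)
        assert len(self.frame_paths) == len(self.acts)

    def __len__(self):
        return len(self.acts)

    def __getitem__(self, idx):
        obs = np.load(self.frame_paths[idx])  # loaded on demand, so RAM stays flat
        return {"obs": obs, "acts": self.acts[idx]}
```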

Bpoole908 commented 8 months ago

@Bpoole908 @nil123532 If you could provide your problematic datasets, that would be awesome. Then we can replicate and resolve the issue. By the way, v1.0 has some performance improvements in the loading code.

I haven't tried using Datasets with the newest 1.0 version, but when I have some time I'll give it a go. I attached both a notebook and the dataset it generated above. Keep in mind the notebook walks through and measures the time differences in loading when comparing .npz and Datasets, to clearly display the issue. If you need my conda environment, I can work on posting that as well.

Bpoole908 commented 2 months ago

I just tried with the latest version (commit a8b079c469bb145d1954814f22488adff944aa0d) using the notebook I posted earlier. Indexing HF datasets when they contain images is still VERY slow.