Open Bpoole908 opened 1 year ago
Hi @Bpoole908. Thanks for reporting this issue. Would you mind providing me with the sampled date in the two formats:
.npz
formatdatasets
formatand ensure that they both contain the same trajectories? That would be really helpful for investigating this issue.
@ernestum Yeah the notebook I linked kinda does that. I have updated the notebook to make sure it does exactly what you asked for though. It will now generate and save .npz and a HuggingFace Dataset
using the same trajectories. Additionally, it tests the indexing, saving, and loading times for both dataset types. I have also attached the data that it generates, though you can run the notebook itself to generate more data.
You can easily change the env_name
variable to MountainCar to see the difference in speeds in the notebook as well.
Did you get a solution to this problem?
Did you get a solution to this problem?
I haven't heard anything about a solution. For now, I just went back to using the old NumPy saving structure. I use the latest version of Imitation but I added back some of the NumPy specific saving and loading functions.
Ok! thank you, Did you try using any other implementation of imitation learning?
Ok! thank you, Did you try using any other implementation of imitation learning?
Yeah I used some versions before 1.0 and it seemed to work fine. I believe most of the code for loading is backwards compatible. If I recall, they just removed the saving functions which would save using NumPy so you have to go back and reuse those old saving functions instead of the new ones.
Thanks for that. Do you remember which version it was?
@Bpoole908 @nil123532 If you could provide your problematic datasets this would be awesome. Then we can replicate and resolve the issue. Btw.: The v1.0 has some performance improvements in the loading code.
I'd be happy to share my code with you to help pinpoint the issue. Alternatively, if it's more convenient, I can put together a brief video walkthrough where the problem occurs. My problem isn't specifically tied to my dataset. It's related to how images are stored as observations. Storing images in .npz files causes a spike in RAM usage. The Stable Baselines docs (https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html) recommend saving images as paths instead of including them in .npz to mitigate this. Also, just to give you a complete picture, I'm currently working on training a dagger agent.
@Bpoole908 @nil123532 If you could provide your problematic datasets this would be awesome. Then we can replicate and resolve the issue. Btw.: The v1.0 has some performance improvements in the loading code.
I haven't tried using Datasets
with the newest 1.0 version but when I have some time Ill give it a go. I attached both a notebook and dataset generated by the notebook above. Keep in mind the notebook walks through and measures the time differences in loading when comparing .npz and Datasets
to clearly display the issue. If you need my conda environment I can work on posting that as a well.
I just tried with the latest version (commit a8b079c469bb145d1954814f22488adff944aa0d) using the notebook I posted earlier. Indexing HF datasets when they contain images is still VERY slow.
Problem
If the observations for a given task are images and stored using the
TrajectoryDatasetSequence
class, indexing is extremely slow. For instance, indexing one trajectory can take upwards of ~60 seconds when there are 2000 transitions in the trajectory. The consequence of this is that something like theflatten_trajectories()
, which requires an algorithm like BC, now takes a very long time to execute.I'm interested in knowing if I'm doing something wrong (optimization wise) or Huggingface's
Dataset
class is not able to handle storing and retrieving NumPy arrays efficiently. I am not particularly familiar with how theDataset
class works beyond it stores data using arrow files. Any insights would be greatly appreciated.I'm not so sure this is an actual bug or not so I have submitted as an enhancement for now.I have a notebook for replicating this problem here.
Solution
Unsure what the solution is. The biggest issue is when loading stored trajectories as it returns an instance of the
TrajectoryDatasetSequence
class. The old .npz seemed significantly faster as it bypasses the need for this class and allows for image observations to be loaded directly into memory. The downside of course is that all images are loaded into memory, making it hard to handle large expert datasets. This is the solution I'm currently going with as it has already been implemented and my current use cases only require small datasets (although this will change in the future).