Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0

TreeSpec Error Accessing Data #388

Closed jmoller93 closed 1 month ago

jmoller93 commented 1 month ago

🐛 Bug

Every once in a while I get a TreeSpec error saying the data contains 0 leaves even though the schema expects 8. It happens stochastically and does not seem to be specific to any one dataset. I can reproduce it consistently with a blank datapoint, but it also happens occasionally when all of the data is present. I've even confirmed that the index it is trying to fetch has all of its data and can be read with the same function immediately after the failure.

Environment details:

- PyTorch version: 2.3
- OS: Linux
- How PyTorch was installed (`conda`, `pip`, source): pip
- Python version: 3.11.9
- CUDA/cuDNN version: 12.4
- GPU models and configuration: NVIDIA Tesla T4

github-actions[bot] commented 1 month ago

Hi! Thanks for your contribution! Great first issue!

jmoller93 commented 1 month ago

A little more context: this seems to happen when I try to access indices from multiple chunks at once. Basically, after using `optimize` to create my dataset, I want to access a batch of indices at random. There is a nonzero probability that the batch spans more than one chunk, which leads to the above issue.
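To illustrate what "spanning chunks" means here, the following is a minimal sketch in plain Python. It does not use litdata's API or reflect its internals; `group_indices_by_chunk` and the `chunk_sizes` list are hypothetical names made up for this example. It only shows how a random batch of global indices maps onto several chunks, each of which would have to be loaded to serve the batch.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import accumulate


def group_indices_by_chunk(indices, chunk_sizes):
    """Map global indices to {chunk_id: [offsets inside that chunk]}.

    Hypothetical helper for illustration only; not part of litdata.
    """
    # ends[i] is the exclusive global end index of chunk i.
    ends = list(accumulate(chunk_sizes))
    total = ends[-1] if ends else 0
    groups = defaultdict(list)
    for idx in indices:
        if not 0 <= idx < total:
            raise IndexError(f"index {idx} is out of range for {total} items")
        # The first chunk whose end is strictly greater than idx holds the item.
        chunk_id = bisect_right(ends, idx)
        start = ends[chunk_id - 1] if chunk_id else 0
        groups[chunk_id].append(idx - start)
    return dict(groups)


# Assumed layout: three chunks holding 4, 4 and 2 items.
chunk_sizes = [4, 4, 2]
batch = [9, 0, 5, 3, 4]
plan = group_indices_by_chunk(batch, chunk_sizes)
print(plan)
```

With this assumed layout, a batch of only five indices already touches all three chunks, which is the situation described above.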

tchaton commented 1 month ago

Hey @jmoller93, what does your code look like? The StreamingDataset is an iterable dataset. Access by index is meant more for debugging than anything else.

jmoller93 commented 1 month ago

Yeah, I was hoping to quickly access random indices in order to subsample the dataset. I'm learning that this isn't really feasible. It might be worth adding to the documentation, but otherwise thanks!

tchaton commented 1 month ago

Hey @jmoller93, yes. Unfortunately, you can't fetch items at random and keep it fast.
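One general way to draw a uniform random subset from an iterable dataset without index access is reservoir sampling, which needs only a single sequential pass. The sketch below is plain Python and not a litdata feature; `reservoir_sample` is a hypothetical helper, and it assumes that iterating over the dataset yields one sample at a time.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Hypothetical helper for illustration; it reads the iterable once, in order.
    """
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(items):
        if n < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (n + 1).
            j = rng.randint(0, n)  # both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# Any iterable works; a range stands in for a streamed dataset here.
sample = reservoir_sample(iter(range(1000)), k=8, seed=0)
print(sample)
```

The trade-off is that the whole dataset is still read once, so this avoids cross-chunk random access rather than making subsampling cheap.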

Alternatively, we have built-in sub-sampling, but you don't control which section is sampled.

Feel free to make a PR to add a note to the README.