Closed jmoller93 closed 1 month ago
Hi! Thanks for your contribution! Great first issue!
A little more context: this seems to happen when I try to access indices from multiple chunks at once. After using `optimize` to create my dataset, I want to access a batch of indices at random. There is a nonzero probability that the batch spans more than one chunk, which leads to the issue reported here.
Hey @jmoller93, what does your code look like? The StreamingDataset is an iterable dataset. Access by index is more for debugging than anything else.
Yeah, I was hoping to quickly access random indices in order to subsample the dataset. I'm learning that you can't really do that efficiently. Might be worth adding to the documentation, but otherwise thanks!
Hey @jmoller93, yes. Unfortunately, you can't fetch items at random and have it be fast.
Otherwise, we have built-in sub-sampling, but you don't control which section is sampled.
Feel free to make a PR to add a note to the README.
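For anyone landing here with the same goal: since the dataset is iterable, one way to get a uniform random subsample without index access is reservoir sampling over a single sequential pass. The sketch below is plain Python and does not use any StreamingDataset API. The function name `reservoir_sample` and its parameters are hypothetical, and it assumes the `k` sampled items fit in memory.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from an iterable.

    Hypothetical helper (not part of any library): it reads the stream
    once, in order, so it never needs random access by index.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: any iterable works, e.g. a dataset object you can loop over.
sample = reservoir_sample(range(1000), k=10, seed=0)
```

The trade-off is that this still reads every item once, so it is a fit for building a fixed subsample up front rather than for repeated random lookups.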
🐛 Bug
Every once in a while I get a TreeSpec error where the data contains 0 leaves despite the schema asking for 8. This happens stochastically and does not seem to be specific to the dataset. I can reproduce the error consistently with a blank datapoint, but it also happens occasionally when all the data is present. I've even confirmed that the index it is trying to grab has all of its data and can be fetched with the same function immediately after the failure.
Environment detail
- PyTorch Version (e.g., 1.0): 2.3
- OS (e.g., Linux): Linux
- How you installed PyTorch (`conda`, `pip`, source): pip
- Python version: 3.11.9
- CUDA/cuDNN version: 12.4
- GPU models and configuration: NVIDIA Tesla T4