Closed · dangthatsright closed this 1 month ago
Hi! Thanks for your contribution, great first issue!
Hey @dangthatsright,
Nice, option 1 sounds good. Feel free to make a PR to fix it.
I think this is what was happening in my issue from before (https://github.com/Lightning-AI/litdata/issues/388). Good fix!
A simple fix would be to pad `chunk_bytes` with the chunk header size in the reader, the same way it is done in the writer, so it stays backward compatible.
```python
num_items = np.uint32(len(items))
sizes = list(map(len, items))
# per-sample offsets, shifted by the header size (num_items + the offsets themselves)
offsets = np.array([0] + sizes).cumsum().astype(np.uint32)
offsets += len(num_items.tobytes()) + len(offsets.tobytes())
sample_data = b"".join([item.data for item in items])
# the serialized chunk is the header (num_items + offsets) followed by the raw sample bytes
data = num_items.tobytes() + offsets.tobytes() + sample_data
```
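For reference, a minimal sketch of what that padding could look like on the reader side, assuming the writer layout above. `chunk_bytes` and `chunk_size` are taken from the chunk index, and the helper names here are hypothetical, not the actual `item_loader` code:

```python
import os
import time

UINT32_BYTES = 4  # num_items and each offset are serialized as uint32 by the writer

def padded_chunk_bytes(chunk_bytes: int, chunk_size: int) -> int:
    # Header written by the writer: 1 uint32 for num_items + (chunk_size + 1) uint32 offsets.
    header_bytes = (1 + chunk_size + 1) * UINT32_BYTES
    return chunk_bytes + header_bytes

def wait_for_full_chunk(path: str, chunk_bytes: int, chunk_size: int, poll: float = 0.01) -> None:
    # Spin until the downloaded file reaches the full on-disk size (header included),
    # instead of only the payload size that chunk_bytes records today.
    target = padded_chunk_bytes(chunk_bytes, chunk_size)
    while not os.path.exists(path) or os.path.getsize(path) < target:
        time.sleep(poll)
```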
That's a great idea, thank you!
🐛 Bug
When writing chunks, `chunk_bytes` is calculated via https://github.com/Lightning-AI/litdata/blob/b9aa903bd9c98cd96ee989394fdaa1a38f8036f0/src/litdata/streaming/writer.py#L237, but the actual file size is larger, since the serialized data contains additional (potentially large) header metadata at the beginning.

When reading chunks, a separate thread downloads the chunks from the cloud while a while loop spins until the file size is larger than `chunk_bytes`, see https://github.com/Lightning-AI/litdata/blob/b9aa903bd9c98cd96ee989394fdaa1a38f8036f0/src/litdata/streaming/item_loader.py#L146.

This means there are edge cases where the file is still being downloaded but already exceeds `chunk_bytes`, because the complete file is larger than that value. The reader then thinks the file is ready and indexes into an offset that doesn't exist yet, leading to downstream errors.
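To make the mismatch concrete, here is a small back-of-the-envelope sketch (all numbers are illustrative) comparing the payload size that `chunk_bytes` records with the size of the file the writer actually produces:

```python
import numpy as np

# Illustrative chunk: 1_000 samples of ~65 KB each.
num_items = 1_000
sample_sizes = [65_000] * num_items

chunk_bytes = sum(sample_sizes)  # what the writer records: payload only (~65 MB)
header_bytes = (1 + num_items + 1) * np.dtype(np.uint32).itemsize  # num_items + offsets
actual_file_size = chunk_bytes + header_bytes

# The reader's wait loop only checks for chunk_bytes, so it can stop waiting while
# the final header_bytes worth of the file is still in flight, i.e. before the last
# samples have fully landed on disk.
print(chunk_bytes, actual_file_size, actual_file_size - chunk_bytes)
# 65000000 65004008 4008
```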
To Reproduce

Since this is non-deterministic and involves large data, I don't have code, but I can outline my scenario: create large chunks (I'm using the default of 64 MB) and then index the last data point of each chunk (I have > 100 chunks), and you'll most likely hit this issue. My guess is that with even larger chunks containing a lot of data, as long as the offsets stored in the chunk header are sufficiently large (since they are not accounted for in the `chunk_bytes` info) and you index the last element, you'll probably see it too. A rough sketch of the scenario follows below.
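For illustration, a rough sketch of that scenario, assuming the `optimize` / `StreamingDataset` entry points from litdata's README; the sample sizes, output path, and indexing stride are made up:

```python
import os
import numpy as np
import litdata as ld

def make_sample(index: int):
    # Roughly 1 MB of random bytes per sample (size is illustrative), so each
    # ~64 MB chunk holds many samples and the header offsets grow with it.
    return {"index": index, "payload": np.random.bytes(1_000_000)}

if __name__ == "__main__":
    # Write well over 100 chunks of the default ~64 MB size (output location is hypothetical).
    ld.optimize(
        fn=make_sample,
        inputs=list(range(10_000)),
        output_dir="s3://my-bucket/optimized",
        chunk_bytes="64MB",
        num_workers=os.cpu_count(),
    )

    # Stream it back with a cold cache and hit samples near the end of each chunk;
    # per the report, indexing the last item of a chunk is what races the download.
    ds = ld.StreamingDataset("s3://my-bucket/optimized")
    for i in range(len(ds) - 1, -1, -64):  # roughly one access per chunk, back to front
        _ = ds[i]
```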
Expected behavior

This should work. I am happy to make a PR but am unsure which direction to pursue. Several ideas:

1. Change `chunk_bytes` to be the actual file size rather than just the size of the data points. This is obviously the easiest, but I'm not sure if this info is used somewhere else.
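For context, a minimal sketch of what option 1 could look like on the writer side, reusing the serialization quoted in the comment above (`serialize_chunk` is a hypothetical helper, not the actual writer method):

```python
import numpy as np

def serialize_chunk(items) -> tuple[bytes, int]:
    """Serialize a chunk and report its full on-disk size as chunk_bytes.

    Mirrors the writer snippet quoted above; `items` are assumed to expose
    `len(item)` and `item.data` as in that snippet.
    """
    num_items = np.uint32(len(items))
    sizes = list(map(len, items))
    offsets = np.array([0] + sizes).cumsum().astype(np.uint32)
    offsets += len(num_items.tobytes()) + len(offsets.tobytes())
    sample_data = b"".join([item.data for item in items])
    data = num_items.tobytes() + offsets.tobytes() + sample_data

    # Option 1: record the size of the whole serialized chunk (header included)
    # instead of only sum(sizes), so the reader's size check matches the file.
    chunk_bytes = len(data)
    return data, chunk_bytes
```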