Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0

Optimized dataset size is > 2x larger than the original dataset size. #261

Closed hiyyg closed 2 months ago

hiyyg commented 2 months ago

Is this normal? How could we compress the dataset during dataset optimization or at least keep the original size?

github-actions[bot] commented 2 months ago

Hi! Thanks for your contribution, great first issue!

hiyyg commented 2 months ago

Any idea?

hiyyg commented 2 months ago

When I use compression="zstd", the size is still larger than the original size (e.g., 2.9G --> 3.5G).
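For reference, the compression setting in question is passed to `litdata.optimize`. Below is a sketch of such a conversion; `make_sample` and the output path are illustrative, the exact `optimize` signature may vary by litdata version, and the call is wrapped in `convert()` so nothing runs as a side effect:

```python
import numpy as np


def make_sample(index):
    """Illustrative sample fn: a synthetic uint16 depth map per index."""
    depth = np.full((480, 640), index, dtype=np.uint16)
    return {"index": index, "depth": depth}


def convert():
    """Run the optimization; requires the litdata package to be installed."""
    from litdata import optimize

    optimize(
        fn=make_sample,
        inputs=list(range(100)),
        output_dir="optimized_depth",  # illustrative output path
        chunk_bytes="64MB",
        compression="zstd",  # the setting discussed in this thread
    )
```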

tchaton commented 2 months ago

Hey @hiyyg, can you provide more information on what you are doing? Otherwise, these kinds of issues aren't helpful for anyone.

Can you provide a code sample to reproduce the issue, and the index.json file associated with the chunks?

Most likely, you are storing the data in an un-optimized format, leading to it taking more space.

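A minimal, self-contained illustration of this point: on-disk size is dominated by how each sample is encoded, so storing the same values as float64 takes 4x the bytes of uint16, and serializers add their own header on top of the raw buffer (the array shape here is illustrative):

```python
import pickle

import numpy as np

# The same 480x640 depth map stored at two different precisions.
depth_u16 = np.zeros((480, 640), dtype=np.uint16)
depth_f64 = depth_u16.astype(np.float64)

print(depth_u16.nbytes)  # 614400 bytes (2 bytes per pixel)
print(depth_f64.nbytes)  # 2457600 bytes (8 bytes per pixel): 4x larger

# Serialization adds a small fixed overhead on top of the raw buffer.
print(len(pickle.dumps(depth_u16)) - depth_u16.nbytes)
```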

hiyyg commented 2 months ago

Because uint16 depth data cannot be saved in JPEG format, I directly save the original data as NumPy arrays. I would expect the optimized data not to be much larger than the original, but that is not the case. Could you elaborate on what litdata adds to the original data that makes it so much larger?

tchaton commented 2 months ago

Hey @hiyyg. Can you share a reproducible code snippet with synthetic data? Alternatively, you can convert your data to bytes directly, so LitData doesn't try anything clever and you retain full control.
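Converting an array to raw bytes and back is a one-liner in each direction; the dtype and shape just have to be tracked out-of-band (a sketch using only NumPy; the array contents are synthetic):

```python
import numpy as np

# Synthetic uint16 depth map (values wrapped into the uint16 range).
depth = (np.arange(480 * 640) % 65536).astype(np.uint16).reshape(480, 640)

# Store the raw buffer; the consumer then sees plain bytes, stored as-is.
raw = depth.tobytes()

# Reading back requires knowing the dtype and shape separately
# (e.g. fixed by convention, or stored alongside the bytes).
restored = np.frombuffer(raw, dtype=np.uint16).reshape(480, 640)

assert np.array_equal(depth, restored)
```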

hiyyg commented 2 months ago

I tried using pickle.dumps() to save them as bytes, but the saved dataset size remained the same.

tchaton commented 2 months ago

Hey @hiyyg, can you share a reproducible code snippet with synthetic data? I can't help you otherwise.

AugustDev commented 2 months ago

When I was using no compression, a 54 MB data file was saved as 1.6 GB! Introducing compression="zstd" reduced the size to 16 MB. However, compression seems to be an undocumented feature. Could you please document it and share recommendations?
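Both the blow-up and the recovery are plausible for repetitive data. A self-contained demonstration of the effect, using `zlib` from the standard library as a stand-in for zstd (which needs an extra package); the array is synthetic:

```python
import zlib

import numpy as np

# Highly repetitive data compresses by orders of magnitude.
data = np.zeros((1000, 1000), dtype=np.float64)  # 8 MB of zeros
raw = data.tobytes()

compressed = zlib.compress(raw)
ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.0f}x")

assert len(compressed) < len(raw) // 100
```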

tchaton commented 2 months ago

Hey @AugustDev. Do you want to make a PR updating the README to document compression support?

tchaton commented 2 months ago

Hey @AugustDev. I updated the README to include compression. Let me know if I can close this issue.