Hi! Thanks for your contribution, great first issue!
When I use compression="zstd", the size is still larger than the original size (e.g., 2.9G --> 3.5G). Any idea?
Hey @hiyyg, can you provide more information on what you are doing? Otherwise, these kinds of issues aren't helpful for anyone. Can you share a code sample to reproduce the issue and the index.json file associated with the chunks?
Most likely, you are storing the data in an unoptimized format, which makes it take much more space than necessary. Example:
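(A minimal sketch of what that can look like, assuming the documented litdata.optimize API; the sample shape and counts are illustrative, not taken from the reporter's code.)

```python
import numpy as np
from litdata import optimize

def make_sample(index):
    # A raw uint16 depth map: 480 * 640 * 2 bytes, i.e. ~0.6MB per sample,
    # stored byte-for-byte in the chunks unless it is encoded first.
    depth = np.random.randint(0, 65535, size=(480, 640), dtype=np.uint16)
    return {"index": index, "depth": depth}

if __name__ == "__main__":
    optimize(
        fn=make_sample,
        inputs=list(range(1000)),
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",
    )
```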
Because uint16 depth data cannot be saved in JPEG format, I save the original data directly as numpy arrays. I would expect the optimized dataset to be roughly the same size as the original data, but it is much larger. Could you elaborate on what litdata adds to the original data that makes it so much larger?
Hey @hiyyg. Can you share a reproducible code snippet with synthetic data? Alternatively, you can convert your data to bytes directly, so LitData doesn't try anything clever and you retain full control.
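A minimal sketch of the bytes approach, assuming the reader knows the fixed shape and dtype (that bookkeeping is up to you, not LitData):

```python
import numpy as np

def make_sample(index):
    depth = np.random.randint(0, 65535, size=(480, 640), dtype=np.uint16)
    # Hand LitData opaque bytes: they are stored as-is, with no
    # serializer-specific handling. Shape/dtype are fixed and known here.
    return {"index": index, "depth": depth.tobytes()}

# Pass make_sample to optimize() as in the sketch above; on the read side,
# reconstruct the array with:
#   np.frombuffer(sample["depth"], dtype=np.uint16).reshape(480, 640)
```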
I tried using pickle.dumps() to convert them to bytes, but the saved dataset size remained the same.
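That is expected: a pickled numpy array is essentially the raw buffer plus a small header, so pickling alone cannot shrink anything. A quick check (illustrative numbers, not from this thread):

```python
import pickle
import numpy as np

depth = np.random.randint(0, 65535, size=(480, 640), dtype=np.uint16)
print(depth.nbytes)              # 614400: the raw buffer
print(len(pickle.dumps(depth)))  # ~614500: raw buffer plus a small pickle header
```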
Hey @hiyyg, can you share a reproducible code snippet with synthetic data? I can't help you otherwise.
With no compression, my 54MB data file was saved as 1.6GB! Introducing compression="zstd" reduced the size to 16MB. However, compression seems to be an undocumented feature. Could you please document it and share recommendations?
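For anyone landing here, a minimal sketch of enabling it, assuming the optimize/StreamingDataset API from the README; paths and sizes are placeholders:

```python
from litdata import optimize, StreamingDataset

def make_sample(index):
    return {"index": index, "payload": bytes(1024)}

if __name__ == "__main__":
    # Chunks are compressed with zstd as they are written.
    optimize(
        fn=make_sample,
        inputs=list(range(100)),
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",
        compression="zstd",
    )

    # Decompression should be transparent when streaming the dataset back.
    dataset = StreamingDataset("my_optimized_dataset")
    print(len(dataset), dataset[0]["index"])
```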
Hey @AugustDev. Do you want to make a PR and update the README to document compression support?
Hey @AugustDev. I updated the README to include compression. Let me know if I can close this issue.
Is this normal? How can we compress the dataset during dataset optimization, or at least keep it at the original size?