Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0

Magic FileSerializer is causing issues #178

Closed tchaton closed 2 months ago

tchaton commented 3 months ago

Some users have hit an issue where they return a filepath inside their sample, and litdata automatically detects that it is a valid file and reads its content.

This made the code convenient, but users unaware of this behaviour were seeing far more chunks created than expected.

Example:

# Create a 50 MB dummy file first (shell):
#   fallocate -l 50MB gentoo_root.img
from litdata import optimize

def fn(filepath):
    return filepath

optimize(
    fn=fn,
    inputs=["gentoo_root.img" for _ in range(10)],
    output_dir="./data",
    chunk_bytes="64MB",
    num_workers=1,
)

Each sample stores the entire 50 MB file, so you end up with 10 chunks, each holding a full copy of the file, instead of 1 chunk with 10 samples that each contain only the filepath.
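The chunk blow-up follows from simple arithmetic: with chunk_bytes="64MB", only one 50 MB payload fits per chunk, whereas thousands of short filepath strings would fit. A back-of-the-envelope check (the variable names are illustrative, not litdata internals):

```python
# How many chunks result when each sample embeds the whole file
# versus when each sample stores only its path?
sample_file_bytes = 50 * 1_000_000   # fallocate -l 50MB
chunk_limit_bytes = 64 * 1_000_000   # chunk_bytes="64MB"
num_samples = 10

# Embedding file contents: only 1 sample fits under the 64 MB cap
samples_per_chunk = max(1, chunk_limit_bytes // sample_file_bytes)
chunks_with_file_contents = -(-num_samples // samples_per_chunk)  # ceil division

# Storing just the path: all 10 samples easily fit in one chunk
path_bytes = len("gentoo_root.img".encode())
samples_per_chunk_paths = chunk_limit_bytes // path_bytes
chunks_with_paths = -(-num_samples // samples_per_chunk_paths)

print(chunks_with_file_contents)  # 10
print(chunks_with_paths)          # 1
```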

github-actions[bot] commented 3 months ago

Hi! Thanks for your contribution, great first issue!

deependujha commented 3 months ago

I think this is because of the code in the src/litdata/streaming/serializers.py file.

class StringSerializer(Serializer):
    # ...
    def can_serialize(self, data: str) -> bool:
        return isinstance(data, str) and not os.path.isfile(data)

class FileSerializer(Serializer):
    # ...
    def can_serialize(self, data: Any) -> bool:
        return isinstance(data, str) and os.path.isfile(data)

class VideoSerializer(Serializer):
    # ...
    def can_serialize(self, data: Any) -> bool:
        return isinstance(data, str) and os.path.isfile(data) and any(data.endswith(ext) for ext in self._EXTENSIONS)

StringSerializer's not os.path.isfile(data) check is the main culprit here.
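The order of the registry matters because serializers are tried first-match. A minimal sketch (mock classes, not litdata's actual code) of why a string that happens to name an existing file falls through StringSerializer into FileSerializer:

```python
import os
import tempfile
from collections import OrderedDict

class StringSerializer:
    def can_serialize(self, data):
        # Declines any string that is also an existing file path
        return isinstance(data, str) and not os.path.isfile(data)

class FileSerializer:
    def can_serialize(self, data):
        # Accepts exactly the strings StringSerializer declined
        return isinstance(data, str) and os.path.isfile(data)

# First-match dispatch over an ordered registry, mirroring _SERIALIZERS
serializers = OrderedDict(str=StringSerializer(), file=FileSerializer())

def pick(data):
    for name, serializer in serializers.items():
        if serializer.can_serialize(data):
            return name
    return "fallback"

print(pick("no_such_file.txt"))      # "str": plain string, no file on disk
with tempfile.NamedTemporaryFile() as f:
    print(pick(f.name))              # "file": same type, but the file exists
```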


My approach

The optimizer will take one more parameter, magic_serializer: bool (default True), which will be passed down to the serializers. The updated code for StringSerializer will be:

class StringSerializer(Serializer):
    # ...
    def can_serialize(self, data: str) -> bool:
        return isinstance(data, str) and not (magic_serializer and os.path.isfile(data))

Doing it only for StringSerializer would be sufficient, since _SERIALIZERS is an OrderedDict and the subsequent serializers won't be called. Still, I'll do it for all the serializers that depend on a filepath, to make their can_serialize methods more robust.
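A minimal sketch of the proposed flag (the parameter name and the plumbing from optimize down to the serializer are hypothetical, per the proposal above): with the flag off, a string is always treated as a plain string, even if a file with that name exists.

```python
import os
import tempfile

class StringSerializer:
    def __init__(self, magic_serializer: bool = True):
        # Hypothetical flag, assumed to be passed down from optimize(...)
        self.magic_serializer = magic_serializer

    def can_serialize(self, data) -> bool:
        # With magic detection disabled, the isfile() check is skipped
        # entirely, so filepaths are serialized as ordinary strings.
        return isinstance(data, str) and not (self.magic_serializer and os.path.isfile(data))

with tempfile.NamedTemporaryFile() as f:
    print(StringSerializer(magic_serializer=False).can_serialize(f.name))  # True
    print(StringSerializer(magic_serializer=True).can_serialize(f.name))   # False
```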


@tchaton please give your feedback on the approach.

tchaton commented 3 months ago

Hey @deependujha, I would do something different.

I would remove the FileSerializer from the optimize serializers (so it isn't used for any new optimized dataset) but keep it for deserialization for backward compatibility (in the StreamingDataset serializers).

deependujha commented 2 months ago

I just modified FileSerializer's can_serialize:

class FileSerializer(Serializer):
    # ....
    def can_serialize(self, data: Any) -> bool:
        # return isinstance(data, str) and os.path.isfile(data)

        print("FileSerializer will be removed in the future.")
        return False

and left everything else as it is. Now, if fn returns a string and a file with that name exists, it'll be serialized by PickleSerializer. (PickleSerializer serializes the string, not the file.)

By doing so, it's backward compatible. If an existing optimized dataset exists, it can be deserialized using FileSerializer, but for any new dataset, FileSerializer can't be used, as it'll return False in all cases.
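The backward-compatibility split can be sketched with mock classes (not litdata's actual implementation, and the serialize/deserialize bodies are illustrative): can_serialize always declines, so no new dataset ever selects FileSerializer, while deserialize still restores chunks written before the change.

```python
import os
import tempfile

class FileSerializer:
    def can_serialize(self, data) -> bool:
        # Never chosen when writing new optimized datasets
        print("FileSerializer will be removed in the future.")
        return False

    def serialize(self, data: str) -> bytes:
        # Legacy write path: embed the raw file bytes (shown for context)
        with open(data, "rb") as f:
            return f.read()

    def deserialize(self, raw: bytes) -> bytes:
        # Still invoked when reading datasets optimized before this change
        return raw

fs = FileSerializer()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"legacy bytes")
    path = f.name

print(fs.can_serialize(path))              # False: new datasets skip it
print(fs.deserialize(fs.serialize(path)))  # b'legacy bytes': old chunks still read
os.unlink(path)
```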

Is this what you wanted? Or have I got something wrong?

tchaton commented 2 months ago

Hey @deependujha Smart ;) Let's do that.