Closed tchaton closed 2 months ago
Hi! thanks for your contribution!, great first issue!
I think this is bcoz of the code in src/litdata/streaming/serializers.py
file.
class StringSerializer(Serializer):
# ...
def can_serialize(self, data: str) -> bool:
return isinstance(data, str) and not os.path.isfile(data)
class FileSerializer(Serializer):
# ...
def can_serialize(self, data: Any) -> bool:
return isinstance(data, str) and os.path.isfile(data)
class VideoSerializer(Serializer):
# ....
def can_serialize(self, data: Any) -> bool:
return isinstance(data, str) and os.path.isfile(data) and any(data.endswith(ext) for ext in self._EXTENSIONS)
StringSerializer's not os.path.isfile(data)
is the main culprit here.
Optimizer
will have one more parameter magic_serializer: bool (default = True)
.
And it'll be passed down to serializer. And the updated code for StringSerializer
will be:
class StringSerializer(Serializer):
# ...
def can_serialize(self, data: str) -> bool:
+ return isinstance(data, str) and not (magic_serializer and os.path.isfile(data))
Doing it only for StringSerializer
will be sufficient, as _SERIALIZERS
is an orderedDict, and subsequent ones won't be called. But, still I'll do it for all the serializer that depend on filepath to make the can_serialize method more robust.
@tchaton please give your feedback on the approach.
Hey @deependujha, I would do something different.
I would remove the FileSerializer
from optimize serializers (so it isn't used for any new optimized dataset) but keep it for deserialization for backward compatibility ( in the StreamingDataset serializers )
I just modified FileSerializer can_serialize
:
class FileSerializer(Serializer):
# ....
def can_serialize(self, data: Any) -> bool:
# return isinstance(data, str) and os.path.isfile(data)
print("FileSerializer will be removed in the future.")
return False
and, left everything as it is. Now, in case a string is returned by the fn and a file exists with this name, it'll be serialized by PickleSerializer
. (PickelSerializer serializes string, not the file.)
By doing so, it's backward compatible. If an existing optimized dataset exists, it can be deserialized using FileSerializer
, but for any new dataset, FileSerializer
can't be used, as it'll return False in all cases.
Is this what you wanted? Or have I got something wrong.
Hey @deependujha Smart ;) Let's do that.
Some users have faced issues where they provide a filepath inside their sample and litdata automatically detects it is a valid file and read its content.
This made the code nice but unfortunately, some user un-aware of the behaviours were seeing too many chunks being created.
Example:
Each sample will store the entire 50MB file, so you endup with 10 chunks with 10 times the entire file instead of 1 chunk with 10 samples containing the filepath .