allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

`Dataset.add_files` doesn't support multiple wildcards, although its documentation states otherwise #609

Open antifriz opened 2 years ago

antifriz commented 2 years ago

This is what `add_files` looks like at the time of writing:

def add_files(
    self,
    path,  # type: Union[str, Path, _Path]
    wildcard=None,  # type: Optional[Union[str, Sequence[str]]]
    local_base_folder=None,  # type: Optional[str]
    dataset_path=None,  # type: Optional[str]
    recursive=True,  # type: bool
    verbose=False  # type: bool
):

However, the docstring states that `wildcard` can be a list of strings (wildcards):

"""
...
:param wildcard: add only specific set of files.
            Wildcard matching, can be a single string or a list of wildcards)
...
"""

This isn't actually supported, because `Path.glob` and `Path.rglob` each take a single pattern string as their first argument, not a list of strings. See here.
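
To illustrate the limitation, here is a minimal sketch of how several wildcards could be matched with `pathlib` by globbing once per pattern and merging the results. The helper name `glob_many` is hypothetical and is not part of ClearML; this is not how the library implements it.

```python
from pathlib import Path


def glob_many(root, wildcard, recursive=True):
    """Return the sorted, de-duplicated files under `root` matching any wildcard.

    `wildcard` may be a single pattern string or a sequence of patterns.
    Hypothetical helper for illustration only.
    """
    # Normalize a single string to a one-element list, since
    # Path.glob / Path.rglob accept exactly one pattern per call.
    patterns = [wildcard] if isinstance(wildcard, str) else list(wildcard)
    root = Path(root)
    finder = root.rglob if recursive else root.glob

    matched = set()
    for pattern in patterns:
        # One glob call per pattern; the set removes files matched twice.
        matched.update(p for p in finder(pattern) if p.is_file())
    return sorted(matched)
```

For example, `glob_many("data", ["*.txt", "*.csv"])` would collect both text and CSV files under a `data` directory in one call.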

The use case I'd like to see supported is a large root directory from which only a subset of files should be added to the dataset. I'd like to pass the list of wildcards and have a single call to `add_files` do the rest.
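
Until a list is accepted, one possible workaround is to call `add_files` once per wildcard. The sketch below assumes only the `path` and `wildcard` parameters shown in the signature above; the wrapper name `add_files_multi` is hypothetical, and `dataset` can be any object exposing a compatible `add_files` method.

```python
def add_files_multi(dataset, path, wildcards, **kwargs):
    """Call dataset.add_files once for each wildcard.

    Hypothetical wrapper for illustration; extra keyword arguments
    (e.g. recursive, verbose) are forwarded unchanged.
    """
    # Accept a single pattern string as well as a sequence of patterns.
    if isinstance(wildcards, str):
        wildcards = [wildcards]
    for pattern in wildcards:
        dataset.add_files(path=path, wildcard=pattern, **kwargs)
```

A call such as `add_files_multi(dataset, "/data/root", ["*.jpg", "*.json"])` would then issue two `add_files` calls against the same root directory.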

erezalg commented 2 years ago

Thanks for catching that :) We'll make sure to fix this!

erezalg commented 2 years ago

Hello @antifriz, we've just released clearml 1.4.0, which fixes this issue. Let us know if it works as expected!