allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.61k stars 651 forks

an option to download only added files for a given dataset version #1100

Open kirillfish opened 1 year ago

kirillfish commented 1 year ago

Until now, calling the get_local_copy method on a clearml.Dataset object downloaded all files from the dataset and, recursively, from all of its parents, or created soft links for files that had been downloaded previously. There was no way to get only the files added in this particular version of the dataset, ignoring all parents. This small PR implements exactly that.
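To make the intended behaviour concrete, here is a minimal toy model of it. It is only a sketch and does not use ClearML or mirror its internals: `ToyVersion`, `files_to_fetch` and the `ignore_parents` flag are hypothetical names invented for illustration, and it assumes each version records just the files it stores itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyVersion:
    """Hypothetical stand-in for a dataset version (not a ClearML class)."""
    name: str
    own_files: Dict[str, str]  # relative path -> content hash stored in this version
    parents: List["ToyVersion"] = field(default_factory=list)


def files_to_fetch(version: ToyVersion, ignore_parents: bool = False) -> Dict[str, str]:
    """Return the path -> hash mapping a local copy would contain."""
    if ignore_parents:
        # Only the files this version stores itself.
        return dict(version.own_files)
    merged: Dict[str, str] = {}
    for parent in version.parents:
        # Walk the ancestry recursively; later entries override earlier ones.
        merged.update(files_to_fetch(parent))
    merged.update(version.own_files)
    return merged


parent = ToyVersion("parent", {"a.csv": "h1", "b.csv": "h2"})
child = ToyVersion("child", {"c.csv": "h3"}, parents=[parent])

full_copy = files_to_fetch(child)
own_only = files_to_fetch(child, ignore_parents=True)
print(sorted(full_copy))
print(sorted(own_only))
```

In this model the default call returns the parent's files plus the child's, while the flagged call returns only what the child itself stores.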

Testing Instructions

  1. Register a parent dataset and add some files
  2. Register a child dataset, inherited from the first one, and add some more files
  3. Use the only_added argument (False by default):
    from clearml import Dataset
    dataset = Dataset.get(dataset_name='child')
    data_base_dir = dataset.get_local_copy(only_added=True)

    data_base_dir will then contain only the files returned by list_added_files.

Other Information

Potential issue:

Soft links are still created for files in the diff that have already been downloaded. This is OK.

But if you first call child.get_local_copy(only_added=True) and then call child.get_local_copy(), the second call does not create soft links for the existing files and downloads the diff again, which is probably not OK. The same applies to "grandchildren" datasets. I am still figuring out why. On the other hand, this could be acceptable if we assume the only_added=True flag is meant only for debugging or for quickly inspecting datasets.

eugen-ajechiloae-clearml commented 1 year ago

I believe that this will also download the modified files (which is good), but maybe the name only_added is not appropriate. How about ignore_parent_datasets?
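The naming concern can be illustrated with a small standalone sketch. It assumes, purely for illustration, that a version's own entries and the entries inherited from its parents are both available as path-to-hash mappings; `split_own_entries` is a hypothetical helper, not a ClearML function.

```python
from typing import Dict, Set, Tuple


def split_own_entries(
    own: Dict[str, str], inherited: Dict[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Split a version's own entries into newly added and modified paths."""
    # Added: paths that no parent has.
    added = {path for path in own if path not in inherited}
    # Modified: paths a parent has, but with a different content hash.
    modified = {
        path for path, digest in own.items()
        if path in inherited and inherited[path] != digest
    }
    return added, modified


inherited = {"a.csv": "h1", "b.csv": "h2"}
own = {"b.csv": "h2-changed", "c.csv": "h3"}

added, modified = split_own_entries(own, inherited)
print(added, modified)
```

Under these assumptions, skipping the parents yields both groups, which is why a name describing what is skipped fits better than one describing only added files.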

kirillfish commented 1 year ago

@eugen-ajechiloae-clearml done, I renamed it