Until now, calling the get_local_copy method on a clearml.Dataset object downloaded all files from the dataset and, recursively, from all of its parents, or created soft links for files that had already been downloaded. There was no way to get only the files added in this particular version of the dataset while ignoring the parents. This small PR implements exactly that feature.
Testing Instructions
Register a parent dataset and add some files
Register a child dataset, inherited from the first one, and add some more files
Use the only_added argument (False by default):
from clearml import Dataset
dataset = Dataset.get(dataset_name='child')
data_base_dir = dataset.get_local_copy(only_added=True)
data_base_dir will contain only the files returned by list_added_files.
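The last check can be automated instead of inspected by eye. Below is a minimal sketch, assuming list_added_files returns paths relative to the dataset root with forward slashes. compare_with_added is a hypothetical helper written for this description; it is not part of clearml.

```python
from pathlib import Path


def compare_with_added(base_dir, added_files):
    """Compare the files found under base_dir with the expected relative paths.

    Returns a dict with two sorted lists:
      "missing"    - expected paths that are not on disk
      "unexpected" - paths on disk that were not expected
    """
    base = Path(base_dir)
    on_disk = {
        p.relative_to(base).as_posix()
        for p in base.rglob("*")
        # Count regular files and soft links (even broken ones), skip directories.
        if p.is_file() or (p.is_symlink() and not p.is_dir())
    }
    expected = {Path(f).as_posix() for f in added_files}
    return {
        "missing": sorted(expected - on_disk),
        "unexpected": sorted(on_disk - expected),
    }
```

With the dataset from the steps above, this would be called as compare_with_added(data_base_dir, dataset.list_added_files()); both lists should be empty if only_added=True behaves as described.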
Other Information
Potential issue:
Soft links are still created for files in the diff that have already been downloaded, which is fine.
But if you first call child.get_local_copy(only_added=True) and then call child.get_local_copy(), the second call does not create soft links for the existing files and downloads the diff again, which is probably not OK. The same applies to "grandchildren" datasets. I am still figuring out why. On the other hand, this could be acceptable if we assume the only_added=True flag is meant only for debugging or for quickly inspecting datasets.
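One way to observe this behaviour is to count soft links versus regular files in the directory returned by each call. The sketch below is only a diagnostic aid and assumes the local copy really uses per-file soft links as described above. link_stats is a hypothetical helper, not a clearml function.

```python
import os


def link_stats(base_dir):
    """Count soft links and regular files under base_dir (directories are not counted)."""
    symlinks, regular_files = 0, 0
    # os.walk does not follow directory symlinks by default, so nothing is counted twice.
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                symlinks += 1
            else:
                regular_files += 1
    return {"symlinks": symlinks, "regular_files": regular_files}
```

Comparing link_stats(child.get_local_copy(only_added=True)) with link_stats(child.get_local_copy()) should show whether the second call reused the files from the first one or fetched them again.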
I believe that this will also download the modified files (which is good), but maybe the name only_added is not appropriate. How about ignore_parent_datasets?