Unifying Datasets and experiments naming

allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

https://clear.ml/docs

Apache License 2.0

Unifying Datasets and experiments naming #1042

Open abfshaal opened 1 year ago

abfshaal commented 1 year ago

Proposal Summary

The idea of this feature is to unify the naming of datasets and projects by initialising the task from a dataset ID. Once the dataset ID has been used to initialise the project, a project with the same name as the dataset would be created and run, instead of specifying the dataset and starting the task project as two separate steps. For example, if a dataset is named Cat_videos_4k, using that dataset would initialise a project with that name, and the task names could then be extended with whatever the model is, e.g. Cat_videos_4k/Yolov8 and so on.

Motivation

Save steps when using datasets with tasks. Ensure that datasets and experiments created using those datasets are easily identifiable.

ainoam commented 1 year ago

Thanks for suggesting @abfshaal.

Could you provide a code snippet of how you imagine this interface working? How would task initialization become aware of the model and dataset to be used?

abfshaal commented 1 year ago

Hi @ainoam,

The way I would imagine the code snippet looking is the following:

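from clearml import TaskFromData  # proposed interface, not an existing clearml class

# keyword unpacking has to come after the explicit task_name argument
task, data = TaskFromData(task_name=task_name, **kwargs_for_dataset)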

In the TaskFromData class (or any other class name), the code would use the Dataset class to get the data and the Task class to initialise the task, using the dataset name as the project and task_name as the task name. The class init would contain the following:

data = Dataset.get(**kwargs_for_dataset)  # look up the dataset first
task = Task.init(project_name=data.name, task_name=task_name)  # project named after the dataset
return task, data

This is mainly for training tasks, as it would streamline linking the dataset to the training projects it has been used in. I honestly have not thought about how model usage would fit in here.

jax79sg commented 1 year ago

It's not obvious, but this is already achievable by adding the alias parameter to the Dataset.get() call. https://clear.ml/docs/latest/docs/clearml_data/data_management_examples/data_man_cifar_classification/#using-the-dataset
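As a rough illustration of the alias approach described above, the sketch below initialises a task and then fetches a dataset with an alias, so that the dataset ID is logged against the task. The function name and all argument values are hypothetical; only Task.init, Dataset.get and its alias parameter come from ClearML itself.

```python
def start_task_with_dataset(project_name, task_name, dataset_project, dataset_name, alias):
    """Hypothetical helper: initialise a task, then fetch a dataset inside it."""
    # Imported lazily so the sketch can be read and exercised with stand-ins
    # on a machine without a configured ClearML installation.
    from clearml import Dataset, Task

    task = Task.init(project_name=project_name, task_name=task_name)
    # Passing `alias` asks ClearML to record the dataset ID on the current task.
    dataset = Dataset.get(
        dataset_project=dataset_project,
        dataset_name=dataset_name,
        alias=alias,
    )
    return task, dataset
```

Calling it would look something like start_task_with_dataset("examples", "Yolov8", "datasets", "Cat_videos_4k", alias="cat_videos"), where every value is a placeholder.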

abfshaal commented 1 year ago

Apologies for my late replies, I was away from my laptop for a while. This does store the dataset ID in its own nice way, and can definitely work for many use cases. However, I am looking for a way to use names instead of IDs or links to navigate to the origin of the data. I believe this would make it easier to see where the data came from directly from the experiment.

I also realised that there is a get_output_log_web_page method for the task but no equivalent for the dataset. Is there a plan to add that? I would be happy to add it myself and contribute to clearml in that regard.

ainoam commented 1 year ago

@abfshaal

The intended use of ClearML datasets in model training (or any other ClearML task logic) is to call Dataset.get() within the context of an initialized task (which, as @jax79sg noted, would provide a convenient trace).

Organizing your ClearML projects according to your dataset name is slightly too specific a use case for ClearML to provide an interface for, but you can definitely create your own wrapper on top if that is what fits your use case.
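A wrapper of the kind suggested here could be quite small. The sketch below follows the order of the snippet earlier in the thread (fetch the dataset, then start a task in a project named after it). The function task_from_dataset and its dataset_get / task_init hooks are hypothetical names introduced for illustration, not part of ClearML; the hooks exist only so the helper can be tried with stand-ins. Because the dataset is fetched before the task exists, this ordering may not record the alias trace mentioned above.

```python
from typing import Any, Callable, Optional, Tuple


def task_from_dataset(
    task_name: str,
    dataset_get: Optional[Callable[..., Any]] = None,
    task_init: Optional[Callable[..., Any]] = None,
    **dataset_kwargs: Any,
) -> Tuple[Any, Any]:
    """Hypothetical helper: start a task in a project named after a dataset."""
    if dataset_get is None or task_init is None:
        # Fall back to the real ClearML calls only when no stand-in is given.
        from clearml import Dataset, Task

        dataset_get = dataset_get or Dataset.get
        task_init = task_init or Task.init

    # Look up the dataset first, because its name becomes the project name.
    data = dataset_get(**dataset_kwargs)
    task = task_init(project_name=data.name, task_name=task_name)
    return task, data
```

With ClearML installed and configured, a call such as task_from_dataset("Yolov8", dataset_name="Cat_videos_4k") would be expected to place the task in a project called Cat_videos_4k; the names are taken from the example in the proposal.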

> I also realised that there is a get_output_log_web_page method for the task but no equivalent for the dataset. Is there a plan to add that? I would be happy to add it myself and contribute to clearml in that regard.

That's a great idea. We'd love a PR if you're up for it 🙂