Open abfshaal opened 1 year ago
Thanks for suggesting @abfshaal.
Could you provide a code snippet of how you imagine this interface working? How would task initialization become aware of the model and dataset to be used?
Hi @ainoam,
The way I imagine the code snippet looking is the following:
from clearml import TaskFromData
task, data = TaskFromData(task_name=task_name, **kwargs_for_dataset)
In the TaskFromData class (or whatever it ends up being called), the code would use the Dataset class to get the data and the Task class to initialise the task, using the dataset project and task_name. The class init would contain the following:
data = Dataset.get(**kwargs_for_dataset)
task = Task.init(project_name=data.name, task_name=task_name)
return task, data
This is mainly for training tasks, as it would streamline linking the dataset to the training projects it has been used in. I honestly have not thought about how model usage would fit in here.
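For anyone who wants to try the idea before any interface exists, here is a minimal sketch of the snippet above written as a plain function. The name task_from_dataset and the get_dataset / init_task parameters are hypothetical and not part of clearml; they only exist so the helper can be exercised with stand-ins. By default it falls back to the real Dataset.get and Task.init. Note that it fetches the dataset before the task exists, exactly as in the snippet, so it makes no promise about the dataset being recorded on the task.

```python
from typing import Any, Callable, Optional, Tuple


def task_from_dataset(
    task_name: str,
    get_dataset: Optional[Callable[..., Any]] = None,
    init_task: Optional[Callable[..., Any]] = None,
    **dataset_kwargs: Any,
) -> Tuple[Any, Any]:
    """Fetch a dataset, then start a task in a project named after it.

    Hypothetical helper: not part of the clearml package.
    """
    if get_dataset is None or init_task is None:
        # Imported lazily so the helper can be tried with stand-ins
        # on a machine where clearml is not installed.
        from clearml import Dataset, Task

        get_dataset = get_dataset or Dataset.get
        init_task = init_task or Task.init

    # Same two calls as in the proposed class init.
    data = get_dataset(**dataset_kwargs)
    task = init_task(project_name=data.name, task_name=task_name)
    return task, data
```

With clearml installed and configured, a call could look like task, data = task_from_dataset("Yolov8", dataset_name="Cat_videos_4k"), where both names are only examples.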
It's not very obvious, but this is already achievable by passing the alias parameter in the Dataset.get() call.
https://clear.ml/docs/latest/docs/clearml_data/data_management_examples/data_man_cifar_classification/#using-the-dataset
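As a rough illustration of that suggestion, the sketch below initialises a task and then fetches a dataset by project and name, passing an alias. The function name start_task_and_get_dataset and all argument values are placeholders chosen for this example. Only Task.init, Dataset.get (with dataset_project, dataset_name and alias) and get_local_copy are actual clearml calls.

```python
from typing import Any, Tuple


def start_task_and_get_dataset(
    project_name: str,
    task_name: str,
    dataset_project: str,
    dataset_name: str,
    alias: str,
) -> Tuple[Any, str]:
    """Start a task, then fetch a dataset inside it using an alias."""
    # Imported inside the function so this file loads without clearml.
    from clearml import Dataset, Task

    # Initialise the task first so the dataset lookup happens in its context.
    task = Task.init(project_name=project_name, task_name=task_name)

    # Look the dataset up by project and name rather than by ID,
    # and pass the alias as suggested in the reply above.
    dataset = Dataset.get(
        dataset_project=dataset_project,
        dataset_name=dataset_name,
        alias=alias,
    )

    # Download (or reuse a cached copy of) the dataset files.
    local_path = dataset.get_local_copy()
    return task, local_path
```

A call might look like start_task_and_get_dataset("Training", "Yolov8", "Datasets", "Cat_videos_4k", alias="cat_videos"); it needs a configured ClearML server and an existing dataset with that name.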
Apologies for my late replies, I was away from my laptop for a while. This does store the dataset ID in its own nice way, and can definitely work for many use cases. However, I am looking for a way to use names instead of IDs or links to navigate to the origin of the data. I believe this would make it easier to see, directly from the experiment, where the data came from.
I also realised that there is a get_output_log_webpage method for the task but there is not for the dataset, is there a plan to add that? I would be happy to also add it myself and contribute to clearml in that regard.
@abfshaal
The intended use of ClearML datasets in model training (or any other ClearML task logic) is to call Dataset.get() within the context of an initialized task (which, as @jax79sg noted, would provide a convenient trace).
Organizing your ClearML projects by dataset name is slightly too specific a use case for ClearML to provide an interface for, but you can definitely create your own wrapper on top if that is what fits your use case.
> I also realised that there is a get_output_log_webpage method for the task but there is not for the dataset, is there a plan to add that? I would be happy to also add it myself and contribute to clearml in that regard
That's a great idea. We'd love a PR if you're up for it 🙂
Proposal Summary
The idea of this feature is to unify the naming of datasets and projects by initialising the task from a dataset ID. Once the dataset ID has been used to initialise the task, a project with the same name as the dataset is created and run, instead of specifying the dataset and starting the task project as two separate steps. For example, if a dataset is named Cat_videos_4k, a project with that name would be initialised whenever that dataset is used, and the task name can be extended to whatever the model is, giving things like Cat_videos_4k/Yolov8 and so on.
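The naming convention itself can be sketched without touching ClearML at all. The helper below is hypothetical and is one possible reading of the example above: the project is named after the dataset, the task after the model, and the two are joined with a slash for display.

```python
from typing import NamedTuple


class ExperimentNames(NamedTuple):
    """Project and task names derived from a dataset and a model."""

    project_name: str
    task_name: str

    @property
    def label(self) -> str:
        # Combined form used in the example, e.g. "Cat_videos_4k/Yolov8".
        return f"{self.project_name}/{self.task_name}"


def names_from_dataset(dataset_name: str, model_name: str) -> ExperimentNames:
    """Name the project after the dataset and the task after the model."""
    dataset_name = dataset_name.strip()
    model_name = model_name.strip()
    if not dataset_name or not model_name:
        raise ValueError("dataset_name and model_name must be non-empty")
    return ExperimentNames(project_name=dataset_name, task_name=model_name)
```

The two fields could then be passed as project_name and task_name to Task.init.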
Motivation
Save steps when using datasets with tasks. Ensure that datasets and experiments created using those datasets are easily identifiable.