allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.57k stars, 644 forks

Data generation tracking #941

Open LPBurgess opened 1 year ago

LPBurgess commented 1 year ago

Proposal Summary

I have a script that generates a large amount of data based on configuration files. When I create a Dataset, it would be nice if the configuration files that were used to generate the dataset were saved along with it.

Motivation

This would facilitate data generation and data versioning, since we could track which configuration files were used to create each version of the data.

Related Discussion

https://app.slack.com/client/TT9ATQXJ5/CTK20V944/thread/CTK20V944-1677537334.728259

ainoam commented 1 year ago

Thanks for an excellent suggestion @LPBurgess.

Until clearml-data provides a built-in interface for this, you can use the task that underlies the dataset to attach the configuration file:

dataset._task.connect_configuration(configuration="path/to/file", name="my config")
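
For a generation script that reads several configuration files, the one-liner above could be wrapped in a small helper. The following is only a sketch, not an official clearml-data interface: the helper name `attach_generation_configs`, the choice of the file stem as the configuration name, and the project, dataset, and path names in `example_usage` are all hypothetical. It also depends on the private `_task` attribute, which may change between clearml versions.

```python
from pathlib import Path


def attach_generation_configs(dataset, config_paths):
    """Attach each configuration file to the task backing a ClearML dataset.

    Hypothetical helper: it relies on the private `_task` attribute, which is
    not part of the public clearml API and may change without notice.
    Returns the list of configuration names that were attached.
    """
    attached = []
    for path in map(Path, config_paths):
        if not path.is_file():
            raise FileNotFoundError(f"config file not found: {path}")
        # Assumption: use the file stem as the configuration name, so two
        # files with the same stem would overwrite each other.
        name = path.stem
        dataset._task.connect_configuration(configuration=str(path), name=name)
        attached.append(name)
    return attached


def example_usage():
    # Imported lazily so the helper above can be read and tested without clearml.
    from clearml import Dataset

    # All names and paths below are placeholders for illustration only.
    dataset = Dataset.create(
        dataset_project="data-generation",
        dataset_name="synthetic-v1",
    )
    attach_generation_configs(dataset, ["configs/generator.yaml"])
    dataset.add_files(path="generated_data/")
    dataset.upload()
    dataset.finalize()
```

Attaching the configurations before `finalize()` keeps them alongside the dataset version they produced, which is the tracking the proposal asks for.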