allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.53k stars 642 forks

Enhancement Request: Improved ClearML Data Management for Child Datasets #1091

Open Heegreis opened 1 year ago

Heegreis commented 1 year ago

Proposal Summary

I've been using ClearML Data and encountered several issues with child datasets. Specifically:

  1. When a file is renamed or moved without its content changing, the "FILES CHANGED" log shows "Added 1" and "Removed 1". It would be more intuitive to record this as "Renamed", similar to Git's behavior. In addition, the fileserver keeps a duplicate copy of the file after the rename; this could be avoided by linking files in child datasets to the parent dataset's files via their SHA identifiers. To rename a file in a child dataset with ClearML Data, I removed the file and then added the same file under the new name.

  2. If a file is removed and then the same file (same filename and path) is added back, the "FILES CHANGED" log registers it as "Modified 1", even though the dataset content has not actually changed. The fileserver also stores the identical file (same filename, path, and content) a second time.
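To make the requested behavior concrete, here is a minimal sketch of how changes between a parent and a child dataset could be classified by content hash. It is plain Python and does not use or describe ClearML's internals; `classify_changes`, `sha256_of`, and the path-to-hash manifest format are hypothetical names chosen for illustration.

```python
import hashlib
from collections import defaultdict


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's content."""
    return hashlib.sha256(data).hexdigest()


def classify_changes(parent: dict, child: dict) -> dict:
    """Compare two manifests (relative path -> content hash).

    Hypothetical helper: a path that disappears while an identical hash
    shows up under a new path is reported as a rename, not add + remove.
    """
    changes = {"renamed": [], "added": [], "removed": [],
               "modified": [], "unchanged": []}

    # Paths missing from the child, grouped by their content hash.
    removed_by_hash = defaultdict(list)
    for path in sorted(parent):
        if path not in child:
            removed_by_hash[parent[path]].append(path)

    for path in sorted(child):
        digest = child[path]
        if path in parent:
            # Same path: compare content, so remove + re-add is "unchanged".
            key = "unchanged" if parent[path] == digest else "modified"
            changes[key].append(path)
        elif removed_by_hash[digest]:
            # New path whose content matches a removed file: a rename.
            changes["renamed"].append((removed_by_hash[digest].pop(0), path))
        else:
            changes["added"].append(path)

    # Whatever was not matched to a new path is a real removal.
    for paths in removed_by_hash.values():
        changes["removed"].extend(paths)
    changes["removed"].sort()
    return changes


parent = {"images/cat.png": sha256_of(b"cat"), "labels.csv": sha256_of(b"a,b")}
child = {"images/kitten.png": sha256_of(b"cat"), "labels.csv": sha256_of(b"a,b")}
result = classify_changes(parent, child)
```

In this example the moved image ends up under `renamed` and `labels.csv` under `unchanged`, which is the outcome both cases above ask for.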

Motivation

By addressing these issues, I believe we can achieve better dataset state management and significantly reduce the fileserver's storage consumption.
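As a rough illustration of the storage saving, the sketch below keeps file content in a store keyed by its SHA-256 digest, so a renamed or re-added file in a child dataset points at the blob already held for the parent. This is an assumption about one possible design, not a description of how the ClearML fileserver works; `ContentAddressedStore` and the manifest dictionaries are hypothetical.

```python
import hashlib


class ContentAddressedStore:
    """Hypothetical in-memory store that keeps one blob per unique content."""

    def __init__(self):
        self.blobs = {}  # hex digest -> file content

    def put(self, data: bytes) -> str:
        """Store content if it is new and return its digest either way."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)
        return digest

    def stored_bytes(self) -> int:
        """Total size of all unique blobs held by the store."""
        return sum(len(blob) for blob in self.blobs.values())


store = ContentAddressedStore()
payload = b"x" * 1000

# The parent dataset uploads the file once.
parent_manifest = {"data/a.bin": store.put(payload)}
# The child "renames" it: same content, new path, no second copy stored.
child_manifest = {"data/renamed.bin": store.put(payload)}

total = store.stored_bytes()
```

Both manifests reference the same digest, so the store holds 1000 bytes rather than 2000.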

ainoam commented 1 year ago

Thanks for proposing this, @Heegreis.

We'll look into how this can be addressed in future versions.

Danie1Nash commented 1 month ago

I would like to know if there are any plans to implement this feature in ClearML in the future.

Thank you!

ainoam commented 3 weeks ago

Thanks for the interest, @Danie1Nash. This item is definitely on the roadmap, but has not yet been slated for a specific release.