allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.69k stars 655 forks

Optimize `Dataset.finalize()` runtime #1174

Open MightyGoldenJA opened 11 months ago

MightyGoldenJA commented 11 months ago

Proposal Summary

`Dataset.finalize()` should pull only the current dataset and its parent dataset for diff generation, not the entire ancestry tree.
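To illustrate the idea, here is a minimal sketch of a parent-only diff. It is not ClearML code: `DatasetVersion`, `diff_against_parent`, and `ancestry_depth` are hypothetical names, and it assumes each version can expose its full file listing (path to content hash) and has at most one parent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DatasetVersion:
    """Hypothetical stand-in for a dataset version (not a ClearML class)."""
    name: str
    files: Dict[str, str]  # relative path -> content hash (assumed full listing)
    parent: Optional["DatasetVersion"] = None


def diff_against_parent(version: DatasetVersion) -> Dict[str, List[str]]:
    """Compute the diff using only the version itself and its direct parent."""
    parent_files = version.parent.files if version.parent else {}
    added = sorted(p for p in version.files if p not in parent_files)
    modified = sorted(
        p for p, h in version.files.items()
        if p in parent_files and parent_files[p] != h
    )
    removed = sorted(p for p in parent_files if p not in version.files)
    return {"added": added, "modified": modified, "removed": removed}


def ancestry_depth(version: DatasetVersion) -> int:
    """Count how many versions a full ancestry walk would have to load."""
    count = 0
    node: Optional[DatasetVersion] = version
    while node is not None:
        count += 1
        node = node.parent
    return count


# Example chain: root -> v1 -> v2
root = DatasetVersion("root", {"a.csv": "h1"})
v1 = DatasetVersion("v1", {"a.csv": "h1", "b.csv": "h2"}, parent=root)
v2 = DatasetVersion("v2", {"a.csv": "h1x", "c.csv": "h3"}, parent=v1)

print(diff_against_parent(v2))  # touches 2 versions (v2 and v1)
print(ancestry_depth(v2))       # a full walk would touch all 3
```

In this sketch the diff cost depends only on the size of two versions, while a full ancestry walk grows with the depth of the version tree. Whether the same holds inside ClearML depends on how it stores per-version file state.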

Motivation

Avoid having `.finalize()` take 20 minutes on large version trees.

Related Discussion

https://clearml.slack.com/archives/CTK20V944/p1701856665977449

ainoam commented 11 months ago

Thanks for the suggestion, @MightyGoldenJA. We'll try to address this in an upcoming release.