allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.7k stars 657 forks

Incorrect handling of modified files when creating a child dataset with external links on a zipped parent #1323

Open Phanty133 opened 2 months ago

Phanty133 commented 2 months ago

Describe the bug

I've come across a bug when creating a child dataset that overwrites files from its parent dataset. The new files in the child are added as external files, while the files in the parent are zipped. As a result, the modified files appear both as link entries and as file entries. Because of how get_local_copy() is currently implemented, the newer external files are always overwritten with symlinks to the parent dataset's files.

The file/link inconsistency also shows up in the dataset dashboard's file/link and files-changed counters. I would have expected it to show either added 1 and modified 1, or added 2 and modified 0, but it somehow ends up at added 2 and modified 1.

To reproduce

Code to reproduce: Pastebin

Expected behaviour

I expect parent symlinking to be handled correctly for zipped files that aren't overwritten by external files, and an external file that overwrites a zipped file should not itself be replaced with a symlink to the older parent dataset file.
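
To make the expectation concrete, here is a minimal standalone sketch of the resolution rule I have in mind. It does not use ClearML's internals: `Entry`, `resolve_entries`, `count_changes`, the archive name and the example URLs are all hypothetical names made up for illustration, and the counters follow just one of the two readings mentioned above (added 1, modified 1).

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a dataset entry (not a ClearML class)."""
    relative_path: str
    source: str    # archive name or external URL the content comes from
    is_link: bool  # True for external link entries, False for zipped file entries


def resolve_entries(parent: Dict[str, Entry], child: Dict[str, Entry]) -> Dict[str, Entry]:
    """Start from the parent's entries, then let the child's entries win per path.

    A child link entry for a path replaces the parent's zipped file entry,
    so the older parent file is never used for that path.
    """
    resolved = dict(parent)
    resolved.update(child)
    return resolved


def count_changes(parent: Dict[str, Entry], child: Dict[str, Entry]) -> Tuple[int, int]:
    """Return (added, modified), counting each child path exactly once."""
    added = sum(1 for path in child if path not in parent)
    modified = sum(1 for path in child if path in parent)
    return added, modified


# Parent holds two zipped files; the child overwrites one and adds one,
# both as external links (the URLs are placeholders).
parent = {
    "a.txt": Entry("a.txt", "parent_archive.zip", False),
    "b.txt": Entry("b.txt", "parent_archive.zip", False),
}
child = {
    "a.txt": Entry("a.txt", "s3://example-bucket/a.txt", True),
    "c.txt": Entry("c.txt", "s3://example-bucket/c.txt", True),
}

resolved = resolve_entries(parent, child)
for path in sorted(resolved):
    entry = resolved[path]
    kind = "external link" if entry.is_link else "parent symlink"
    print(f"{path}: {kind} -> {entry.source}")
print("added, modified =", count_changes(parent, child))
```

In this sketch only b.txt would be symlinked from the parent, while a.txt and c.txt come from the external links, and each changed path is counted once.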

Environment