ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
Describe the bug
I've come across a bug when creating a child dataset that overwrites files from its parent dataset. The new files in the child are added as external files, while the files in the parent are zipped. As a result, the modified files appear in both the link entries and the file entries, and because of how get_local_copy() is currently implemented, the newer external files are always overwritten with symlinks to the parent dataset files. The same file/link inconsistency shows up in the dataset dashboard's file/link and files-changed counters: I would have expected either 1 added and 1 modified, or 2 added and 0 modified, but it ends up showing 2 added and 1 modified.
To reproduce
Code to reproduce: Pastebin
Expected behaviour
I expect parent symlinking to be handled correctly for zipped files that aren't overwritten by external files, and an external file that overwrites a zipped file should not be replaced with a symlink to the older parent dataset file.
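To make the expected precedence concrete, here is a minimal toy model in plain Python. It is not ClearML's implementation and does not use the ClearML API. The `Entry` class, the `resolve_local_copy` and `change_counters` helpers, and the file names and URLs are all hypothetical, chosen only to illustrate the behaviour I'd expect: child external links win over parent zipped files with the same relative path, and a path counts as modified only if the parent already had it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a dataset file/link entry."""
    relative_path: str
    source: str       # archive name or external URL the content comes from
    dataset_id: str   # dataset that contributed this entry


def resolve_local_copy(parent_files: Dict[str, Entry],
                       child_links: Dict[str, Entry]) -> Dict[str, Entry]:
    """Decide which entry should back each path in the local copy."""
    # Start from the parent's zipped files (these would become symlinks).
    resolved = dict(parent_files)
    # The child's external files take precedence over same-path parent files.
    resolved.update(child_links)
    return resolved


def change_counters(parent_files: Dict[str, Entry],
                    child_links: Dict[str, Entry]) -> Dict[str, int]:
    """Count each child path exactly once: as added or as modified."""
    modified = sum(1 for path in child_links if path in parent_files)
    added = len(child_links) - modified
    return {"added": added, "modified": modified}


# Hypothetical example: the parent has two zipped files, the child overwrites
# one of them with an external file and adds one new external file.
parent_files = {
    "a.txt": Entry("a.txt", "parent_archive.zip", "parent"),
    "b.txt": Entry("b.txt", "parent_archive.zip", "parent"),
}
child_links = {
    "a.txt": Entry("a.txt", "s3://example-bucket/a.txt", "child"),
    "c.txt": Entry("c.txt", "s3://example-bucket/c.txt", "child"),
}

resolved = resolve_local_copy(parent_files, child_links)
counters = change_counters(parent_files, child_links)

for path in sorted(resolved):
    print(path, "<-", resolved[path].dataset_id, resolved[path].source)
print(counters)
```

In this model `a.txt` resolves to the child's external file, `b.txt` is still served from the parent, and every child path is counted exactly once, so the counters can never report a single file as both added and modified.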
Environment
Related Discussion
Original slack message