allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

clearml-data not tracking deleted items. #785

Open jax79sg opened 2 years ago

jax79sg commented 2 years ago

Describe the bug

I have a ClearML dataset containing 4 files. I created a child dataset, deleted 2 of the files locally, then synced the folder back and finalised the child. The task console on the ClearML server shows 2 files removed. However, when I ran clearml-data get for the child dataset, I got back all 4 files instead of only the 2 files that were not removed.

To reproduce

(venv) tkahsion@pop-os:~/projects/00.ingestion$ clearml-data create --project testfile --name papaset
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
ClearML results page: https://app.clear.ml/projects/17afeba9fd57451fa2ea54deaec0e403/experiments/71bc88a785484d748bf842ada4e02b80/output/log
ClearML dataset page: https://app.clear.ml/datasets/simple/17afeba9fd57451fa2ea54deaec0e403/experiments/71bc88a785484d748bf842ada4e02b80
New dataset created id=71bc88a785484d748bf842ada4e02b80

(venv) tkahsion@pop-os:~/projects/00.ingestion$ clearml-data add --files *.txt
clearml-data - Dataset Management & Versioning CLI
Adding files/folder/links to dataset id 71bc88a785484d748bf842ada4e02b80
2022-09-23 17:56:17,955 - clearml - INFO - Dataset.get() did not specify alias. Dataset information won’t be automatically logged in ClearML Server.
4 files added

(venv) tkahsion@pop-os:~/projects/00.ingestion$ clearml-data upload
clearml-data - Dataset Management & Versioning CLI
uploading local files to dataset id 71bc88a785484d748bf842ada4e02b80
2022-09-23 17:56:59,178 - clearml - INFO - Dataset.get() did not specify alias. Dataset information won’t be automatically logged in ClearML Server.
Uploading dataset changes (4 files compressed to 370 B) to https://files.clear.ml
File compression and upload completed: total size 370 B, 1 chunk(s) stored (average size 370 B)
Dataset upload completed

(venv) tkahsion@pop-os:~/projects/00.ingestion$ clearml-data close
clearml-data - Dataset Management & Versioning CLI
Finalizing dataset id 71bc88a785484d748bf842ada4e02b80
2022-09-23 17:57:30,085 - clearml - INFO - Dataset.get() did not specify alias. Dataset information won’t be automatically logged in ClearML Server.
Dataset closed and finalized

(venv) tkahsion@pop-os:~/projects/00.ingestion$ clearml-data create --parents 71bc88a785484d748bf842ada4e02b80 --name babyset --tag baby
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
2022-09-23 17:58:59,284 - clearml - INFO - Dataset.get() did not specify alias. Dataset information won’t be automatically logged in ClearML Server.
ClearML results page: https://app.clear.ml/projects/1c172a43b62a40b7a26f11f1e9f869ed/experiments/3438ea853795471e940b8306daededb4/output/log
ClearML dataset page: https://app.clear.ml/datasets/simple/1c172a43b62a40b7a26f11f1e9f869ed/experiments/3438ea853795471e940b8306daededb4
New dataset created id=3438ea853795471e940b8306daededb4

(venv) tkahsion@pop-os:~/projects/00.ingestion$ mkdir papaset

(venv) tkahsion@pop-os:~/projects/00.ingestion$ clearml-data get --copy papaset
clearml-data - Dataset Management & Versioning CLI
Download dataset id 3438ea853795471e940b8306daededb4
2022-09-23 18:01:07,263 - clearml - INFO - Dataset.get() did not specify alias. Dataset information won’t be automatically logged in ClearML Server.
Dataset local copy available for files at: papaset

(venv) tkahsion@pop-os:~/projects/00.ingestion$ ls papaset/
1.txt  2.txt  3.txt  4.txt

(venv) tkahsion@pop-os:~/projects/00.ingestion$ rm papaset/1.txt papaset/2.txt 

(venv) tkahsion@pop-os:~/projects/00.ingestion$ ls papaset/
3.txt  4.txt

(venv) tkahsion@pop-os:~/projects/00.ingestion$ clearml-data sync  --folder papaset
clearml-data - Dataset Management & Versioning CLI
Syncing dataset id 3438ea853795471e940b8306daededb4 to local folder papaset
2022-09-23 18:02:58,472 - clearml - INFO - Dataset.get() did not specify alias. Dataset information won’t be automatically logged in ClearML Server.
Generating SHA2 hash for 2 files
Hash generation completed
Sync completed: 2 files removed, 0 added, 0 modified
Finalizing dataset
Pending uploads, starting dataset upload to https://files.clear.ml
File compression and upload completed: total size 0 B, 0 chunk(s) stored (average size 0 B)
Dataset closed and finalized

(venv) tkahsion@pop-os:~/projects/00.ingestion$ sudo rm -r papaset
[sudo] password for tkahsion: 

(venv) tkahsion@pop-os:~/projects/00.ingestion$ clearml-data get --id 3438ea853795471e940b8306daededb4 --copy babyset
clearml-data - Dataset Management & Versioning CLI
Download dataset id 3438ea853795471e940b8306daededb4
2022-09-23 18:05:04,024 - clearml - INFO - Dataset.get() did not specify alias. Dataset information won’t be automatically logged in ClearML Server.
Dataset local copy available for files at: babyset

(venv) tkahsion@pop-os:~/projects/00.ingestion$ ls babyset/
1.txt  2.txt  3.txt  4.txt

Expected behaviour

(venv) tkahsion@pop-os:~/projects/00.ingestion$ ls babyset/
3.txt  4.txt
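
To spell out the semantics I expected, here is a minimal toy model in plain Python. It is only a sketch of the idea that a child version is "parent files minus removals plus additions"; the `DatasetVersion` class and `resolve_files` method are hypothetical names made up for illustration and are not part of the ClearML SDK or a description of how it stores versions internally.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DatasetVersion:
    """Hypothetical stand-in for one dataset version: a parent plus a diff."""

    name: str
    parent: Optional["DatasetVersion"] = None
    added: Set[str] = field(default_factory=set)
    removed: Set[str] = field(default_factory=set)

    def resolve_files(self) -> Set[str]:
        """Return the files a download of this version is expected to contain."""
        inherited = self.parent.resolve_files() if self.parent else set()
        # Removals recorded in this version must hide files inherited from the parent.
        return (inherited - self.removed) | self.added


# Mirror the reproduction steps: papaset has 4 files, babyset removes 2 of them.
papaset = DatasetVersion("papaset", added={"1.txt", "2.txt", "3.txt", "4.txt"})
babyset = DatasetVersion("babyset", parent=papaset, removed={"1.txt", "2.txt"})

print(sorted(babyset.resolve_files()))  # ['3.txt', '4.txt']
```

Under this model the parent still resolves to all 4 files, while the child resolves to 3.txt and 4.txt only, which is what I expected clearml-data get to return.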

Environment

erezalg commented 2 years ago

Hi @jax79sg,

Sorry for the slow reply on this. I'm actually not 100% sure what should happen on ClearML's side, since you added the files with clearml-data add but then removed them by calling clearml-data sync.

Can you shed some light on your workflow? Why not use sync on both datasets, or use add / remove on both?
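
For context on what a sync-style comparison involves, the sketch below hashes a local folder and diffs it against a recorded file list, producing the same kind of "removed / added / modified" summary that appears in the sync log above. It is an illustration only: `hash_folder` and `diff_against_manifest` are hypothetical helpers written for this example, and nothing here should be read as how clearml-data sync is actually implemented.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Set, Tuple


def hash_folder(folder: Path) -> Dict[str, str]:
    """Map each file's relative path to the SHA-256 hash of its contents."""
    result: Dict[str, str] = {}
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            rel = path.relative_to(folder).as_posix()
            result[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def diff_against_manifest(
    manifest: Dict[str, str], folder: Path
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Compare a recorded manifest with the folder's current state."""
    local = hash_folder(folder)
    removed = set(manifest) - set(local)   # recorded, but no longer on disk
    added = set(local) - set(manifest)     # on disk, but never recorded
    modified = {p for p in set(local) & set(manifest) if local[p] != manifest[p]}
    return removed, added, modified


# Recreate the scenario from the report in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("1.txt", "2.txt", "3.txt", "4.txt"):
        (folder / name).write_text(name)
    manifest = hash_folder(folder)          # state of the parent dataset
    (folder / "1.txt").unlink()
    (folder / "2.txt").unlink()
    removed, added, modified = diff_against_manifest(manifest, folder)

print(f"{len(removed)} files removed, {len(added)} added, {len(modified)} modified")
```

In this sketch the removals only exist as the result of the comparison, so whatever consumes the diff has to persist them explicitly for a later download to honour them.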