Datasets re-downloaded if processed files change

rettigl commented 6 days ago

If you remove or rename the processed buffer file folders inside a dataset, the data is downloaded again, even though the dataset module says it will not do so.

INFO - Not downloading Gd_W110 data as it already exists at "/mnt/pcshare/users/Laurenz/AreaB/sed/sed/tutorial/datasets/Gd_W110".
Set 'use_existing' to False if you want to download to a new location.
INFO - Using existing data path for "Gd_W110": "/mnt/pcshare/users/Laurenz/AreaB/sed/sed/tutorial/datasets/Gd_W110"
100% 3.27G/3.27G [00:36<00:00, 95.2MB/s]
INFO - Download complete.
INFO - Extracting Gd_W110 data...
100% 86/86 [00:11<00:00, 11.49file/s]
INFO - Gd_W110 data extracted successfully.
INFO - Removed Gd_W110.zip file.
INFO - Rearranging files in analysis_data.
100% 63/63 [00:00<00:00, 168.89file/s]
INFO - File movement complete.
INFO - Rearranging files in calibration_data.
100% 13/13 [00:00<00:00, 136.01file/s]
INFO - File movement complete.
INFO - Rearranging complete.
/mnt/pcshare/users/Laurenz/AreaB/sed/sed/tutorial/datasets/Gd_W110_

zain-sohail commented 6 days ago

Indeed, this is not intended behavior. Can you look at the datasets.json dict/file? DatasetsManager.load_datasets_dict() should work for that.

I assume the processed folder was somehow also saved in the list of files. If the files listed in the json do not match the files in the folder, it tries to re-extract the data.
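
To illustrate the suspected mechanism, here is a minimal sketch, not the actual sed code. The helper names `list_files` and `files_match` and the folder layout are made up for the example. It only shows how a file list recorded while a processed folder existed stops matching once that folder is renamed:

```python
import tempfile
from pathlib import Path


def list_files(data_dir: Path) -> set[str]:
    """Return all files below data_dir as POSIX-style relative paths."""
    return {
        p.relative_to(data_dir).as_posix()
        for p in data_dir.rglob("*")
        if p.is_file()
    }


def files_match(recorded: list[str], data_dir: Path) -> bool:
    """True only if the recorded list and the folder contents are identical."""
    return set(recorded) == list_files(data_dir)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "raw").mkdir()
    (root / "raw" / "scan_1.h5").write_text("raw")
    (root / "processed").mkdir()
    (root / "processed" / "buffer_1.parquet").write_text("buffer")

    # Assumption: the list was recorded while the processed folder existed.
    recorded = sorted(list_files(root))
    match_before = files_match(recorded, root)

    # Renaming the processed folder makes the recorded list stale.
    (root / "processed").rename(root / "processed_old")
    match_after = files_match(recorded, root)

print(match_before, match_after)
```

If the real check behaves like this strict comparison, any change to a processed folder that ended up in the recorded list would be enough to trigger a fresh download.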

rettigl commented 6 days ago

Yes, they are all in the json file. This should only contain the extracted files, I would say.
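
One way to get there, sketched under assumptions: take the recorded list from the archive itself instead of scanning the folder afterwards. The function name `extracted_file_list` and the file names are hypothetical, and the sketch ignores the rearranging step seen in the log above:

```python
import io
import zipfile


def extracted_file_list(archive: zipfile.ZipFile) -> list[str]:
    """Names of regular files in the archive (directory entries are skipped)."""
    return sorted(
        info.filename for info in archive.infolist() if not info.is_dir()
    )


# Build a small in-memory archive as a stand-in for the downloaded zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("analysis_data/scan_1.h5", "raw")
    zf.writestr("calibration_data/cal_1.h5", "raw")

with zipfile.ZipFile(buffer) as zf:
    file_list = extracted_file_list(zf)

print(file_list)
```

A list built this way cannot contain folders that are created later by processing, so removing or renaming them would not affect the comparison.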

rettigl commented 6 days ago

And it's also pretty confusing that it says it reuses the existing data, yet still downloads it...

zain-sohail commented 5 days ago

Two checks take place. The first just checks whether the path is in the json file:
https://github.com/OpenCOMPES/sed/blob/86978c08be702f550ae10c04be1357cc012ebcf0/sed/dataset/dataset.py#L175-L183
The second checks whether the files match:
https://github.com/OpenCOMPES/sed/blob/86978c08be702f550ae10c04be1357cc012ebcf0/sed/dataset/dataset.py#L363-L364
The log messages need to be improved to reflect that.

Here, the issue is that the second check fails. I can't understand how the processed folder ended up in the files key. Somehow it took the else branch even though the data was present, and overwrote the file_list here: https://github.com/OpenCOMPES/sed/blob/86978c08be702f550ae10c04be1357cc012ebcf0/sed/dataset/dataset.py#L373
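
As a rough sketch of what clearer log messages for the two checks could look like: this is not the code behind the links above. The names `Action` and `decide`, the registry layout, and the message wording are all assumptions for illustration:

```python
import logging
from enum import Enum

logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")
logger = logging.getLogger("dataset_sketch")


class Action(Enum):
    DOWNLOAD = "download"
    REFETCH = "refetch"
    REUSE = "reuse"


def decide(name: str, registry: dict, files_on_disk: set) -> Action:
    """Run both checks and log which one determined the outcome."""
    entry = registry.get(name)
    # Check 1: is there a recorded entry (path) for this dataset at all?
    if entry is None:
        logger.info('No existing entry for "%s"; downloading.', name)
        return Action.DOWNLOAD
    # Check 2: do the recorded files match what is actually on disk?
    if set(entry["files"]) != files_on_disk:
        logger.info(
            'Path for "%s" exists, but its files differ from the recorded '
            "list; fetching the data again.",
            name,
        )
        return Action.REFETCH
    logger.info('Using existing data for "%s" at "%s".', name, entry["path"])
    return Action.REUSE


# Hypothetical registry entry that also recorded a processed buffer file.
registry = {
    "Gd_W110": {
        "path": "datasets/Gd_W110",
        "files": ["raw/scan_1.h5", "processed/buffer_1.parquet"],
    }
}

unknown = decide("Other", registry, set())
stale = decide("Gd_W110", registry, {"raw/scan_1.h5"})
intact = decide(
    "Gd_W110", registry, {"raw/scan_1.h5", "processed/buffer_1.parquet"}
)
```

With one message per outcome, the log would no longer claim that existing data is being used right before a download starts.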

rettigl commented 5 days ago

Somehow I cannot trigger this behavior anymore right now. I will close this for now, until it happens again.

Datasets re-downloaded if processed files change #461