dandi / dandisets

737 Dandisets, 812.2 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

update-from-backup: add --force=consistency-update option #231

Closed yarikoptic closed 2 years ago

yarikoptic commented 2 years ago

Follow-up to https://github.com/dandi/dandisets/issues/230#issuecomment-1194885177. We should be able to run a complete check/fix-up (add/remove files and ensure .dandi/assets.json is up to date) without shortcuts: get the list of assets, ensure that they (and only they) are available locally, and save the current .dandi/assets.json and dandiset.yaml. ATM, for 000026, we manually reflected the removal of some assets via git rm && git commit --amend because the prior run failed to remove them. If we now run

(base) dandi@drogon:/mnt/backup/dandi/dandisets$ ( set -eu; conda activate dandisets ; python -m tools.backups2datalad -l INFO --backup-root "/mnt/backup/dandi/"  --config tools/backups2datalad.cfg.yaml update-from-backup --workers 2 000026 ; )
2022-07-26T11:40:36-0400 [INFO    ] backups2datalad: Saving logs to /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.07.26.15.40.36Z.log
2022-07-26T11:40:40-0400 [WARNING ] dandi: A newer version (0.45.1) of dandi/dandi-cli is available. You are using 0.40.0
2022-07-26T11:40:40-0400 [INFO    ] backups2datalad: Dandiset 000026: Syncing
2022-07-26T11:41:10-0400 [INFO    ] backups2datalad: Updating metadata file
2022-07-26T11:41:11-0400 [INFO    ] backups2datalad: Dandiset 000026: Syncing assets...
...
2022-07-26T12:10:32-0400 [DEBUG   ] backups2datalad: Dandiset 000026: sub-KC001/ses-MRI/anat/sub-KC001_ses-MRI_echo-4_flip-4_VFA.nii.gz: metadata unchanged; not taking any further action
2022-07-26T12:10:32-0400 [DEBUG   ] httpx._client: HTTP Request: GET https://api.dandiarchive.org/api/assets/a2f7a01a-1820-4b8a-95ea-3911e6853a26/ "HTTP/1.1 200 OK"
2022-07-26T12:10:32-0400 [INFO    ] backups2datalad: Dandiset 000026: Finished getting assets from API
2022-07-26T12:10:32-0400 [DEBUG   ] backups2datalad: Dandiset 000026: sub-KC001/sub-KC001_sessions.tsv: metadata unchanged; not taking any further action
2022-07-26T12:10:32-0400 [DEBUG   ] backups2datalad: Dandiset 000026: Done feeding URLs to addurl
2022-07-26T12:10:32-0400 [DEBUG   ] backups2datalad: Waiting for `git-annex addurl --batch --with-files --jobs 5 --json --json-error-messages --json-progress --raw` [cwd=/mnt/backup/dandi/dandisets/000026] to terminate
2022-07-26T12:10:32-0400 [DEBUG   ] backups2datalad: Command `git-annex addurl --batch --with-files --jobs 5 --json --json-error-messages --json-progress --raw` [cwd=/mnt/backup/dandi/dandisets/000026] exited with return code 0
2022-07-26T12:10:32-0400 [DEBUG   ] backups2datalad: Dandiset 000026: Done reading from addurl
2022-07-26T12:10:32-0400 [DEBUG   ] backups2datalad: Waiting for `git-annex addurl --batch --with-files --jobs 5 --json --json-error-messages --json-progress --raw` [cwd=/mnt/backup/dandi/dandisets/000026] to terminate
2022-07-26T12:10:32-0400 [DEBUG   ] backups2datalad: Command `git-annex addurl --batch --with-files --jobs 5 --json --json-error-messages --json-progress --raw` [cwd=/mnt/backup/dandi/dandisets/000026] exited with return code 0
2022-07-26T12:11:34-0400 [INFO    ] backups2datalad: Dandiset 000026: No assets downloaded for this version segment; not committing
2022-07-26T12:11:34-0400 [INFO    ] backups2datalad: Dandiset 000026: Asset sync complete!
2022-07-26T12:11:34-0400 [INFO    ] backups2datalad: Dandiset 000026: 0 assets added
2022-07-26T12:11:34-0400 [INFO    ] backups2datalad: Dandiset 000026: 0 assets updated
2022-07-26T12:11:34-0400 [INFO    ] backups2datalad: Dandiset 000026: 0 assets registered
2022-07-26T12:11:34-0400 [INFO    ] backups2datalad: Dandiset 000026: 0 assets successfully downloaded
2022-07-26T12:12:28-0400 [DEBUG   ] backups2datalad: Dandiset 000026: Checking whether repository is dirty ...
2022-07-26T12:15:29-0400 [DEBUG   ] backups2datalad: Dandiset 000026: Repository is clean
2022-07-26T12:15:29-0400 [INFO    ] backups2datalad: Dandiset 000026: No changes made to repository
2022-07-26T12:15:29-0400 [DEBUG   ] backups2datalad: Dandiset 000026: Running `git gc`
2022-07-26T12:15:29-0400 [DEBUG   ] backups2datalad: Running: git gc [cwd=/mnt/backup/dandi/dandisets/000026]

we still have them listed in .dandi/assets.json:

(dandi-cli-latest) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ grep rawdata .dandi/assets.json  | head -n 1
"rawdata/dataset_description.json",
(dandi-cli-latest) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ ls -l rawdata
ls: cannot access 'rawdata': No such file or directory

I am not quite sure why the run didn't detect that it needed to save an updated assets.json (maybe some optimization, or maybe a bug to fix), but I think we would benefit from an option to ensure that any dandiset we process ends up in a consistent state, i.e. one matching the server state.
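The mismatch shown above (a path recorded in assets.json but absent from the working tree) can be detected with a check along these lines. This is only a minimal sketch, not code from backups2datalad: the function name `missing_locally` is hypothetical, and it takes an already extracted list of paths, because the exact layout of .dandi/assets.json is not shown in this thread. A symlink counts as present even when it is broken, since git-annex keeps annexed files as symlinks whose content may not be local.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def missing_locally(recorded_paths: list[str], dataset_root: str | Path) -> list[str]:
    """Return the recorded asset paths that have no entry in the working tree.

    Hypothetical helper: `recorded_paths` is assumed to be the list of
    relative asset paths taken from the local listing.
    """
    root = Path(dataset_root)
    missing = []
    for rel in recorded_paths:
        candidate = root / rel
        # exists() follows symlinks, so a broken annex symlink would look
        # absent; is_symlink() keeps such files counted as present.
        if not (candidate.exists() or candidate.is_symlink()):
            missing.append(rel)
    return missing


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    kept = Path(tmp, "sub-KC001", "sub-KC001_sessions.tsv")
    kept.parent.mkdir(parents=True)
    kept.write_text("session_id\n")
    demo_missing = missing_locally(
        ["rawdata/dataset_description.json", "sub-KC001/sub-KC001_sessions.tsv"],
        tmp,
    )
```

A non-empty result would mean the local listing and the working tree disagree, which is the condition the proposed option is meant to repair.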

jwodder commented 2 years ago

@yarikoptic The script is already supposed to align assets.json and the set of local files with what's on the server. I don't know why that's not happening (the fact that assets.json wasn't updated implies the problem wasn't with datalad save breaking up the command), and unless you're proposing a different method of operation, a "force consistency" option wouldn't make a difference.

jwodder commented 2 years ago

@yarikoptic Let me describe how the asset backup works (ignoring the behavior around handling of published versions):

Is there any part of this procedure that you want to change?

I don't know why rawdata/ wasn't removed automatically when it should have been, but since you (I assume) manually deleted the folder, the assets currently won't get deleted from .dandi/assets.json without manual editing.

yarikoptic commented 2 years ago

To the 2nd step ("retrieves each asset from the API"), add: collect the paths of remote assets into "remote_paths".

Then, in the last step, before "Then, everything still in asset_metadata is dumped to .dandi/assets.json.", add: remove from asset_metadata any asset whose path is not present in remote_paths.
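The proposed change could look roughly like the sketch below. It assumes that asset_metadata is a dict keyed by asset path and that remote_paths is a set of path strings. Neither structure is confirmed in this thread, and `prune_stale_assets` is a hypothetical name rather than an existing backups2datalad function.

```python
from __future__ import annotations

from typing import Any


def prune_stale_assets(
    asset_metadata: dict[str, dict[str, Any]],
    remote_paths: set[str],
) -> list[str]:
    """Drop entries whose path the server no longer reports.

    Mutates `asset_metadata` in place and returns the removed paths (sorted),
    so the caller can report how many entries were pruned.
    """
    stale = sorted(path for path in asset_metadata if path not in remote_paths)
    for path in stale:
        del asset_metadata[path]
    return stale


# Assumed shapes, for illustration only.
asset_metadata = {
    "rawdata/dataset_description.json": {"path": "rawdata/dataset_description.json"},
    "sub-KC001/sub-KC001_sessions.tsv": {"path": "sub-KC001/sub-KC001_sessions.tsv"},
}
remote_paths = {"sub-KC001/sub-KC001_sessions.tsv"}  # collected while iterating the API

removed = prune_stale_assets(asset_metadata, remote_paths)
# Only after this pruning would asset_metadata be dumped to .dandi/assets.json.
```

Pruning right before the dump means entries get cleaned up even when the corresponding files were already deleted by hand, which is the 000026 situation.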

yarikoptic commented 2 years ago

The commit message for such a change (if any asset is removed from asset_metadata) should include the number of such assets that were "garbage collected from local listing", or something like that.
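One possible way to build such a message is sketched below. The helper name `gc_commit_message` and the exact wording are assumptions for illustration only. The note is added only when at least one asset was pruned.

```python
def gc_commit_message(base_message: str, n_pruned: int) -> str:
    """Append a note about pruned listing entries to a commit message.

    Hypothetical helper: returns `base_message` unchanged when nothing
    was pruned, so ordinary commits keep their usual message.
    """
    if n_pruned <= 0:
        return base_message
    noun = "asset" if n_pruned == 1 else "assets"
    return f"{base_message}\n\n{n_pruned} {noun} garbage collected from local listing"


example_message = gc_commit_message("[backups2datalad] Sync assets", 3)
```

The base message used in the example is a placeholder, not the tool's real commit subject.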