Closed: yarikoptic closed this issue 1 year ago
@yarikoptic Looking at the Zarr backup in question, I see that 255 files are staged for removal. It's likely that some error in an earlier run cancelled the backup before the commit could take place, although I'm not entirely sure why syncing later didn't commit. Can you post the full logs from a run of the sync command for this Zarr?
The full log is /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log according to
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ grep -l '2022-11-18T19:09:29-0500' /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.1[78]*
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log
but the only errors are
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ grep Error /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log
2022-11-18T16:06:08-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/bca848f3-80a9-4a33-83de-01ea054bdb0b/info/ in 1.011912 seconds as it raised ConnectError:
2022-11-18T16:06:42-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/5f9a3846-a550-4b40-91db-bf3c554b9086/info/ in 1.036465 seconds as it raised ConnectError:
2022-11-18T16:15:26-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/aa43ba5a-4caf-4af4-9cc5-7c3d8a3c9a28/info/ in 1.992614 seconds as it raised RemoteProtocolError: Server disconnected without sending a response.
2022-11-18T16:15:27-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/53760ec9-bb3c-4211-b14a-4d73b9ae7def/info/ in 1.022897 seconds as it raised RemoteProtocolError: Server disconnected without sending a response.
but maybe the "not committed removals" happened even before that somehow... The prior log files mentioning that zarr were the following:
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/.git/dandi$ grep -l 2a613dd6-73c7-45f7-9b17-eae495c26277 /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.1[5678]*
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.15.19.19.19Z.log
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.16.18.39.33Z.log
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log
but they do not contain any errors (only the ones listed above, from the last log, are present):
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/.git/dandi$ grep -l 2a613dd6-73c7-45f7-9b17-eae495c26277 /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.1[5678]* | xargs grep Error
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log:2022-11-18T16:06:08-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/bca848f3-80a9-4a33-83de-01ea054bdb0b/info/ in 1.011912 seconds as it raised ConnectError:
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log:2022-11-18T16:06:42-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/5f9a3846-a550-4b40-91db-bf3c554b9086/info/ in 1.036465 seconds as it raised ConnectError:
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log:2022-11-18T16:15:26-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/aa43ba5a-4caf-4af4-9cc5-7c3d8a3c9a28/info/ in 1.992614 seconds as it raised RemoteProtocolError: Server disconnected without sending a response.
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log:2022-11-18T16:15:27-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/53760ec9-bb3c-4211-b14a-4d73b9ae7def/info/ in 1.022897 seconds as it raised RemoteProtocolError: Server disconnected without sending a response.
@yarikoptic I have discarded all uncommitted changes to the Zarr backup in question. The next run should properly update it. I have also created #308 to error if a Zarr is dirty when starting to sync it, like we do for Dandisets.
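The guard described above (erroring if a Zarr is dirty before syncing, as proposed in #308) can be sketched roughly like this; the helper names are hypothetical and the real backups2datalad check may differ, but the idea is simply to inspect `git status --porcelain` before touching the repository:

```python
import subprocess


def is_dirty(porcelain_output: str) -> bool:
    """Return True if `git status --porcelain` output shows any changes
    (staged, unstaged, or untracked files)."""
    return any(line.strip() for line in porcelain_output.splitlines())


def assert_clean(repo_path: str) -> None:
    """Raise RuntimeError if the repository at repo_path is dirty.
    Hypothetical helper; shown only to illustrate the proposed guard."""
    out = subprocess.run(
        ["git", "-C", repo_path, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    if is_dirty(out):
        raise RuntimeError(f"Zarr backup at {repo_path} is dirty; refusing to sync")
```

With a guard like this, a sync interrupted after staging deletions (as apparently happened here) would fail loudly on the next run instead of silently compounding the state.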
well, I merged PRs, reran the sync script
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ python tools/sync_zarrs_by_id.py 000108 /mnt/backup/dandi/dandizarrs 2a613dd6-73c7-45f7-9b17-eae495c26277
Working on 1 for dandiset 000108 under /mnt/backup/dandi/dandizarrs
SYNCING 2a613dd6-73c7-45f7-9b17-eae495c26277 (sub-MITU01/ses-20220318h15m33s47/micr/sub-MITU01_ses-20220318h15m33s47_sample-16_stain-LEC_run-1_chunk-7_SPIM.ome.zarr)
INFO:backups2datalad:255 files in local backup but no longer on server
INFO:backups2datalad:sync needed
...
INFO:backups2datalad:deleting 0/0/0/15/4/159
INFO:backups2datalad:finished deleting extra files
Traceback (most recent call last):
File "tools/sync_zarrs_by_id.py", line 49, in <module>
anyio.run(amain)
File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_eventloop.py", line 70, in run
return asynclib.run(func, *args, **backend_options)
File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 292, in run
return native_run(wrapper(), debug=debug)
File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 287, in wrapper
return await func(*args)
File "tools/sync_zarrs_by_id.py", line 35, in amain
await sync_zarr(
File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 533, in sync_zarr
await zsync.run()
File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 260, in run
raise RuntimeError(
RuntimeError: Zarr 2a613dd6-73c7-45f7-9b17-eae495c26277: local checksum '55ddd70fb60f8a5eff6d577e638fe5c5-36975--73839385272' differs from remote checksum '8bd6184d20431e29349bcb41261cb08f-37230--74226368012' after backup, and no change on server was detected
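As an aside, the checksum strings in this error appear to follow the pattern "md5-filecount--totalsize", and pulling them apart is informative: the remote file count exceeds the local one by exactly 255, matching the 255 stale entries the script keeps deleting. A small sketch (the field interpretation is inferred from the strings above, not from the checksum spec):

```python
from dataclasses import dataclass


@dataclass
class ZarrDigest:
    md5: str         # aggregated digest over the tree
    file_count: int  # number of entries
    total_size: int  # total bytes


def parse_zarr_checksum(s: str) -> ZarrDigest:
    """Split a checksum like '<md5>-<count>--<size>' into its parts
    (format as observed in the error message above)."""
    md5, rest = s.split("-", 1)
    count, size = rest.split("--")
    return ZarrDigest(md5, int(count), int(size))


local = parse_zarr_checksum("55ddd70fb60f8a5eff6d577e638fe5c5-36975--73839385272")
remote = parse_zarr_checksum("8bd6184d20431e29349bcb41261cb08f-37230--74226368012")
print(remote.file_count - local.file_count)  # → 255, the same 255 files
print(remote.total_size - local.total_size)
```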
and it is dirty now again
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ git -C ../dandizarrs/2a613dd6-73c7-45f7-9b17-eae495c26277 status | head
On branch draft
Your branch is up to date with 'github/draft'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
deleted: 0/0/0/15/12/105
deleted: 0/0/0/15/12/116
deleted: 0/0/0/15/12/124
deleted: 0/0/0/15/12/137
deleted: 0/0/0/15/12/140
so the issue was not resolved.
@yarikoptic It's dirty because the checksum mismatch error was raised before committing. Exactly what issue do you see remaining?
The issue is that the backup of that zarr fails. I see two possible reasons:
Is there any other -- please add it. In either case, please figure out which one is the cause of the issue and report it (e.g. against dandi-archive) or fix it.
@yarikoptic
zarr checksum in the archive is wrong
This is the issue. The files that the script keeps deleting from the Zarr backup are still registered in the Archive despite not being present on S3. I've reported this in https://github.com/dandi/dandi-archive/issues/1378.
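The check behind that report boils down to a set comparison between the entry paths the Archive registers for the Zarr and the keys actually present on S3. A minimal sketch of that comparison (the inputs here are hypothetical literals; the real script fetches them from the Archive API and the bucket listing):

```python
def find_discrepancies(archive_paths, s3_keys):
    """Compare entry paths registered in the Archive against keys actually
    present on S3; both arguments are iterables of path strings."""
    archive = set(archive_paths)
    s3 = set(s3_keys)
    return {
        "registered_but_missing_on_s3": sorted(archive - s3),
        "on_s3_but_not_registered": sorted(s3 - archive),
    }


# Illustrative only: one entry still registered despite being gone from S3.
d = find_discrepancies(
    ["0/0/0/15/12/105", "0/0/0/15/12/116"],
    ["0/0/0/15/12/116"],
)
print(d["registered_but_missing_on_s3"])  # → ['0/0/0/15/12/105']
```

Entries in the first bucket are exactly the kind the sync script kept deleting locally while the Archive kept counting them toward the remote checksum.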
THANK YOU!
That issue in dandi-archive was claimed to be addressed; I am running the backup of 000108 -- let's see how well it finishes up.
@jwodder -- please trace it down and report it (if it is still a dandi-archive issue) or fix it (if it is somehow in the backup script).
@yarikoptic I don't know what happened, but I was able to back up the Zarr successfully by cleaning the dataset and running sync_zarrs_by_id.py on it.
Ok. But we know that something is not "right" and we need to figure out what, since we keep hitting this issue. Should we log some extra information, or something else? Or could you look into the log that was produced for that run -- maybe you would spot some relevant information?
meanwhile -- I will see which other zarrs are dirty if any and reset them.
@yarikoptic Keep hitting what issue, exactly? There seems to be a different problem every time.
RuntimeError: Zarr 2a613dd6-73c7-45f7-9b17-eae495c26277: local checksum '55ddd70fb60f8a5eff6d577e638fe5c5-36975--73839385272' differs from remote checksum '8bd6184d20431e29349bcb41261cb08f-37230--74226368012' after backup, and no change on server was detected
now:
RuntimeError: Zarr 9fc86864-2ccf-428e-8605-7130425f223b: local checksum '68e0a6f07d4ed53cf473124c50c0e93c-26689--26322880109' differs from remote checksum '1a413201ba27b8bc991702b1f80ac8b4-26940--26481139599' after backup, and no change on server was detected
although it might indeed be different from the issue in the original filing.
@yarikoptic The issue in November was caused by a problem on the Archive's end that should now be resolved. Running the script from that issue on the most recent Zarr shows no Archive-S3 discrepancies.
Inspecting the logs for the failed run of the most recent Zarr, the only thing of note I see is that some entries which currently exist in the Zarr on the Archive (and which are now in the backup) make no appearance in the logs. My best guess is that these entries were added to the Zarr while the backup was in progress, leading to the checksum mismatch; but then the Zarr asset's modified timestamp should have been updated during the backup as well, yet the "no change on server was detected" in the error message indicates that it was not.
So that means that modified is not adjusted on dandi-archive while some changes happen to the zarr. I've filed https://github.com/dandi/dandi-archive/issues/1432 so we see it resolved there. But is there anything we can do here, maybe, to avoid reliance on modified in this scenario?
@yarikoptic We use the modified timestamp to check whether the Zarr changed while we were backing it up, and that influences whether we warn vs. crash if there's a mismatch between the local and remote Zarr checksums. Unless you want to change that behavior, I don't see any alternatives to relying on the modified timestamp.
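The decision described here can be summarized as a small pure function; the names and shape are hypothetical (the actual backups2datalad logic lives in zarr.py and is more involved), but this captures the warn-vs-crash branching:

```python
def mismatch_action(local_checksum: str, remote_checksum: str,
                    modified_before: str, modified_after: str) -> str:
    """Decide what to do after a sync, per the behavior described above.
    If the asset's modified timestamp changed during the backup, the Zarr
    was presumably edited mid-sync and a warning suffices; if it did not
    change yet the checksums differ, something is inconsistent: crash."""
    if local_checksum == remote_checksum:
        return "ok"
    if modified_before != modified_after:
        return "warn"   # server changed under us; the next sync should catch up
    return "crash"      # no change detected, yet checksums differ
```

The failure in this issue is the third branch: checksums differed while modified stayed constant, which is exactly why a stale modified timestamp on the server side turns an ordinary race into a hard error.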
Since we know that there is an issue with modified on the server side, let's do the following on a checksum mismatch with no modified difference: always warn, do not crash, but then do another run of syncing that zarr. Allow for up to e.g. 5 such runs, and if the 5th fails similarly, meaning that it is not something intermittent -- error out.
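That proposed policy is essentially a bounded retry loop. A sketch under the stated assumptions (sync_once is a hypothetical callable that raises RuntimeError on a checksum mismatch; the limit of 5 is the value suggested above, not necessarily what gets implemented):

```python
import logging

log = logging.getLogger("backups2datalad")

MAX_ATTEMPTS = 5  # value proposed in the comment above


def sync_with_retries(sync_once, max_attempts: int = MAX_ATTEMPTS) -> int:
    """Run sync_once up to max_attempts times; warn and retry on a
    mismatch, and only re-raise after the final attempt, i.e. when the
    failure is clearly not intermittent.  Returns the attempt number
    that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            sync_once()
            return attempt
        except RuntimeError as e:
            if attempt == max_attempts:
                raise
            log.warning("Attempt %d/%d failed (%s); retrying sync",
                        attempt, max_attempts, e)
    raise AssertionError("unreachable")
```

A transient race (entries added mid-backup) resolves on a later attempt; a persistent Archive-side inconsistency still surfaces as the original RuntimeError after the last retry.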
The backup of 000108 has failed with a similar RuntimeError. I verified that the remote checksum (returned by the API server for the zarr) indeed differs from the local (computed) one, and decided to sync explicitly, but it seems it is not doing a complete job: it says there are 255 local files to be removed, but then I guess nothing actually gets removed, since the next run says the same.