dandi / dandisets

735 Dandisets, 812.2 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

syncing of zarr seems to not remove some files #305

Closed yarikoptic closed 1 year ago

yarikoptic commented 1 year ago

The backup of 000108 has failed with a similar RuntimeError. I verified that the remote checksum (returned by the API server for the zarr) indeed differs from the local (computed) one, so I decided to sync explicitly. But the sync does not seem to do a complete job: it says that 255 local files are to be removed, yet I guess nothing gets removed, since the next run says the same.

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ python tools/sync_zarrs_by_id.py 000108 /mnt/backup/dandi/dandizarrs 2a613dd6-73c7-45f7-9b17-eae495c26277
...
RuntimeError: Zarr 2a613dd6-73c7-45f7-9b17-eae495c26277: local checksum '55ddd70fb60f8a5eff6d577e638fe5c5-36975--73839385272' differs from remote checksum '8bd6184d20431e29349bcb41261cb08f-37230--74226368012' after backup, and no change on server was detected
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ python tools/sync_zarrs_by_id.py 000108 /mnt/backup/dandi/dandizarrs 2a613dd6-73c7-45f7-9b17-eae495c26277
Working on 1 for dandiset 000108 under /mnt/backup/dandi/dandizarrs
SYNCING 2a613dd6-73c7-45f7-9b17-eae495c26277 (sub-MITU01/ses-20220318h15m33s47/micr/sub-MITU01_ses-20220318h15m33s47_sample-16_stain-LEC_run-1_chunk-7_SPIM.ome.zarr)
INFO:backups2datalad:255 files in local backup but no longer on server
INFO:backups2datalad:sync needed
...
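
The two checksums in the error can be compared field by field. A minimal sketch, assuming the checksum string has the form `<md5>-<file count>--<total size in bytes>` (an assumption inferred from the values above; `parse_zarr_checksum` and `ParsedChecksum` are hypothetical names, not part of backups2datalad):

```python
from typing import NamedTuple


class ParsedChecksum(NamedTuple):
    digest: str
    files: int
    size: int


def parse_zarr_checksum(checksum: str) -> ParsedChecksum:
    """Split '<md5>-<files>--<size>' into its parts (assumed format)."""
    head, size = checksum.rsplit("--", 1)
    digest, files = head.rsplit("-", 1)
    return ParsedChecksum(digest, int(files), int(size))


# Values copied from the RuntimeError above.
local = parse_zarr_checksum("55ddd70fb60f8a5eff6d577e638fe5c5-36975--73839385272")
remote = parse_zarr_checksum("8bd6184d20431e29349bcb41261cb08f-37230--74226368012")

extra_files = remote.files - local.files
extra_bytes = remote.size - local.size
print(f"remote has {extra_files} more files and {extra_bytes} more bytes than local")
```

Under that assumption, the remote checksum covers 255 more files than the local one, which is the same number the script reports as "in local backup but no longer on server".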
jwodder commented 1 year ago

@yarikoptic Looking at the Zarr backup in question, I see that 255 files are staged for removal. It's likely that some error in an earlier run cancelled the backup before the commit could take place, although I'm not entirely sure why syncing later didn't commit. Can you post the full logs from a run of the sync command for this Zarr?

yarikoptic commented 1 year ago
The original run of the backup which crashed with

```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ time python -m tools.backups2datalad -l INFO --backup-root /mnt/backup/dandi --config tools/backups2datalad.cfg.yaml update-from-backup --workers 2 000108
...
2022-11-18T19:09:27-0500 [INFO ] backups2datalad: Dandiset 000108: Zarr 79760495-3796-456d-a0a6-a626c8681723: 0/0/0/5/0/83: Syncing
2022-11-18T19:09:27-0500 [INFO ] backups2datalad: Dandiset 000108: Zarr 79760495-3796-456d-a0a6-a626c8681723: 0/0/0/5/0/83: Not in dataset; will add
2022-11-18T19:09:27-0500 [INFO ] backups2datalad: Dandiset 000108: Zarr 8054d554-4fdc-4dd4-932c-8f2fc15076a0: 0/0/0/1/1/10: Registering URL https://api.dandiarchive.org/api/zarr/8054d554-4fdc-4dd4-932c-8f2fc15076a0.zarr/0/0/0/1/1/10
2022-11-18T19:09:27-0500 [INFO ] backups2datalad: Dandiset 000108: Zarr 48d7ec90-ccc7-4254-afe9-cf5bbdde88ad: 0/0/0/1/0/97: Registering URL https://api.dandiarchive.org/api/zarr/48d7ec90-ccc7-4254-afe9-cf5bbdde88ad.zarr/0/0/0/1/0/97
2022-11-18T19:09:27-0500 [INFO ] backups2datalad: Dandiset 000108: Zarr a77e1f73-36f5-4321-b1cc-ed27349fd461: 0/0/0/0/1/103: Registering URL https://api.dandiarchive.org/api/zarr/a77e1f73-36f5-4321-b1cc-ed27349fd461.zarr/0/0/0/0/1/103
2022-11-18T19:09:29-0500 [INFO ] backups2datalad: Dandiset 000108: Zarr ca44da08-c398-4536-abe1-8a258d7898bc: 0/0/0/1/1/0: Registering URL https://dandiarchive.s3.amazonaws.com/zarr/ca44da08-c398-4536-abe1-8a258d7898bc/0/0/0/1/1/0?versionId=rCKPr0TGQs6X7kL0MxSFLaHv.fMk0X7K
2022-11-18T19:09:29-0500 [INFO ] backups2datalad: Dandiset 000108: Zarr 5a0d5fbf-d061-447b-a020-2e9f8e58ed7e: 0/0/0/0/1/103: Registering URL https://dandiarchive.s3.amazonaws.com/zarr/5a0d5fbf-d061-447b-a020-2e9f8e58ed7e/0/0/0/0/1/103?versionId=25Hm2MmnOIvACPHB_AUURqndtFbDGrcI
whereis: 229 failed
whereis: 224 failed
whereis: 233 failed
whereis: 228 failed
whereis: 208 failed
whereis: 226 failed
whereis: 233 failed
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/__main__.py", line 499, in <module>
    main(_anyio_backend="asyncio")
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 1157, in __call__
    return anyio.run(self._main, main, args, kwargs, **opts)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_eventloop.py", line 70, in run
    return asynclib.run(func, *args, **backend_options)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 292, in run
    return native_run(wrapper(), debug=debug)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 287, in wrapper
    return await func(*args)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 1160, in _main
    return await main(*args, **kwargs)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 1076, in main
    rv = await self.invoke(ctx)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 1687, in invoke
    return await _process_result(await sub_ctx.command.invoke(sub_ctx))
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 1434, in invoke
    return await ctx.invoke(self.callback, **ctx.params)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 780, in invoke
    rv = await rv
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/__main__.py", line 185, in update_from_backup
    await datasetter.update_from_backup(dandisets, exclude=exclude)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/datasetter.py", line 78, in update_from_backup
    report = await pool_amap(
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/aioutil.py", line 184, in pool_amap
    await sender.send(item)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/aioutil.py", line 170, in dowork
    outp = await func(inp)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/datasetter.py", line 145, in update_dandiset
    changed = await self.sync_dataset(dandiset, ds, dmanager)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/datasetter.py", line 188, in sync_dataset
    await syncer.sync_assets(error_on_change)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/syncer.py", line 36, in sync_assets
    self.report = await async_assets(
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/asyncer.py", line 500, in async_assets
    nursery.start_soon(dm.read_addurl)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 660, in __aexit__
    raise ExceptionGroup(exceptions)
anyio._backends._asyncio.ExceptionGroup: 3 exceptions were raised in the task group:
----------------------------
Traceback (most recent call last):
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 528, in sync_zarr
    await zsync.run()
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 260, in run
    raise RuntimeError(
RuntimeError: Zarr 2a613dd6-73c7-45f7-9b17-eae495c26277: local checksum '55ddd70fb60f8a5eff6d577e638fe5c5-36975--73839385272' differs from remote checksum '8bd6184d20431e29349bcb41261cb08f-37230--74226368012' after backup, and no change on server was detected
----------------------------
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_subprocesses.py", line 83, in run_process
    await process.wait()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 733, in task_done
    exc = _task.exception()
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 554, in sync_zarr
    stats = await ds.get_stats(config=manager.config)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/adataset.py", line 357, in get_stats
    stored_stats = await self.get_stored_stats()
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/adataset.py", line 406, in get_stored_stats
    if stored_commit == await self.get_commit_hash():
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/adataset.py", line 416, in get_commit_hash
    return await self.read_git("show", "-s", "--format=%H")
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/adataset.py", line 120, in read_git
    return await areadcmd("git", *args, cwd=self.path, **kwargs)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/aioutil.py", line 210, in areadcmd
    r = await aruncmd(*args, **kwargs)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/aioutil.py", line 204, in aruncmd
    return await anyio.run_process(argstrs, **kwargs)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_subprocesses.py", line 85, in run_process
    process.kill()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 1056, in kill
    self._process.kill()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/subprocess.py", line 144, in kill
    self._transport.kill()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_subprocess.py", line 153, in kill
    self._check_proc()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_subprocess.py", line 142, in _check_proc
    raise ProcessLookupError()
ProcessLookupError
----------------------------
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_subprocesses.py", line 83, in run_process
    await process.wait()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 733, in task_done
    exc = _task.exception()
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 556, in sync_zarr
    link.commit_hash = await ds.get_commit_hash()
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/adataset.py", line 416, in get_commit_hash
    return await self.read_git("show", "-s", "--format=%H")
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/adataset.py", line 120, in read_git
    return await areadcmd("git", *args, cwd=self.path, **kwargs)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/aioutil.py", line 210, in areadcmd
    r = await aruncmd(*args, **kwargs)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/aioutil.py", line 204, in aruncmd
    return await anyio.run_process(argstrs, **kwargs)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_subprocesses.py", line 85, in run_process
    process.kill()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 1056, in kill
    self._process.kill()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/subprocess.py", line 144, in kill
    self._transport.kill()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_subprocess.py", line 153, in kill
    self._check_proc()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_subprocess.py", line 142, in _check_proc
    raise ProcessLookupError()
ProcessLookupError
Exception ignored in:
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/unix_events.py", line 536, in close
    self._close(None)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/unix_events.py", line 560, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_events.py", line 719, in call_soon
    self._check_closed()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

real 223m38.682s
user 159m49.706s
sys 16m23.262s
```

is /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log according to

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ grep -l '2022-11-18T19:09:29-0500' /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.1[78]*
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log

but the only errors are

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ grep Error /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log
2022-11-18T16:06:08-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/bca848f3-80a9-4a33-83de-01ea054bdb0b/info/ in 1.011912 seconds as it raised ConnectError: 
2022-11-18T16:06:42-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/5f9a3846-a550-4b40-91db-bf3c554b9086/info/ in 1.036465 seconds as it raised ConnectError: 
2022-11-18T16:15:26-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/aa43ba5a-4caf-4af4-9cc5-7c3d8a3c9a28/info/ in 1.992614 seconds as it raised RemoteProtocolError: Server disconnected without sending a response.
2022-11-18T16:15:27-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/53760ec9-bb3c-4211-b14a-4d73b9ae7def/info/ in 1.022897 seconds as it raised RemoteProtocolError: Server disconnected without sending a response.

but maybe the "not committed removals" somehow happened even before that... the prior log files mentioning that zarr were the following

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/.git/dandi$ grep -l 2a613dd6-73c7-45f7-9b17-eae495c26277 /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.1[5678]*
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.15.19.19.19Z.log
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.16.18.39.33Z.log
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log

but they do not have any Errors (only the ones I listed above for the last one are there):

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/.git/dandi$ grep -l 2a613dd6-73c7-45f7-9b17-eae495c26277 /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.1[5678]* | xargs grep Error
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log:2022-11-18T16:06:08-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/bca848f3-80a9-4a33-83de-01ea054bdb0b/info/ in 1.011912 seconds as it raised ConnectError: 
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log:2022-11-18T16:06:42-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/5f9a3846-a550-4b40-91db-bf3c554b9086/info/ in 1.036465 seconds as it raised ConnectError: 
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log:2022-11-18T16:15:26-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/aa43ba5a-4caf-4af4-9cc5-7c3d8a3c9a28/info/ in 1.992614 seconds as it raised RemoteProtocolError: Server disconnected without sending a response.
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2022.11.18.20.26.11Z.log:2022-11-18T16:15:27-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/53760ec9-bb3c-4211-b14a-4d73b9ae7def/info/ in 1.022897 seconds as it raised RemoteProtocolError: Server disconnected without sending a response.
jwodder commented 1 year ago

@yarikoptic I have discarded all uncommitted changes to the Zarr backup in question. The next run should properly update it. I have also created #308 to error if a Zarr is dirty when starting to sync it, like we do for Dandisets.
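
A dirty-state check of the kind described for #308 could be sketched as follows. This is only an illustration, not the code from that PR: `read_status`, `is_dirty`, and `staged_deletions` are hypothetical helpers, and the sketch relies only on the documented `git status --porcelain` (v1) output, where each line is two status columns, a space, and the path.

```python
import subprocess
from typing import List


def read_status(repo: str) -> str:
    """Run `git status --porcelain` in `repo` and return its raw output."""
    return subprocess.run(
        ["git", "-C", repo, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout


def is_dirty(porcelain: str) -> bool:
    """Any porcelain output (staged, unstaged, or untracked) counts as dirty here."""
    return bool(porcelain.strip())


def staged_deletions(porcelain: str) -> List[str]:
    """Return paths whose deletion is staged in the index ('D' in column 1)."""
    paths = []
    for line in porcelain.splitlines():
        if len(line) > 3 and line[0] == "D":
            paths.append(line[3:])
    return paths


# Hand-written sample in porcelain v1 format (illustrative, not captured output).
sample = "D  0/0/0/15/12/105\nD  0/0/0/15/12/116\n M README.md\n"
print(is_dirty(sample), staged_deletions(sample))
```

In real use, `read_status` would be called on the Zarr backup directory before syncing, and the sync would be aborted if `is_dirty` returns true.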

yarikoptic commented 1 year ago

well, I merged PRs, reran the sync script

$ (dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ python tools/sync_zarrs_by_id.py 000108 /mnt/backup/dandi/dandizarrs 2a613dd6-73c7-45f7-9b17-eae495c26277
Working on 1 for dandiset 000108 under /mnt/backup/dandi/dandizarrs
SYNCING 2a613dd6-73c7-45f7-9b17-eae495c26277 (sub-MITU01/ses-20220318h15m33s47/micr/sub-MITU01_ses-20220318h15m33s47_sample-16_stain-LEC_run-1_chunk-7_SPIM.ome.zarr)
INFO:backups2datalad:255 files in local backup but no longer on server
INFO:backups2datalad:sync needed
...
INFO:backups2datalad:deleting 0/0/0/15/4/159
INFO:backups2datalad:finished deleting extra files
Traceback (most recent call last):
  File "tools/sync_zarrs_by_id.py", line 49, in <module>
    anyio.run(amain)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_eventloop.py", line 70, in run
    return asynclib.run(func, *args, **backend_options)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 292, in run
    return native_run(wrapper(), debug=debug)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 287, in wrapper
    return await func(*args)
  File "tools/sync_zarrs_by_id.py", line 35, in amain
    await sync_zarr(
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 533, in sync_zarr
    await zsync.run()
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 260, in run
    raise RuntimeError(
RuntimeError: Zarr 2a613dd6-73c7-45f7-9b17-eae495c26277: local checksum '55ddd70fb60f8a5eff6d577e638fe5c5-36975--73839385272' differs from remote checksum '8bd6184d20431e29349bcb41261cb08f-37230--74226368012' after backup, and no change on server was detected

and it is dirty now again


(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ git -C ../dandizarrs/2a613dd6-73c7-45f7-9b17-eae495c26277 status | head
On branch draft
Your branch is up to date with 'github/draft'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        deleted:    0/0/0/15/12/105
        deleted:    0/0/0/15/12/116
        deleted:    0/0/0/15/12/124
        deleted:    0/0/0/15/12/137
        deleted:    0/0/0/15/12/140

so the issue was not resolved.

jwodder commented 1 year ago

@yarikoptic It's dirty because the checksum mismatch error was raised before committing. Exactly what issue do you see remaining?

yarikoptic commented 1 year ago

the issue that backup of that zarr fails. I see two possible reasons:

If there are any others, please add them. In either case, please figure out which one is the cause of the issue, and report it (e.g. against dandi-archive) or fix it.

jwodder commented 1 year ago

@yarikoptic

> zarr checksum in the archive is wrong

This is the issue. The files that the script keeps deleting from the Zarr backup are still registered in the Archive despite not being present on S3. I've reported this in https://github.com/dandi/dandi-archive/issues/1378.
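
The discrepancy described here amounts to a set difference between two listings. A minimal sketch, assuming both listings are already available as collections of entry paths (how they are fetched from the Archive API and from S3 is out of scope; `registered_but_missing` and the sample paths are hypothetical):

```python
from typing import Iterable, Set


def registered_but_missing(archive_entries: Iterable[str], s3_keys: Iterable[str]) -> Set[str]:
    """Entries the Archive still lists for a Zarr that are absent from S3."""
    return set(archive_entries) - set(s3_keys)


# Made-up listings, purely for illustration.
archive_listing = ["0/0/0", "0/0/1", "0/0/2"]
s3_listing = ["0/0/0", "0/0/2"]
print(sorted(registered_but_missing(archive_listing, s3_listing)))
```

A non-empty result would explain a remote checksum that counts more files than a backup built from what is actually downloadable.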

yarikoptic commented 1 year ago

THANK YOU!

yarikoptic commented 1 year ago

That issue in dandi-archive was reported as addressed. I am running the backup of 000108 now, let's see how well it finishes.

yarikoptic commented 1 year ago
the most recent run of the backup for 000108 failed similarly

```shell
whereis: 6791 failed
whereis: 25475 failed
whereis: 7376 failed
2023-01-10T02:04:16-0500 [ERROR ] backups2datalad: Job failed on input :
Traceback (most recent call last):
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/aioutil.py", line 168, in dowork
    outp = await func(inp)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/datasetter.py", line 145, in update_dandiset
    changed = await self.sync_dataset(dandiset, ds, dmanager)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/datasetter.py", line 188, in sync_dataset
    await syncer.sync_assets(error_on_change)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/syncer.py", line 36, in sync_assets
    self.report = await async_assets(
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/asyncer.py", line 500, in async_assets
    nursery.start_soon(dm.read_addurl)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 537, in sync_zarr
    await zsync.run()
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 264, in run
    raise RuntimeError(
RuntimeError: Zarr 9fc86864-2ccf-428e-8605-7130425f223b: local checksum '68e0a6f07d4ed53cf473124c50c0e93c-26689--26322880109' differs from remote checksum '1a413201ba27b8bc991702b1f80ac8b4-26940--26481139599' after backup, and no change on server was detected
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/__main__.py", line 513, in <module>
    main(_anyio_backend="asyncio")
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 1157, in __call__
    return anyio.run(self._main, main, args, kwargs, **opts)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_eventloop.py", line 70, in run
    return asynclib.run(func, *args, **backend_options)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 292, in run
    return native_run(wrapper(), debug=debug)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 287, in wrapper
    return await func(*args)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 1160, in _main
    return await main(*args, **kwargs)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 1076, in main
    rv = await self.invoke(ctx)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 1687, in invoke
    return await _process_result(await sub_ctx.command.invoke(sub_ctx))
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 1434, in invoke
    return await ctx.invoke(self.callback, **ctx.params)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/asyncclick/core.py", line 780, in invoke
    rv = await rv
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/__main__.py", line 186, in update_from_backup
    await datasetter.update_from_backup(dandisets, exclude=exclude)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/datasetter.py", line 97, in update_from_backup
    raise RuntimeError(
RuntimeError: Backups for 1 Dandiset failed

real 5272m47.292s
user 1594m57.118s
sys 1761m47.533s
```

@jwodder, please trace it down and report it (if it is still a dandi-archive issue) or address it (if it is somehow caused by the backup script).

jwodder commented 1 year ago

@yarikoptic I don't know what happened, but I was able to back up the Zarr successfully by cleaning the dataset and running sync_zarrs_by_id.py on it.

yarikoptic commented 1 year ago

Ok. But we know that something is not "right", and we need to figure out what, since we keep hitting this issue. Should we log some extra information, or something else? Or could you look into the log that was produced for that run? Maybe you would spot some relevant information.

meanwhile -- I will see which other zarrs are dirty if any and reset them.

jwodder commented 1 year ago

@yarikoptic Keep hitting what issue, exactly? There seems to be a different problem every time.

yarikoptic commented 1 year ago

in November above

RuntimeError: Zarr 2a613dd6-73c7-45f7-9b17-eae495c26277: local checksum '55ddd70fb60f8a5eff6d577e638fe5c5-36975--73839385272' differs from remote checksum '8bd6184d20431e29349bcb41261cb08f-37230--74226368012' after backup, and no change on server was detected

now:

RuntimeError: Zarr 9fc86864-2ccf-428e-8605-7130425f223b: local checksum '68e0a6f07d4ed53cf473124c50c0e93c-26689--26322880109' differs from remote checksum '1a413201ba27b8bc991702b1f80ac8b4-26940--26481139599' after backup, and no change on server was detected

although this might indeed be different from the issue described in the original title.

jwodder commented 1 year ago

@yarikoptic The issue in November was caused by a problem on the Archive's end that should now be resolved. Running the script from that issue on the most recent Zarr shows no Archive-S3 discrepancies.

Inspecting the logs for the failed run of the most recent Zarr, the only thing of note I see is that some entries which currently exist in the Zarr on the Archive (and which are now in the backup) make no appearance in the logs. My best guess is that these entries were added to the Zarr while the backup was in progress, leading to the checksum mismatch, but then the Zarr asset's modified timestamp should have been updated during the backup as well, yet the "no change on server was detected" in the error message indicates that it was not.

yarikoptic commented 1 year ago

So that means that modified is not adjusted on dandi-archive while some changes happen to the zarr. I've filed https://github.com/dandi/dandi-archive/issues/1432 so we see it resolved there. But is there anything we can do here to avoid relying on modified in this scenario?

jwodder commented 1 year ago

@yarikoptic We use the modified timestamp to check whether the Zarr changed while we were backing it up, and that influences whether we warn vs. crash if there's a mismatch between the local and remote Zarr checksums. Unless you want to change that behavior, I don't see any alternatives to relying on the modified timestamp.
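
The warn-versus-crash behavior described above can be sketched as a single decision function. This is a simplified illustration under stated assumptions, not the actual code in `zarr.py`: `verify_backup` and its parameters are hypothetical, and the two timestamps stand for the Zarr asset's modified value read before and after the backup.

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger("zarr-sync-sketch")


def verify_backup(zarr_id: str, local: str, remote: str,
                  modified_before: datetime, modified_after: datetime) -> bool:
    """Return True on a checksum match; warn and return False if the Zarr was
    modified on the server during the backup; raise if it apparently was not."""
    if local == remote:
        return True
    if modified_after != modified_before:
        # The server reported a change, so a mismatch is expected; retry later.
        log.warning("Zarr %s: checksum mismatch, but Zarr changed on server during backup", zarr_id)
        return False
    raise RuntimeError(
        f"Zarr {zarr_id}: local checksum {local!r} differs from remote checksum"
        f" {remote!r} after backup, and no change on server was detected"
    )


t0 = datetime(2023, 1, 10, 7, 0, tzinfo=timezone.utc)
t1 = datetime(2023, 1, 10, 7, 5, tzinfo=timezone.utc)
print(verify_backup("example", "abc-1--10", "def-2--20", t0, t1))
```

The sketch makes the dependency explicit: if the server changes the Zarr without updating modified, the third branch is taken even though the mismatch has a benign cause.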

yarikoptic commented 1 year ago

Since we know that there is an issue with modified on the server side, let's