dandi / dandisets

735 Dandisets, 812.2 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

tools/sync_zarrs_by_id.py doesn't commit modified .dandi/s3sync.json #321

Closed yarikoptic closed 1 year ago

yarikoptic commented 1 year ago

We kept ending up with dirty zarrs, so I ran

grep dirty /tmp/dirty-zarrs-4.log | head -n 4 | awk '{print $1;}'  | sort | uniq | sed -e 's,^\./,,g' | xargs  python tools/sync_zarrs_by_id.py 000108 /mnt/backup/dandi/dandizarrs/

to sync just 4 of them, and apparently all 4 ended up dirty, with only "modified: .dandi/s3sync.json". I guess it might be a manifestation of replacing datalad save with a commit at some point in the past to overcome some limitations....

Edit: note that the change is staged for commit, just never committed.
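
A minimal sketch for spotting this state from a script, assuming the zarr repository is a plain git checkout. It parses `git status --porcelain` (v1 format) and reports paths that have staged changes; the helper names `staged_paths` and `staged_in_repo` are hypothetical and not part of backups2datalad.

```python
import subprocess


def staged_paths(porcelain: str) -> list[str]:
    """Return paths whose index (staged) status is set in porcelain v1 output.

    Each line is "XY <path>": X is the index status, Y the worktree status.
    Paths with unusual characters may be quoted by git; that is not handled here.
    """
    paths = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue
        index_status, path = line[0], line[3:]
        # " " = nothing staged, "?" = untracked, "!" = ignored
        if index_status not in (" ", "?", "!"):
            paths.append(path)
    return paths


def staged_in_repo(repo_dir: str) -> list[str]:
    """Run git in repo_dir and list the staged-but-uncommitted paths."""
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return staged_paths(out)
```

For the zarrs described here, one would expect `staged_in_repo(...)` to return only `.dandi/s3sync.json`.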

yarikoptic commented 1 year ago

A bit more information: the log for one such backup looks like

SYNCING 16f462a3-bf33-44f7-9611-a9ed5aa172a2 (sub-SChmi53/ses-20220916h10m54s03/micr/sub-SChmi53_ses-20220916h10m54s03_sample-SChmi53_32_stain-LEC_run-1_chunk-4_SPIM.ome.zarr)
INFO:backups2datalad:.zattrs was modified on server at 2023-01-15 04:01:20+00:00, after last sync at 2022-12-10 09:26:45+00:00
INFO:backups2datalad:sync needed
INFO:backups2datalad:deleting extra files
INFO:backups2datalad:finished deleting extra files
add .dandi/s3sync.json (non-large file; adding content to git repository) ok
(recording state in git...)
INFO:backups2datalad:No changes to zarr content, some other changes; committing
[draft 80ebea27e] [backups2datalad] No changes to zarr content, some other changes
 Author: DANDI User <info@dandiarchive.org>
 1 file changed, 1 insertion(+), 1 deletion(-)
Enumerating objects: 110292, done.
Counting objects: 100% (110292/110292), done.
Delta compression using up to 4 threads
Compressing objects: 100% (85103/85103), done.
Writing objects: 100% (110292/110292), done.
Total 110292 (delta 25145), reused 110285 (delta 25142), pack-reused 0

So .zattrs was reported as modified, but nothing besides the timestamp change was committed. Looking at the history of that file:

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/tools$ datalad ls -L --list-content md5 s3://dandiarchive/zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs
Connecting to bucket: dandiarchive
[INFO   ] S3 session: Connecting to the bucket dandiarchive with authentication
Bucket info:
  Versioning: {'Versioning': 'Enabled'}
     Website: dandiarchive.s3-website-us-east-1.amazonaws.com
         ACL: <Policy: None (owner) = FULL_CONTROL>
zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs 2023-01-15T04:01:20.000Z 8342 ver:hb8753SqXRIxkGOeVMJcwdZjB1eRGkdU  acl:<Policy: None (owner) = FULL_CONTROL>  http://dandiarchive.s3.amazonaws.com/zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs?versionId=hb8753SqXRIxkGOeVMJcwdZjB1eRGkdU [OK] a068da497ed0a3bacd2c531dea59fc2b
zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs 2023-01-14T21:20:31.000Z 8342 ver:g4Os5bdDJrHAAqylFrPT8Qyp_hXuqStI  acl:<Policy: None (owner) = FULL_CONTROL>  http://dandiarchive.s3.amazonaws.com/zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs?versionId=g4Os5bdDJrHAAqylFrPT8Qyp_hXuqStI [OK] a068da497ed0a3bacd2c531dea59fc2b
zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs 2023-01-05T15:06:34.000Z 8342 ver:jxMR68khGs_pQJteb8gU3z0IYYfZWM5V  acl:<Policy: None (owner) = FULL_CONTROL>  http://dandiarchive.s3.amazonaws.com/zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs?versionId=jxMR68khGs_pQJteb8gU3z0IYYfZWM5V [OK] a068da497ed0a3bacd2c531dea59fc2b
zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs 2022-12-27T02:50:19.000Z 8342 ver:QoXhc09PE.Dgbj5xDep_T7lOQkJttSBg  acl:<Policy: None (owner) = FULL_CONTROL>  http://dandiarchive.s3.amazonaws.com/zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs?versionId=QoXhc09PE.Dgbj5xDep_T7lOQkJttSBg [OK] a068da497ed0a3bacd2c531dea59fc2b
zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs 2022-12-10T07:41:21.000Z 8342 ver:Xp1phdpHXWl6RatWNkgo4fGrdzb_RMp8  acl:<Policy: None (owner) = FULL_CONTROL>  http://dandiarchive.s3.amazonaws.com/zarr/16f462a3-bf33-44f7-9611-a9ed5aa172a2/.zattrs?versionId=Xp1phdpHXWl6RatWNkgo4fGrdzb_RMp8 [OK] a068da497ed0a3bacd2c531dea59fc2b
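
As a rough cross-check of the listing above, the sketch below compares the md5 of the newest version against the md5 of the version that was current at the last sync. The timestamps and checksums are copied from the listing (with `Z` rewritten as `+00:00`), and the last sync time comes from the backups2datalad log. The function name `content_changed_since` is hypothetical, and it assumes the last sync captured whichever version was newest at that moment.

```python
from datetime import datetime, timezone

# (last_modified, md5) pairs copied from the `datalad ls` listing
VERSIONS = [
    ("2023-01-15T04:01:20+00:00", "a068da497ed0a3bacd2c531dea59fc2b"),
    ("2023-01-14T21:20:31+00:00", "a068da497ed0a3bacd2c531dea59fc2b"),
    ("2023-01-05T15:06:34+00:00", "a068da497ed0a3bacd2c531dea59fc2b"),
    ("2022-12-27T02:50:19+00:00", "a068da497ed0a3bacd2c531dea59fc2b"),
    ("2022-12-10T07:41:21+00:00", "a068da497ed0a3bacd2c531dea59fc2b"),
]

# "last sync at 2022-12-10 09:26:45+00:00" from the log
LAST_SYNC = datetime(2022, 12, 10, 9, 26, 45, tzinfo=timezone.utc)


def content_changed_since(versions, last_sync):
    """True if the newest version's md5 differs from the one current at last_sync."""
    parsed = sorted((datetime.fromisoformat(ts), md5) for ts, md5 in versions)
    synced = [md5 for ts, md5 in parsed if ts <= last_sync]
    if not synced:
        # Nothing existed at sync time, so anything present now is new content.
        return True
    return parsed[-1][1] != synced[-1]


print(content_changed_since(VERSIONS, LAST_SYNC))
```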

We have many versions with the same content! So indeed there were no changes; the file was just being re-uploaded over and over ... hm, maybe we should prevent re-uploading when the content is known to match, based on local md5 == ETag?
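
A minimal sketch of that check, not the actual uploader code: `local_md5` and `needs_upload` are hypothetical helpers, and the caller is assumed to have already fetched the remote ETag by some means (for example a HEAD request). The comparison is only meaningful for single-part uploads without SSE-KMS encryption, where S3 uses the object's MD5 digest as the ETag. Multipart ETags carry a `-N` suffix and are treated conservatively as "unknown, upload anyway".

```python
import hashlib
from typing import Optional


def local_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a local file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(path: str, remote_etag: Optional[str]) -> bool:
    """Decide whether a local file should be (re)uploaded.

    remote_etag is the ETag reported by S3, or None if the object is absent.
    """
    if remote_etag is None:
        return True
    etag = remote_etag.strip('"').lower()  # S3 wraps ETags in double quotes
    if "-" in etag:
        # Multipart upload: the ETag is not a plain MD5, so we cannot compare.
        return True
    return local_md5(path) != etag
```

Skipping the upload when `needs_upload` returns False would avoid creating a new S3 version (and a new modification timestamp) for unchanged content.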