Open yarikoptic opened 9 months ago
and there is a full log : http://www.oneukrainian.com/tmp/check_zarr_paths.log containing 395 records like
❯ head check_zarr_paths.log
00c111e3-3dac-4e4e-9ca0-1ee1a1a67059/: has name='sub-SChmi53/ses-20220923h13m25s47/micr/sub-SChmi53_ses-20220923h13m25s47_sample-SChmi53_28_stain-NN_run-1_chunk-5_SPIM.ome.zarr' but there is no asset at that path in 000108
00d8ce95-90d3-4fd0-a8b0-61e92d985720/: asset for the name='sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-15_stain-LEC_run-1_chunk-6_SPIM.ome.zarr' has asset zarr '5bf2bd49-490e-4212-bd13-dca3f3399d65' in 000108
012511dd-72e1-4e28-80b4-8dedbeaf8462/: has name='sub-SChmi53/ses-20220916h10m54s03/micr/sub-SChmi53_ses-20220916h10m54s03_sample-SChmi53_32_stain-LEC_run-1_chunk-5_SPIM.ome.zarr' but there is no asset at that path in 000108
01911d14-a272-49d2-8556-b1333e5ef748/: has name='sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-SChmi53_15_stain-YO_run-1_chunk-8_SPIM.ome.zarr' but there is no asset at that path in 000108
03cf98fe-0e9c-4180-bfa6-f1b4ef9471d3/: has name='sub-MITU01/ses-20220309h18m20s33/micr/sub-MITU01h3_ses-20220309h18m20s33_sample-17_stain-NN_run-1_chunk-6_SPIM.ome.zarr' but there is no asset at that path in 000108
03e43250-6de1-4e7a-aa5c-81f2e9662a2e/: has name='sub-SChmi53/ses-20220927h15m28s55/micr/sub-SChmi53_ses-20220927h15m28s55_sample-SChmi53_28_stain-YO_run-1_chunk-1_SPIM.ome.zarr' but there is no asset at that path in 000108
03f822dc-9f10-4454-bedf-0919cfb83a20/: has name='sub-SChmi53/ses-20220927h13m20s15/micr/sub-SChmi53_ses-20220927h13m20s15_sample-SChmi53_25_stain-LEC_run-1_chunk-5_SPIM.ome.zarr' but there is no asset at that path in 000108
0587160a-c8d3-4467-9a84-c8a5cc89abad/: has name='sub-SChmi53/ses-20220923h11m36s19/micr/sub-SChmi53_ses-20220923h11m36s19_sample-SChmi53_24_stain-NN_run-1_chunk-4_SPIM.ome.zarr' but there is no asset at that path in 000108
0592459a-f9a8-4631-a206-89bdfa1803fa/: has name='sub-SChmi53/ses-20220927h13m20s15/micr/sub-SChmi53_ses-20220927h13m20s15_sample-SChmi53_25_stain-YO_run-1_chunk-1_SPIM.ome.zarr' but there is no asset at that path in 000108
05a3eaef-f36a-4af4-9840-6988bb839e4b/: has name='sub-SChmi53/ses-20220114h16m04s17/micr/sub-SChmi53_ses-20220114h16m04s17_sample-SChmi53_9_stain-LEC_run-1_chunk-1_SPIM.ome.zarr' but there is no asset at that path in 000108
note: backref
Zarrs do in fact have a modified
field, although I don't think it's really used at the moment. Fixing the use of that field is part of a general refactor of the code base which we'd like to do, but that's a bit of an aside. So to answer your question, once we fix our use of that field, we would probably have the info we need to GC abandoned zarr uploads.
More generally, I think a reasonable approach would be to give some timeout to PENDING
zarrs - something 10 days - at which point if no action has been taken, an email could be sent to the owner of the dandiset which it belongs to, notifying them that the zarr will be garbage collected if no action is taken within the next x days.
sounds good! I guess it would make sense to expose that modified
field within the zarr record as a first step for it to become used -- I would then immediately assess on which zarrs without checksum are too old and we could interim contact owner manually to confirm that we could remove them.
While preparing perspective manifests for zarrs I encountered the case where computed checksum didn't correspond to the one in archive (None):
and indeed for that zarr we have it in
Pending
and no checksumFor that dandiset path we have another zarr listed (albeit with a different "name", but that is irrelevant -- click to expand more on that rabbit hole)
so I decided to look on what do we have for that dandiset path ```shell ❯ curl --silent 'https://api.dandiarchive.org/api/dandisets/000108/versions/draft/assets/?path=sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-15_stain-LEC_run-1_chunk-6_SPIM.ome.zarr&metadata=false' | jq . { "count": 1, "next": null, "previous": null, "results": [ { "asset_id": "e49b0e5d-16d2-49a8-9af7-f93398f0c672", "blob": null, "zarr": "5bf2bd49-490e-4212-bd13-dca3f3399d65", "path": "sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-15_stain-LEC_run-1_chunk-6_SPIM.ome.zarr", "size": 38783087406, "created": "2023-02-03T21:16:24.990201Z", "modified": "2023-02-07T16:57:59.631025Z" } ] } ``` so -- we are having a different zarr id for that path. Ok, I thought it might be just a matter of an unfinished zarr upload which we should GC (note -- it seems we do not have any time stamp for zarr record to judge how recent it is). Then I looked at that zarr returned above: ```shell ❯ curl --silent 'https://api.dandiarchive.org/api/zarr/5bf2bd49-490e-4212-bd13-dca3f3399d65/' | jq . { "name": "sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-SChmi53_15_stain-LEC_run-1_chunk-6_SPIM.ome.zarr", "dandiset": "000108", "zarr_id": "5bf2bd49-490e-4212-bd13-dca3f3399d65", "status": "Complete", "checksum": "28e105c439865e01b19831be34972fd4-28498--38783087406", "file_count": 28498, "size": 38783087406 } ``` and spotted that the `path` (well -- "name") there is different -- has `_sample-SChmi53_15` not `_sample-15`. - `name` value in the zarr could differ from the `path` in the asset, so we have data redundancy/duplication and thus need extra checks -- thus [another candidate](https://github.com/search?q=repo%3Adandi%2Fdandi-archive+fsck&type=issues) for `fsck` style of verification. But I guess it is all not needed since it is just "name" and not "path", and we even have ongoing (slowly) discussion to get rid of that name - #1141Looking at zarr record here it seems that we do not have sufficient metadata to judge if such zarr upload should be considered "stale" and be GC'ed, and that zarr be removed from DB and S3 after we verify (how??? expand above on "name" vs "path") that this particular zarr is not used in any asset. Should we add some
"modified"
or"created"
field?