dandi / dandi-archive

DANDI API server and Web app
https://dandiarchive.org
13 stars 12 forks source link

Do we have enough information to GC abandoned Zarr upload attempt? #1842

Open yarikoptic opened 9 months ago

yarikoptic commented 9 months ago

While preparing perspective manifests for zarrs I encountered the case where computed checksum didn't correspond to the one in archive (None):

yoh@typhon:~/proj/dandi/zarr-manifests$ grep Checksum.*!= manifests-v2/00*.*.json
manifests-v2/00d8ce95-90d3-4fd0-a8b0-61e92d985720.versionid.json:  "zarrChecksum": "f632ce47f2798ed6195fe18d434c07c8-14280--19362514214 != None"

and indeed for that zarr we have it in Pending and no checksum

❯ curl --silent 'https://api.dandiarchive.org/api/zarr/00d8ce95-90d3-4fd0-a8b0-61e92d985720/' | jq .
{
  "name": "sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-15_stain-LEC_run-1_chunk-6_SPIM.ome.zarr",
  "dandiset": "000108",
  "zarr_id": "00d8ce95-90d3-4fd0-a8b0-61e92d985720",
  "status": "Pending",
  "checksum": null,
  "file_count": 0,
  "size": 0
}
For that dandiset path we have another zarr listed (albeit with a different "name", but that is irrelevant -- click to expand more on that rabbit hole) so I decided to look on what do we have for that dandiset path ```shell ❯ curl --silent 'https://api.dandiarchive.org/api/dandisets/000108/versions/draft/assets/?path=sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-15_stain-LEC_run-1_chunk-6_SPIM.ome.zarr&metadata=false' | jq . { "count": 1, "next": null, "previous": null, "results": [ { "asset_id": "e49b0e5d-16d2-49a8-9af7-f93398f0c672", "blob": null, "zarr": "5bf2bd49-490e-4212-bd13-dca3f3399d65", "path": "sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-15_stain-LEC_run-1_chunk-6_SPIM.ome.zarr", "size": 38783087406, "created": "2023-02-03T21:16:24.990201Z", "modified": "2023-02-07T16:57:59.631025Z" } ] } ``` so -- we are having a different zarr id for that path. Ok, I thought it might be just a matter of an unfinished zarr upload which we should GC (note -- it seems we do not have any time stamp for zarr record to judge how recent it is). Then I looked at that zarr returned above: ```shell ❯ curl --silent 'https://api.dandiarchive.org/api/zarr/5bf2bd49-490e-4212-bd13-dca3f3399d65/' | jq . { "name": "sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-SChmi53_15_stain-LEC_run-1_chunk-6_SPIM.ome.zarr", "dandiset": "000108", "zarr_id": "5bf2bd49-490e-4212-bd13-dca3f3399d65", "status": "Complete", "checksum": "28e105c439865e01b19831be34972fd4-28498--38783087406", "file_count": 28498, "size": 38783087406 } ``` and spotted that the `path` (well -- "name") there is different -- has `_sample-SChmi53_15` not `_sample-15`. - `name` value in the zarr could differ from the `path` in the asset, so we have data redundancy/duplication and thus need extra checks -- thus [another candidate](https://github.com/search?q=repo%3Adandi%2Fdandi-archive+fsck&type=issues) for `fsck` style of verification. But I guess it is all not needed since it is just "name" and not "path", and we even have ongoing (slowly) discussion to get rid of that name - #1141

Looking at zarr record here it seems that we do not have sufficient metadata to judge if such zarr upload should be considered "stale" and be GC'ed, and that zarr be removed from DB and S3 after we verify (how??? expand above on "name" vs "path") that this particular zarr is not used in any asset. Should we add some "modified" or "created" field?

yarikoptic commented 9 months ago
Using this script I compiled a full list of "odd" zarrs -- either having another zarr in their "name", or not having DB entry at all, or ... ```shell #!/bin/bash set -eu p=s3://dandiarchive/zarr #echo _ 383b7eb7-d589-40a5-86a9-1ade0b04922c/ \ aws --no-sign-request s3 ls $p/ \ | while read _ zarr; do z=${zarr%*/} # DEBUG since we crashed #echo "$zarr: " m=$(curl --silent "https://api.dandiarchive.org/api/zarr/$z/") if echo "$m" | grep -q '"Not found."'; then echo "$zarr: not found on API server" continue fi name=$(echo $m | jq -r .name) dandiset=$(echo $m | jq -r .dandiset) r=$(curl --silent "https://api.dandiarchive.org/api/dandisets/$dandiset/versions/draft/assets/?path=$name&metadata=false") count=$(echo $r | jq -r .count) if [ "$count" = 0 ]; then echo "$zarr: has name='$name' but there is no asset at that path in $dandiset" continue elif [ "$count" != 1 ]; then echo "$zarr: has multiple ($count) hits among assets for name='$name' in $dandiset" exit 1 # must not happen fi path=$(echo $r | jq -r .results[0].path) asset_zarr=$(echo $r | jq -r .results[0].zarr) if [ "$name" != "$path" ]; then echo "$zarr: has name='$name' but asset for that has path '$path' in $dandiset" echo $r | jq . break fi if [ "$z" != "$asset_zarr" ]; then echo "$zarr: asset for the name='$name' has asset zarr '$asset_zarr' in $dandiset" fi done ```

and there is a full log : http://www.oneukrainian.com/tmp/check_zarr_paths.log containing 395 records like

❯ head check_zarr_paths.log
00c111e3-3dac-4e4e-9ca0-1ee1a1a67059/: has name='sub-SChmi53/ses-20220923h13m25s47/micr/sub-SChmi53_ses-20220923h13m25s47_sample-SChmi53_28_stain-NN_run-1_chunk-5_SPIM.ome.zarr' but there is no asset at that path in 000108
00d8ce95-90d3-4fd0-a8b0-61e92d985720/: asset for the name='sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-15_stain-LEC_run-1_chunk-6_SPIM.ome.zarr' has asset zarr '5bf2bd49-490e-4212-bd13-dca3f3399d65' in 000108
012511dd-72e1-4e28-80b4-8dedbeaf8462/: has name='sub-SChmi53/ses-20220916h10m54s03/micr/sub-SChmi53_ses-20220916h10m54s03_sample-SChmi53_32_stain-LEC_run-1_chunk-5_SPIM.ome.zarr' but there is no asset at that path in 000108
01911d14-a272-49d2-8556-b1333e5ef748/: has name='sub-SChmi53/ses-20211209h15m28s04/micr/sub-SChmi53_ses-20211209h15m28s04_sample-SChmi53_15_stain-YO_run-1_chunk-8_SPIM.ome.zarr' but there is no asset at that path in 000108
03cf98fe-0e9c-4180-bfa6-f1b4ef9471d3/: has name='sub-MITU01/ses-20220309h18m20s33/micr/sub-MITU01h3_ses-20220309h18m20s33_sample-17_stain-NN_run-1_chunk-6_SPIM.ome.zarr' but there is no asset at that path in 000108
03e43250-6de1-4e7a-aa5c-81f2e9662a2e/: has name='sub-SChmi53/ses-20220927h15m28s55/micr/sub-SChmi53_ses-20220927h15m28s55_sample-SChmi53_28_stain-YO_run-1_chunk-1_SPIM.ome.zarr' but there is no asset at that path in 000108
03f822dc-9f10-4454-bedf-0919cfb83a20/: has name='sub-SChmi53/ses-20220927h13m20s15/micr/sub-SChmi53_ses-20220927h13m20s15_sample-SChmi53_25_stain-LEC_run-1_chunk-5_SPIM.ome.zarr' but there is no asset at that path in 000108
0587160a-c8d3-4467-9a84-c8a5cc89abad/: has name='sub-SChmi53/ses-20220923h11m36s19/micr/sub-SChmi53_ses-20220923h11m36s19_sample-SChmi53_24_stain-NN_run-1_chunk-4_SPIM.ome.zarr' but there is no asset at that path in 000108
0592459a-f9a8-4631-a206-89bdfa1803fa/: has name='sub-SChmi53/ses-20220927h13m20s15/micr/sub-SChmi53_ses-20220927h13m20s15_sample-SChmi53_25_stain-YO_run-1_chunk-1_SPIM.ome.zarr' but there is no asset at that path in 000108
05a3eaef-f36a-4af4-9840-6988bb839e4b/: has name='sub-SChmi53/ses-20220114h16m04s17/micr/sub-SChmi53_ses-20220114h16m04s17_sample-SChmi53_9_stain-LEC_run-1_chunk-1_SPIM.ome.zarr' but there is no asset at that path in 000108

note: backref

jjnesbitt commented 9 months ago

Zarrs do in fact have a modified field, although I don't think it's really used at the moment. Fixing the use of that field is part of a general refactor of the code base which we'd like to do, but that's a bit of an aside. So to answer your question, once we fix our use of that field, we would probably have the info we need to GC abandoned zarr uploads.

More generally, I think a reasonable approach would be to give some timeout to PENDING zarrs - something 10 days - at which point if no action has been taken, an email could be sent to the owner of the dandiset which it belongs to, notifying them that the zarr will be garbage collected if no action is taken within the next x days.

yarikoptic commented 9 months ago

sounds good! I guess it would make sense to expose that modified field within the zarr record as a first step for it to become used -- I would then immediately assess on which zarrs without checksum are too old and we could interim contact owner manually to confirm that we could remove them.