dandi / dandisets

735 Dandisets, 812.2 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

Use non-dandiset bound endpoint for the asset info for "delayed" query #296

Closed: yarikoptic closed 1 year ago

yarikoptic commented 1 year ago

Current rerun while dealing with #293 on 000108 errored out with

  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/syncer.py", line 36, in sync_assets
    self.report = await async_assets(
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/asyncer.py", line 499, in async_assets
    nursery.start_soon(dm.read_addurl)
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 526, in sync_zarr
    await zsync.run()
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/zarr.py", line 242, in run
    modern_asset = await self.asset.refetch()
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/adandi.py", line 397, in refetch
    return await self.dandiset.aget_asset(self.identifier)
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/adandi.py", line 201, in aget_asset
    info = await self.aclient.get(f"{self.version_api_path}assets/{asset_id}/info/")
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/adandi.py", line 64, in get
    return (await arequest(self.session, "GET", path, **kwargs)).json()
  File "/mnt/backup/dandi/dandisets/tools/backups2datalad/aioutil.py", line 123, in arequest
    r.raise_for_status()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/httpx/_models.py", line 1510, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://api.dandiarchive.org/api/dandisets/000108/versions/draft/assets/3219c6ab-b75d-4140-922b-fa288e4c9c65/info/'
For more information check: https://httpstatuses.com/404

which is a "legit" error, in that the asset has likely been replaced with another one (see the first PUT below):

(venv) (base) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep 3219c6ab-b75d-4140-922b-fa288e4c9c65 2022110[89]*
20221108-1401.log:2022-11-08T19:49:44.781534+00:00 app[web.1]: 10.1.86.206 - - [08/Nov/2022:19:49:44 +0000] "PUT /api/dandisets/000108/versions/draft/assets/3219c6ab-b75d-4140-922b-fa288e4c9c65/ HTTP/1.1" 200 1589 "-" "dandi/0.46.3 requests/2.25.1 CPython/3.8.10"
20221108-1401.log:2022-11-08T19:49:44.780541+00:00 heroku[router]: at=info method=PUT path="/api/dandisets/000108/versions/draft/assets/3219c6ab-b75d-4140-922b-fa288e4c9c65/" host=api.dandiarchive.org request_id=d0d728ab-f6d2-4e5e-b22f-da5c1c037929 fwd="18.18.93.11" dyno=web.1 connect=0ms service=464ms status=200 bytes=1978 protocol=https
20221109-0901.log:2022-11-09T14:15:23.409650+00:00 heroku[router]: at=info method=GET path="/api/dandisets/000108/versions/draft/assets/3219c6ab-b75d-4140-922b-fa288e4c9c65/info/" host=api.dandiarchive.org request_id=897f109a-f075-45a0-8891-ae21807c3e43 fwd="129.170.233.10" dyno=web.1 connect=0ms service=22ms status=404 bytes=404 protocol=https
20221109-0901.log:2022-11-09T14:15:23.485363+00:00 app[web.1]: 10.1.87.225 - - [09/Nov/2022:14:15:23 +0000] "GET /api/dandisets/000108/versions/draft/assets/3219c6ab-b75d-4140-922b-fa288e4c9c65/info/ HTTP/1.1" 404 23 "-" "backups2datalad (https://github.com/dandi/dandisets) httpx/0.22.0 CPython/3.8.13"
20221109-0901.log:2022-11-09T14:25:53.864567+00:00 heroku[router]: at=info method=GET path="/api/dandisets/000108/versions/draft/assets/3219c6ab-b75d-4140-922b-fa288e4c9c65/info/" host=api.dandiarchive.org request_id=5a057b05-6f39-412e-815e-ca1626dc370b fwd="76.24.253.1" dyno=web.1 connect=0ms service=12ms status=404 bytes=404 protocol=https
20221109-0901.log:2022-11-09T14:25:53.863552+00:00 app[web.1]: 10.1.95.107 - - [09/Nov/2022:14:25:53 +0000] "GET /api/dandisets/000108/versions/draft/assets/3219c6ab-b75d-4140-922b-fa288e4c9c65/info/ HTTP/1.1" 404 23 "-" "curl/7.85.0"

So indeed, if a dandiset's assets change between the moment the list was initially obtained and the moment more information about an asset is queried, that asset might no longer be associated with the dandiset, hence the 404. BUT information about the asset (now possibly "loose" and subject to eventual GC) would still be available from the generic endpoint, so I think we should use that one here instead:

❯ curl --silent https://api.dandiarchive.org/api/assets/3219c6ab-b75d-4140-922b-fa288e4c9c65/info/ | jq . | head
{
  "asset_id": "3219c6ab-b75d-4140-922b-fa288e4c9c65",
  "blob": null,
  "zarr": "ad50ab40-2346-4528-b077-41eedf00c090",
  "path": "sub-MITU01/ses-20220316h10m52s23/micr/sub-MITU01_ses-20220316h10m52s23_sample-12_stain-YO_run-1_chunk-2_SPIM.ome.zarr",
  "size": 68128106957,
  "created": "2022-11-08T18:06:04.549362Z",
  "modified": "2022-11-08T18:06:04.549388Z",
  "metadata": {
    "id": "dandiasset:3219c6ab-b75d-4140-922b-fa288e4c9c65",
...

I could be wrong though (I didn't check if both endpoints return identical records, but I assume so).
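A minimal sketch of the proposed switch, not the actual backups2datalad code: the two URL shapes are taken from the traceback and the curl call above, while `refetch_asset_info`, `NotFound`, and the injected `get_json` callable are hypothetical names used only for illustration (real code would go through the existing async client and catch `httpx.HTTPStatusError`).

```python
from typing import Any, Callable, Dict

API_URL = "https://api.dandiarchive.org/api"


class NotFound(Exception):
    """Hypothetical stand-in for an HTTP 404 response."""


def bound_info_url(dandiset_id: str, version_id: str, asset_id: str) -> str:
    """Dandiset-bound endpoint (the one that returned 404 above)."""
    return (
        f"{API_URL}/dandisets/{dandiset_id}/versions/{version_id}"
        f"/assets/{asset_id}/info/"
    )


def generic_info_url(asset_id: str) -> str:
    """Generic endpoint, keyed by the asset ID alone."""
    return f"{API_URL}/assets/{asset_id}/info/"


def refetch_asset_info(
    get_json: Callable[[str], Dict[str, Any]], asset_id: str
) -> Dict[str, Any]:
    """Re-query an asset without going through its (former) dandiset.

    `get_json` is an assumed callable that performs a GET request and
    returns the decoded JSON body.
    """
    return get_json(generic_info_url(asset_id))
```

Injecting `get_json` keeps the sketch independent of any particular HTTP client; whether both endpoints return identical records is still unverified, as noted above.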

But a related aspect/question -- why do we have this delayed, dedicated per-asset query at all? If it is to get metadata for an asset because the listing was done without metadata -- the dandisets_version_asset_list endpoint now has a metadata parameter, so we could get all the desired metadata (if that is what this call is for) while listing the dandiset's assets, and thereby have a better chance of getting a consistent listing of assets with their metadata. So the solution might be two-tiered: use the endpoint not bound to a dandiset to guarantee robustness in this and possibly other code paths, and then switch to fetching metadata while listing all assets, thus avoiding the per-asset querying.
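A rough sketch of that second tier, under stated assumptions: the query-string form `metadata=true` and the paginated response shape (`results` plus a `next` URL) are assumptions about the asset-list endpoint rather than something confirmed in this thread, and `iter_assets_with_metadata` and `get_json` are hypothetical names.

```python
from typing import Any, Callable, Dict, Iterator, Optional

API_URL = "https://api.dandiarchive.org/api"


def iter_assets_with_metadata(
    get_json: Callable[[str], Dict[str, Any]],
    dandiset_id: str,
    version_id: str = "draft",
) -> Iterator[Dict[str, Any]]:
    """Yield every asset record of a version, metadata included.

    Assumes each page looks like {"results": [...], "next": <url or None>}
    and that `get_json` performs a GET and returns the decoded JSON body.
    """
    url: Optional[str] = (
        f"{API_URL}/dandisets/{dandiset_id}/versions/{version_id}"
        "/assets/?metadata=true"  # assumed spelling of the parameter
    )
    while url is not None:
        page = get_json(url)
        yield from page["results"]
        # Follow the server-provided link to the next page, if any.
        url = page.get("next")
```

Fetching metadata together with the listing removes the window between "list" and "query details" for each asset, although changes that land between pages could still produce an inconsistent view.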

jwodder commented 1 year ago

@yarikoptic At the end of backing up a Zarr, data for the Zarr asset is requested again in order to check whether the asset's modified timestamp changed during the backup. That's the request that failed here.
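For illustration only, the end-of-backup check described here could look roughly like the following. The function names are hypothetical, and the timestamp format is taken from the `modified` field in the curl output earlier in the thread; the real logic lives in `zarr.py` and may differ.

```python
from datetime import datetime


def parse_timestamp(ts: str) -> datetime:
    """Parse an API timestamp such as '2022-11-08T18:06:04.549388Z'.

    datetime.fromisoformat() on Python 3.8 does not accept a trailing 'Z',
    so it is rewritten as an explicit UTC offset first.
    """
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    return datetime.fromisoformat(ts)


def changed_during_backup(modified_at_start: str, modified_at_end: str) -> bool:
    """Return True if the asset's modified timestamp differs after the backup."""
    return parse_timestamp(modified_at_start) != parse_timestamp(modified_at_end)
```

Comparing parsed datetimes rather than raw strings avoids false positives when the same instant is serialized in two different ways.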

yarikoptic commented 1 year ago

I see -- thanks for the explanation! So the second aspect (getting metadata right away) is not pertinent. But as for the first one, ensuring that that particular asset didn't change -- I think we should just use the /assets/ endpoint not bound to the dandiset, since we are aiming to reach the state of the dandiset as it was with those assets (even if it has changed since then). Agree?