Closed by yarikoptic 1 year ago
Apparently, logging of combined stdout/stderr to files is already there in case of errors: https://github.com/dandi/dandisets-healthstatus/blob/main/healthstatus.py#L354 and per-run (datestamped) files are produced.
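For readers unfamiliar with the pattern, here is a minimal sketch of "log combined stdout/stderr to a datestamped file only on error". It is not the code in healthstatus.py: the function name `run_and_log_on_error` and its parameters are hypothetical, and the timestamp format is only inferred from log file names such as `2023.01.03.12.48.25_pynwb_open_load_ns_errors.log`.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def run_and_log_on_error(cmd, log_dir, test_name):
    """Run cmd; if it fails, save its combined stdout/stderr to a datestamped log.

    Hypothetical helper for illustration, not the healthstatus.py implementation.
    Returns the log path on failure and None on success.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the stdout stream
        text=True,
    )
    if result.returncode == 0:
        return None
    # Assumed format, modelled on names like 2023.01.03.12.48.25_..._errors.log
    stamp = datetime.now().strftime("%Y.%m.%d.%H.%M.%S")
    log_path = Path(log_dir) / f"{stamp}_{test_name}_errors.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    log_path.write_text(result.stdout)
    return log_path
```

Nothing is written for successful runs, so the log directory only accumulates files that need attention.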
On drogon, under /home/dandi/cronlib/dandisets-healthstatus, we have an uncommitted state of the repo with:
(base) dandi@drogon:~/cronlib/dandisets-healthstatus$ grep Error 000008/2023.01.03.12.48.25_pynwb_open_load_ns_errors.log | sort | uniq -c
10 OSError: Unable to read attribute (bad global heap collection signature)
1 RuntimeError: Error iterating over attributes (attribute name has different length than stored length)
32 RuntimeError: Unable to get group info (wrong B-tree signature)
This suggests some kind of I/O issue, maybe due to parallel (async) access to different files through the same fsspec process. So most likely the underlying issue is related to FUSE, and indeed fuse.log (it seems to be overwritten by the code; its mtime is now Jan 14 09:38, almost two weeks later than the pynwb log, so the run really is that slow/long) has lots of errors:
(base) dandi@drogon:~/cronlib/dandisets-healthstatus$ grep Error fuse.log | sed -e 's,KeyError: .https://dandiarchive.s3.amazonaws.com/blobs/.*,KeyError: https://dandiarchive.s3.amazonaws.com/blobs/BLOBG,g' -e 's,KeyError: [0-9]\+, KeyError: INDEX,g' | sort | uniq -c | sort -n
1 aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable', url=URL('https://dandiarchive.s3.amazonaws.com/blobs/925/d07/925d0754-f356-4fd4-8a94-6b43fb3950d8?response-content-disposition=attachment%3B%20filename%3D%22sub-YutaMouse54_ses-YutaMouse54-160701_behavior%2Becephys.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E/20230104/us-east-2/s3/aws4_request&X-Amz-Date=20230104T073007Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=31a21b4efbb0bf6fabe30ed52a13ea71f45fdcc7b0089558f956051a23e7a53c')
1 aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable', url=URL('https://dandiarchive.s3.amazonaws.com/blobs/b1c/e49/b1ce4966-9f9f-4862-a6bd-2be667742fc0?response-content-disposition=attachment%3B%20filename%3D%22sub-738651046_ses-760693773_probe-769322824_ecephys.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E/20230105/us-east-2/s3/aws4_request&X-Amz-Date=20230105T070003Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=35d79b8f7a60e5250b86372fcff1f28958944a83f86a31753efb9adb47601858')
2 raise ClientResponseError(
19 asyncio.exceptions.TimeoutError
19 fsspec.exceptions.FSTimeoutError
19 raise asyncio.TimeoutError from None
19 raise FSTimeoutError from return_result
22 aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host dandiarchive.s3.amazonaws.com:443 ssl:default [None]
22 ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection
22 NameError: name 'self' is not defined
22 TypeError: '>' not supported between instances of 'NoneType' and 'int'
191 KeyError: INDEX
209 KeyError: https://dandiarchive.s3.amazonaws.com/blobs/BLOBG
249 AttributeError: 'list' object has no attribute 'update'
367 aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
where some (AttributeError: 'list' object has no attribute 'update') are reminiscent of the problem I thought we had worked around with locking, so odd... maybe a non-patched fsspec was used? I see that datalad-fuse without locking at its level was requested:
(base) dandi@drogon:~/cronlib/dandisets-healthstatus$ git diff requirements.txt
diff --git a/requirements.txt b/requirements.txt
index 14ab0f3..4b19e83 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,6 +1,7 @@
anyio ~= 3.6
click >= 8.0
-datalad-fuse
+#datalad-fuse
+datalad-fuse @ git+https://github.com/datalad/datalad-fuse@undo-fuse-lock
importlib-metadata; python_version < "3.8"
hdmf
pynwb
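As an aside, the `grep | sed | sort | uniq -c | sort -n` tally of fuse.log errors shown earlier can also be done in Python, which may be handier if the summary should end up in a report. This is a rough sketch, not part of the repo: `tally_errors` is a hypothetical name, and unlike the shell pipeline it strips surrounding whitespace before counting.

```python
import re
from collections import Counter

# Same normalisations as the sed expressions: collapse per-blob URLs and
# numeric keys so that errors of the same kind are counted together.
BLOB_KEY = re.compile(r"KeyError: .https://dandiarchive\.s3\.amazonaws\.com/blobs/.*")
INDEX_KEY = re.compile(r"KeyError: [0-9]+")


def tally_errors(lines):
    """Return (message, count) pairs for lines containing 'Error', rarest first."""
    counts = Counter()
    for line in lines:
        if "Error" not in line:  # equivalent of `grep Error`
            continue
        line = line.strip()
        line = BLOB_KEY.sub(
            "KeyError: https://dandiarchive.s3.amazonaws.com/blobs/BLOBG", line
        )
        line = INDEX_KEY.sub("KeyError: INDEX", line)
        counts[line] += 1
    # Ascending by count, like the trailing `sort -n`
    return sorted(counts.items(), key=lambda item: item[1])
```

It could be used as `with open("fuse.log") as f: summary = tally_errors(f)`, then printing each count and message.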
I hope @jwodder can shed more light on where it was left off, but I feel that we had better polish the run on a single dandiset (e.g. 000008) instead of trying to sweep through all of them and waiting forever for it to complete.
@jwodder already implemented this: we do have the logs stored, and the original idea apparently was to store them but not to commit them. So, let's close this one.
https://github.com/dandi/dandisets-healthstatus/blob/main/000008/status.yaml#L6 says that the file is "nok" for pynwb, but the test runs and the file loads just fine:
I do not think we want to pollute status.yaml with outputs for each file, but it would be OK to have a separate dedicated file, e.g. outputs.yaml, where for each path we would have stdout/stderr/exit_code fields. That file would likely be large, so let's place it into the annex, keep it on drogon in its original spot, and also post it to some fork on e.g. https://gin.g-node.org/ or elsewhere.
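To make the proposed layout concrete, here is a minimal sketch of collecting stdout/stderr/exit_code per path. The names `collect_outputs` and `dump_outputs` and the `{path: argv}` input are assumptions for illustration only. The sketch serialises with `json` from the standard library to stay dependency-free; with PyYAML installed, `yaml.safe_dump(outputs)` should give the actual outputs.yaml.

```python
import json
import subprocess


def collect_outputs(commands):
    """Map each path to the stdout, stderr and exit code of its test command.

    `commands` is a hypothetical mapping {path: argv list}, one command per file.
    """
    outputs = {}
    for path, argv in commands.items():
        # stdout and stderr are captured separately, matching the proposed fields
        result = subprocess.run(argv, capture_output=True, text=True)
        outputs[path] = {
            "stdout": result.stdout,
            "stderr": result.stderr,
            "exit_code": result.returncode,
        }
    return outputs


def dump_outputs(outputs, fp):
    """Write the per-path mapping to an open text file (JSON stand-in for YAML)."""
    json.dump(outputs, fp, indent=2, sort_keys=True)
```

Keeping this in a file separate from status.yaml means the status file stays small and diffable, while the bulky outputs can live in the annex.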