dandi / dandisets-healthstatus

Healthchecks of dandisets and support libraries (pynwb and matnwb)

Some (all?) are false positives: Store outputs from running tests #19

Closed: yarikoptic closed this issue 1 year ago

yarikoptic commented 1 year ago

https://github.com/dandi/dandisets-healthstatus/blob/main/000008/status.yaml#L6 says that the file is "nok" for pynwb, but the test runs and the file loads just fine:

~/proj/dandi/dandisets-fused/000008 ▓▒
❯ python /home/yoh/proj/dandi/dandisets-healthstatus/pynwb_open_load_ns.py sub-mouse-AEJGZ/sub-mouse-AEJGZ_ses-20180315-sample-5_slice-20180315-slice-5_cell-20180315-sample-5_icephys.nwb
/home/yoh/proj/dandi/dandisets-healthstatus/venvs/dev3/lib/python3.10/site-packages/hdmf/spec/namespace.py:531: UserWarning: Ignoring cached namespace 'hdmf-common' version 1.3.0 because version 1.5.1 is already loaded.
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
/home/yoh/proj/dandi/dandisets-healthstatus/venvs/dev3/lib/python3.10/site-packages/hdmf/spec/namespace.py:531: UserWarning: Ignoring cached namespace 'core' version 2.2.5 because version 2.5.0 is already loaded.
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
/home/yoh/proj/dandi/dandisets-healthstatus/venvs/dev3/lib/python3.10/site-packages/pynwb/icephys.py:187: UserWarning: Stimulus description 'NA' for IZeroClampSeries 'CurrentClampSeries010' is ignored and will be set to 'N/A' as per NWB 2.3.0.
  warnings.warn(
python /home/yoh/proj/dandi/dandisets-healthstatus/pynwb_open_load_ns.py   13.71s user 1.61s system 70% cpu 21.754 total
❯ echo $?
0
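For reference, the check itself boils down to opening the NWB file and loading it with its cached namespaces. A minimal sketch of such a check (not the actual pynwb_open_load_ns.py, just an approximation of what its name suggests) could be:

import sys

from pynwb import NWBHDF5IO

# Hypothetical minimal check: open the NWB file with its cached namespaces
# and read it; an uncaught exception makes the process exit non-zero,
# a clean read exits 0.
path = sys.argv[1]
with NWBHDF5IO(path, mode="r", load_namespaces=True) as io:
    io.read()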

I do not think we want to pollute status.yaml with outputs for each file, but it would be OK to have a separate dedicated file, e.g. outputs.yaml, where for each path we would have stdout/stderr/exit_code fields. That file would likely be large, so let's place it into the annex, keep it on drogon in its original spot, and also push it to some fork on e.g. https://gin.g-node.org/ or elsewhere.
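A rough sketch of how such an outputs.yaml could be produced, with one stdout/stderr/exit_code record per path (the collection loop and file names here are only illustrative):

import subprocess

import yaml  # PyYAML

# Hypothetical collector: run the per-file test and keep its outputs,
# keyed by path, in a file separate from status.yaml.
paths = [
    "sub-mouse-AEJGZ/sub-mouse-AEJGZ_ses-20180315-sample-5_slice-20180315-slice-5_cell-20180315-sample-5_icephys.nwb",
]
outputs = {}
for path in paths:
    proc = subprocess.run(
        ["python", "pynwb_open_load_ns.py", path],
        capture_output=True,
        text=True,
    )
    outputs[path] = {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
    }

with open("outputs.yaml", "w") as f:
    yaml.safe_dump(outputs, f)

The resulting file could then go into git-annex (e.g. git annex add outputs.yaml) so the bulky content stays out of regular git history.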

yarikoptic commented 1 year ago

Apparently, logging of combined stdout/stderr to files in case of errors is already there: https://github.com/dandi/dandisets-healthstatus/blob/main/healthstatus.py#L354, and per-run (datestamped) files are produced.
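For reference, a rough sketch of that pattern (not the actual healthstatus.py implementation; the function and its arguments are illustrative): run the test, capture combined stdout/stderr, and write it to a per-run, datestamped, per-dandiset error log only on failure:

import subprocess
from datetime import datetime
from pathlib import Path

def run_test(dandiset: str, testname: str, cmd: list[str]) -> int:
    # Capture combined stdout/stderr of the test command.
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    if proc.returncode != 0:
        # Datestamped, per-run error log, along the lines of
        # 000008/2023.01.03.12.48.25_pynwb_open_load_ns_errors.log
        stamp = datetime.now().strftime("%Y.%m.%d.%H.%M.%S")
        logfile = Path(dandiset) / f"{stamp}_{testname}_errors.log"
        logfile.write_text(proc.stdout)
    return proc.returncode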

On drogon, under /home/dandi/cronlib/dandisets-healthstatus, we have an uncommitted state of the repo, with

(base) dandi@drogon:~/cronlib/dandisets-healthstatus$ grep Error 000008/2023.01.03.12.48.25_pynwb_open_load_ns_errors.log | sort | uniq -c
     10     OSError: Unable to read attribute (bad global heap collection signature)
      1     RuntimeError: Error iterating over attributes (attribute name has different length than stored length)
     32     RuntimeError: Unable to get group info (wrong B-tree signature)

This suggests some kind of an IO issue, possibly due to parallel (async) access to different files through the same fsspec process. So most likely the underlying issue is related to the FUSE layer, and indeed fuse.log (which seems to be overwritten by the code; its mtime is now Jan 14 09:38, almost two weeks later than the pynwb log, but the run is still that slow/long) has lots of errors:

(base) dandi@drogon:~/cronlib/dandisets-healthstatus$ grep Error fuse.log | sed -e 's,KeyError: .https://dandiarchive.s3.amazonaws.com/blobs/.*,KeyError: https://dandiarchive.s3.amazonaws.com/blobs/BLOBG,g'  -e 's,KeyError: [0-9]\+, KeyError: INDEX,g' | sort | uniq -c | sort -n
      1 aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable', url=URL('https://dandiarchive.s3.amazonaws.com/blobs/925/d07/925d0754-f356-4fd4-8a94-6b43fb3950d8?response-content-disposition=attachment%3B%20filename%3D%22sub-YutaMouse54_ses-YutaMouse54-160701_behavior%2Becephys.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E/20230104/us-east-2/s3/aws4_request&X-Amz-Date=20230104T073007Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=31a21b4efbb0bf6fabe30ed52a13ea71f45fdcc7b0089558f956051a23e7a53c')
      1 aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable', url=URL('https://dandiarchive.s3.amazonaws.com/blobs/b1c/e49/b1ce4966-9f9f-4862-a6bd-2be667742fc0?response-content-disposition=attachment%3B%20filename%3D%22sub-738651046_ses-760693773_probe-769322824_ecephys.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E/20230105/us-east-2/s3/aws4_request&X-Amz-Date=20230105T070003Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=35d79b8f7a60e5250b86372fcff1f28958944a83f86a31753efb9adb47601858')
      2     raise ClientResponseError(
     19 asyncio.exceptions.TimeoutError
     19 fsspec.exceptions.FSTimeoutError
     19     raise asyncio.TimeoutError from None
     19     raise FSTimeoutError from return_result
     22 aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host dandiarchive.s3.amazonaws.com:443 ssl:default [None]
     22 ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection
     22 NameError: name 'self' is not defined
     22 TypeError: '>' not supported between instances of 'NoneType' and 'int'
    191  KeyError: INDEX
    209 KeyError: https://dandiarchive.s3.amazonaws.com/blobs/BLOBG
    249 AttributeError: 'list' object has no attribute 'update'
    367 aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected

where some (AttributeError: 'list' object has no attribute 'update') are reminiscent of the problem I thought we had worked around with locking, so it's odd... maybe a non-patched fsspec was used? I see that datalad-fuse without locking at its level was requested:

(base) dandi@drogon:~/cronlib/dandisets-healthstatus$ git diff requirements.txt
diff --git a/requirements.txt b/requirements.txt
index 14ab0f3..4b19e83 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,6 +1,7 @@
 anyio ~= 3.6
 click >= 8.0
-datalad-fuse
+#datalad-fuse
+datalad-fuse @ git+https://github.com/datalad/datalad-fuse@undo-fuse-lock
 importlib-metadata; python_version < "3.8"
 hdmf
 pynwb

I hope @jwodder could shed better light on where this was left off, but I feel we had better polish the run on a single dandiset (e.g. 000008) instead of trying to sweep through all of them and waiting forever for it to complete.

yarikoptic commented 1 year ago

@jwodder already implemented this - we do have the logs stored, and the original idea apparently was to store them but not to commit them. So, let's close this one.