dandi / dandisets-healthstatus

Healthchecks of dandisets and support libraries (pynwb and matnwb)

Add timeout for the test job #8

Closed by yarikoptic 1 year ago

yarikoptic commented 1 year ago

I see that two matnwb jobs seem to have just gotten stuck:
 732837 dandi     20   0 6286668 384120  22256 S 100.0   0.6   9444:38 MATLAB
 832307 dandi     20   0 6293812 337772  23176 S  98.1   0.5   9443:58 MATLAB

so we have two MATLAB jobs which should not take that long. Here are the details of the invocations:

dandi     732837 93.4  0.5 6286668 383544 pts/15 Sl+  Dec05 9481:36           /mnt/backup/apps/MATLAB/R2022b/bin/glnxa64/MATLAB -batch nwb = nwbRead('/mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/000008/sub-mouse-AWOAY/sub-mouse-AWOAY_ses-20190829-sample-1_slice-20190829-slice-1_cell-20190829-sample-1_icephys.nwb', 'savedir', '/mnt/fast/dandi/dandisets-healthstatus') -nodesktop -prefersoftwareopengl
dandi     832307 93.4  0.5 6293812 337156 pts/15 Sl+  Dec05 9480:59           /mnt/backup/apps/MATLAB/R2022b/bin/glnxa64/MATLAB -batch nwb = nwbRead('/mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/000008/sub-mouse-BVPYH/sub-mouse-BVPYH_ses-20181121-sample-6_slice-20181121-slice-3_cell-20181121-sample-6_icephys.nwb', 'savedir', '/mnt/fast/dandi/dandisets-healthstatus') -nodesktop -prefersoftwareopengl
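
A minimal sketch of how a per-asset timeout could be enforced, assuming each test is launched as a subprocess from Python. The function name `run_with_timeout` and the outcome labels `"pass"`, `"fail"`, and `"timeout"` are hypothetical and not taken from the repository. `subprocess.run` kills the child and waits for it when the timeout expires, so a stuck job like the two above would not keep consuming CPU.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run one test command and classify it as "pass", "fail", or "timeout".

    The labels are illustrative; the real healthcheck may name them differently.
    """
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child at this point.
        return "timeout"
    return "pass" if result.returncode == 0 else "fail"


if __name__ == "__main__":
    # Stand-in for a long-running MATLAB invocation: sleeps past the limit.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, timeout_seconds=1))
```

One caveat with this sketch: if the launched program spawns children of its own, killing only the direct child may leave them running, in which case starting the command in its own session and signalling the whole process group would be the safer variant.
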
yarikoptic commented 1 year ago

Are you going to do anything to troubleshoot those 2 running processes, @jwodder, or could I just kill them? (They are just wasting CPU AFAIK ATM.)

jwodder commented 1 year ago

@yarikoptic You can kill them.

jwodder commented 1 year ago

@yarikoptic How (if at all) do you want timeouts displayed in the README? (Cf. #4.)

jwodder commented 1 year ago

@yarikoptic When I run MatNWB on the listed files directly (without going through FUSE), they both error out after about 20 seconds with Unable to resolve the name 'types.ndx_dandi_icephys.DandiIcephysMetadata'.

jwodder commented 1 year ago

@yarikoptic Ping.

yarikoptic commented 1 year ago

> @yarikoptic How (if at all) do you want timeouts displayed in the README? (Cf. #4.)

Let's add one more column with timeouts.

> @yarikoptic When I run MatNWB on the listed files directly (without going through FUSE), they both error out after about 20 seconds with Unable to resolve the name 'types.ndx_dandi_icephys.DandiIcephysMetadata'.

and if on fuse'd filesystem -- does it timeout or crash? The point is that if it crashes -- it should have crashed in our healthcheck process too.

Filed https://github.com/NeurodataWithoutBorders/matnwb/issues/481 -- please complement it with any extra information you see missing.

jwodder commented 1 year ago

@yarikoptic Should the timeout column in the summary at the top include the IDs and number of assets for affected Dandisets, like is done for failures?

Also, if some assets of a Dandiset failed their healthchecks and other assets of that Dandiset timed out, should the Dandiset be listed under both "failed" and "timed out" in the summary?

> and if on fuse'd filesystem -- does it timeout or crash?

It errors out as above, except it takes about a minute longer.

yarikoptic commented 1 year ago

> @yarikoptic Should the timeout column in the summary at the top include the IDs and number of assets for affected Dandisets, like is done for failures?

I think uniform presentation would be the easiest to code, so let's do exactly the same -- so with number of assets.

> Also, if some assets of a Dandiset failed their healthchecks and other assets of that Dandiset timed out, should the Dandiset be listed under both "failed" and "timed out" in the summary?

sounds right.
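
A rough sketch of the summary layout agreed on here, assuming per-asset results are available as `(dandiset_id, outcome)` pairs. The function names, the outcome labels, and the output format are hypothetical and only illustrate the idea: each outcome gets its own column of Dandiset IDs with asset counts, and a Dandiset with both failed and timed-out assets appears under both.

```python
from collections import Counter


def summarize(results):
    """Count assets per Dandiset for each outcome.

    `results` is an iterable of (dandiset_id, outcome) pairs, one per asset,
    where outcome is one of "pass", "fail", "timeout" (illustrative labels).
    """
    summary = {"pass": Counter(), "fail": Counter(), "timeout": Counter()}
    for dandiset_id, outcome in results:
        summary[outcome][dandiset_id] += 1
    return summary


def format_column(counts):
    """Render one summary column as 'ID (number of assets)' entries."""
    return ", ".join(f"{d} ({n})" for d, n in sorted(counts.items()))


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the output.
    sample = [
        ("000008", "fail"),
        ("000008", "timeout"),
        ("000008", "timeout"),
        ("000003", "pass"),
    ]
    summary = summarize(sample)
    for outcome in ("pass", "fail", "timeout"):
        print(f"{outcome}: {format_column(summary[outcome])}")
```

Using the same counting and rendering path for failures and timeouts keeps the presentation uniform, which is the property asked for above.
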

> > and if on fuse'd filesystem -- does it timeout or crash?
>
> It errors out as above, except it takes about a minute longer.

hm, so it remains unknown why it was hanging (not crashing) when running within our healthcheck, correct?