dandi / dandisets-healthstatus

Healthchecks of dandisets and support libraries (pynwb and matnwb)

stuck process in D -- I killed davfs2 #90

Open yarikoptic opened 3 weeks ago

yarikoptic commented 3 weeks ago

I am not sure if this is the same issue as

I see some zombies, but the problem seems to be a process stuck in D state (so not killable at all?):

```
dandi    2630713  0.0  0.0   8980  1560 ?        Ss   Aug07   0:00   SCREEN
dandi    2630714  0.0  0.0  10620  2816 pts/1    Ss   Aug07   0:00     /bin/bash
dandi    2630723  0.0  0.0   2580  1536 pts/1    S+   Aug07   0:00       /bin/sh tools/run_loop.sh
dandi    2949169  0.0  0.0  16300  7168 pts/1    S+   Aug10   3:56         /usr/bin/perl /usr/bin/chronic ./run.sh --mode random-outdated-asset-first
dandi    2949189  0.0  0.0   6932  1792 pts/1    S+   Aug10   0:00           /bin/bash ./run.sh --mode random-outdated-asset-first
dandi    2958341  0.0  0.0 1421476 2620 pts/1    Sl+  Aug10   3:15             /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/bin/dandisets-healthstatus check -m /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse -J 10 --mode random-outdated-asset-first
dandi     591845  0.0  0.0      0     0 pts/1    Zl+  Aug10   0:02               [python] <defunct>
dandi     592572  0.0  0.0      0     0 pts/1    Zl+  Aug10   0:02               [python] <defunct>
dandi     592663  0.0  0.0 412932  2988 pts/1    D+   Aug10   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000710/draft/sub-VR32/sub-VR32_ecephys.nwb
```

which has been running for two weeks now. We also have a bunch of stuck `df` processes, and overall it all seems to be due to a stuck davfs mount.
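As an aside, those states can be picked out of `ps auxw` output mechanically. A minimal sketch (illustrative only, not part of the healthstatus code; the sample lines are abridged from the listing above):

```python
# Illustrative sketch: flag processes whose ps STAT field contains "D"
# (uninterruptible sleep, typically stuck on I/O) or "Z" (zombie).
# Feed it the output of `ps auxw`.

def flag_stuck(ps_output):
    """Return (pid, stat) pairs for D-state or zombie processes."""
    flagged = []
    for line in ps_output.splitlines():
        # ps auxw columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        pid, stat = fields[1], fields[7]
        if "D" in stat or "Z" in stat:
            flagged.append((pid, stat))
    return flagged

sample = (
    "dandi     591845  0.0  0.0      0     0 pts/1    Zl+  Aug10   0:02 [python] <defunct>\n"
    "dandi     592663  0.0  0.0 412932  2988 pts/1    D+   Aug10   0:02 python pynwb_open_load_ns.py\n"
    "dandi    2630714  0.0  0.0  10620  2816 pts/1    Ss   Aug07   0:00 /bin/bash\n"
)
print(flag_stuck(sample))  # [('591845', 'Zl+'), ('592663', 'D+')]
```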

@jwodder, I see that we mount it with `rw`:

```
davfs2   2958663  2.2  0.0  56396  5940 ?        Ss   Aug10 424:37 /sbin/mount.davfs https://webdav.dandiarchive.org /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse -o rw
```

Is there a reason? I think `ro` would be sufficient for our use case, wouldn't it? If so, what should be adjusted? If you could, please do that.

Meanwhile I killed that davfs2 process, since the whole thing was stuck. So this healthstatus run might report more errors than usual. The script is now sleeping for its 600 seconds before the next round, when I guess it will mount it again.

yarikoptic commented 3 weeks ago

Meanwhile I am looking into upgrading davfs2 on drogon to 1.7.0 (backporting the package from Debian testing), so that before I complain we are at least using the most recent version.

jwodder commented 3 weeks ago

@yarikoptic

> I think ro would be sufficient for our use case, wouldn't it? if so -- what should be adjusted - if you could, please do that.

The mount command will need to be adjusted in both the source code and the sudoers file to include -o ro.

yarikoptic commented 3 weeks ago

OK, I added a line with `-t davfs -o ro` to sudoers, so both could be used now. Please make the necessary code adjustments and restart that script.

jwodder commented 3 weeks ago

@yarikoptic FYI, the current run failed at the mount stage with:

```
healthstatus: Mounting davfs2 mount ...
/sbin/mount.davfs: found PID file /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid.
Either /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse is used by another process,
or another mount process ended irregular
```

yarikoptic commented 3 weeks ago

removed now

```
root@drogon:/mnt/backup# cat /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid
2958663
root@drogon:/mnt/backup# ps auxw | grep `!!`
ps auxw | grep `cat /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid`
root      141051  0.0  0.0   6332  2176 pts/22   S+   12:17   0:00 grep 2958663
root@drogon:/mnt/backup# rm /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid
```
jwodder commented 3 weeks ago

@yarikoptic The script is now running using a read-only mount.

yarikoptic commented 3 weeks ago

Coolio, and the davfs2 package was updated. Let's see where we get! ATM it looks healthy-ish, albeit slow:

```
dandi    2630713  0.0  0.0   9244  2584 ?        Ss   Aug07   0:00   SCREEN
dandi    2630714  0.0  0.0  11676  5504 pts/1    Ss   Aug07   0:00     /bin/bash
dandi     145300  0.0  0.0   2580  1536 pts/1    S+   12:34   0:00       /bin/sh tools/run_loop.sh
dandi     145302  0.0  0.0  15872 10752 pts/1    S+   12:34   0:01         /usr/bin/perl /usr/bin/chronic ./run.sh --mode random-outdated-asset-first
dandi     145303  0.0  0.0   6932  3200 pts/1    S+   12:34   0:00           /bin/bash ./run.sh --mode random-outdated-asset-first
dandi     145557  2.4  0.5 1454332 342136 pts/1  Sl+  12:34   2:36             /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/bin/dandisets-healthstatus check -m /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse -J 10 --mode random-outdated-asset-first
dandi     211730  0.0  0.1 413520 90132 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000051/draft/pons8-yo_16xdownsampled.nwb
dandi     211732  0.0  0.1 413520 90208 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000056/draft/sub-Mouse24/sub-Mouse24_ses-Mouse24-131216_behavior+ecephys.nwb
dandi     211734  0.0  0.1 413520 90080 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000054/draft/sub-R6/sub-R6_ses-20200209T210000_obj-1ouyda4_behavior+ophys.nwb
dandi     211739  0.0  0.1 413524 90092 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000055/draft/sub-09/sub-09_ses-5_behavior+ecephys.nwb
dandi     211747  0.0  0.1 413516 89644 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000053/draft/sub-Ella/sub-Ella_ses-20190402_behavior.nwb
dandi     211756  0.0  0.1 412924 89376 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000059/draft/sub-MS22/sub-MS22_ses-Peter-MS22-180712-102504-concat_desc-raw_ecephys.nwb
dandi     212244  0.0  0.1 412952 89376 pts/1    Sl+  13:40   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000061/draft/sub-Rat10/sub-Rat10_ses-Rat10-20140704_ecephys+image.nwb
dandi     212246  0.0  0.1 412872 89668 pts/1    Sl+  13:40   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000060/draft/sub-359855/sub-359855_ses-20161221_behavior+ecephys+ogen.nwb
dandi     212251  0.0  0.1 412920 89376 pts/1    Sl+  13:40   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000064/draft/sub-001/sub-001.nwb
dandi     215474  0.1  0.1 412996 89376 pts/1    Sl+  13:47   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000065/draft/sub-Kibbles/sub-Kibbles_behavior+ecephys.nwb
davfs2    145565 19.0  0.0  54176 10548 ?        Ss   12:34  20:32   /sbin/mount.davfs https://webdav.dandiarchive.org /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse -o ro
```

"slow" as above procetest processes seems to be already half an hour old, and it is only the mount.davfs which I see CPU busy (at 10-20%). Oddly those test processes are in S not in D mode.

yarikoptic commented 3 weeks ago

davfs2 stalled again already (after only a few hours)... It is not yet 100% stalled, but has become super slow even for a `df` or `ls` call.

Some gory details which lead me to a conclusion that might explain the observed behavior: davfs2 (at least sometimes) decides to download the full file instead of accessing it sparsely, e.g.

```
$> time ls -l /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse
total 0
drwxr-xr-x  486 root root 0 Aug 23 12:34 dandisets/
drwx------    2 root root 0 Aug 23 12:18 lost+found/
drwxr-xr-x 3007 root root 0 Aug 23 12:34 zarrs/
LC_COLLATE=POSIX ls -bCF --color=auto -l  0.00s user 0.00s system 0% cpu 1.853 total
```

and a subsequent `df` was also faster than the first one. After a few seconds, `df` became super slow again... it might be because it is very busy with IO itself, trying to cache locally potentially the entire file it is accessing!!!

```
root@drogon:/home/yoh# du -scm /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/
300111	/mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/
300111	total
root@drogon:/home/yoh# du -scm --apparent-size /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/
300646	/mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/
300646	total
```

so we have 300G of cache, even though we have

```
root@drogon:/home/yoh# grep cache_size /etc/davfs2/davfs2.conf
# cache_size 50 # MiByte
cache_size 10000
```

so the cache should not exceed 10G?
And the most recently modified, still-growing file is

```
root@drogon:/home/yoh# ls -ld /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
-rw------- 1 davfs2 davfs2 18067083509 Aug 23 16:45 /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
root@drogon:/home/yoh# ls -ld /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
-rw------- 1 davfs2 davfs2 18239804453 Aug 23 16:45 /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
```

I "guess" it corresponds to this one

```
/mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
```

which is

```
❯ dandi ls -f pyout dandi://dandi/000168@draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
ASSET_ID                 PATH                                    SIZE     CREATED                  MODIFIED                   BLOB
8fd3b1a0-3dab-4de2-82... jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb 19.4 GB  2022-01-21T03:36:10.7... 2022-01-21T03:50:24.00...  030b38c7-1196-4352-b8.
```

so 19GB, for this call

```
dandi 402971 0.3 0.1 413524 90176 pts/1 Sl+ 16:40 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
```

so -- python, and we know that python does "sparse access".
Logs from davfs for it:

```
$> journalctl -rg jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:45:31 drogon mount.davfs[145565]: getattr /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:45:31 drogon mount.davfs[145565]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:40:49 drogon mount.davfs[145565]: open /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:40:49 drogon mount.davfs[145565]: lookup /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:40:49 drogon mount.davfs[145565]: p 0x55e27e4678c0, jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:40:49 drogon mount.davfs[145565]: added /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 10 13:14:57 drogon mount.davfs[2958663]: /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 10 12:59:26 drogon mount.davfs[2958663]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 16:22:30 drogon mount.davfs[1911909]: /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 16:21:20 drogon mount.davfs[1911909]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:32:06 drogon mount.davfs[1911909]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:30:00 drogon mount.davfs[1911909]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:27:03 drogon mount.davfs[1911909]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:23:32 drogon mount.davfs[1911909]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:19:34 drogon mount.davfs[1911909]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:16:23 drogon mount.davfs[1911909]: added /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 04 15:06:58 drogon mount.davfs[3226681]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 03 21:02:12 drogon mount.davfs[4048318]: /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 03 20:59:13 drogon mount.davfs[4048318]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 03 01:02:42 drogon mount.davfs[3268345]: /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 03 00:58:08 drogon mount.davfs[3268345]: /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 03 00:43:47 drogon mount.davfs[3268345]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 02:20:27 drogon mount.davfs[4084327]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 02:17:47 drogon mount.davfs[4084327]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 02:14:13 drogon mount.davfs[4084327]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 02:03:06 drogon mount.davfs[4084327]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 01:51:26 drogon mount.davfs[4084327]: /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 01:48:02 drogon mount.davfs[4084327]: added /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Jul 31 10:16:13 drogon mount.davfs[1551830]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
```

and it seems we have downloaded it fully:

```shell
❯ dandi ls -f json_pp dandi://dandi/000168@draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb | grep size
    "size": 19419844643
root@drogon:/home/dandi# du -scb /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
19419844643	/mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
```
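One way to confirm the full-download suspicion is simply to watch the cache entry's size converge on the API-reported asset size (19419844643 bytes above). A minimal sketch (`download_fraction` is a hypothetical helper, demonstrated here on a throwaway local file standing in for the cache entry):

```python
# Illustrative sketch: how much of an asset davfs2 has already pulled into
# its cache entry, relative to the API-reported size.
import os
import tempfile

def download_fraction(cache_file, full_size):
    """Fraction of the asset already present in the cache entry."""
    return os.path.getsize(cache_file) / full_size

# demo on a throwaway file standing in for the davfs2 cache entry
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 500)          # 500 bytes "cached" so far
frac = download_fraction(f.name, 1000)
print(frac)                      # 0.5
os.unlink(f.name)
```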
yarikoptic commented 3 weeks ago

Looking at posts like https://savannah.nongnu.org/support/?110422 suggests that there is actually NO sparse caching, and a full download of the file is expected! Did you check what happens to the file while benchmarking davfs2 a while back, @jwodder?

jwodder commented 3 weeks ago

@yarikoptic No.

yarikoptic commented 3 weeks ago

Then davfs2 is likely not an acceptable solution for us. Do you have any ideas/recommendations on how we should proceed?

jwodder commented 3 weeks ago

@yarikoptic My only other idea was to just download the files directly before operating on them, but you rejected that in discussion with Einar.

yarikoptic commented 3 weeks ago

FWIW: I filed https://savannah.nongnu.org/support/index.php?111110 for now.

Also found https://github.com/thehyve/davfs2/blob/main/TODO#L35 https://cvs.savannah.nongnu.org/viewvc/davfs2/davfs2/TODO?view=markup#l35

- ranged requests for GET (partial download)

(edit: interestingly, most of it, including the partial-download item, was removed in 150ce86f45a7cd67235f748a1d3511b3f357cd0a (tag: rel-1-5-0).) So maybe they just gave up on that TODO, since I do not see any handling of Range in the code besides

```
❯ git grep -i '\<range\>'
src/webdav.c:        case 416:           /* Requested Range Not Satisfiable */
```

Full downloads are pretty much prohibitive -- we are in effect observing their cost with this davfs2, which was spending (wasting) most of its time downloading instead of doing a quick sparse download of only the necessary blocks.
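To put a rough number on "prohibitive": only the 19419844643-byte asset size below comes from this thread, while the read pattern (200 ranged reads of 64 KiB, spaced 1 MiB apart) is an assumed stand-in for what sparse HDF5 access might issue:

```python
# Back-of-the-envelope: bytes transferred by sparse (ranged) access vs a
# full download.  The 19419844643-byte size is the real asset size from
# this thread; the (offset, length) read pattern is hypothetical.
FULL_SIZE = 19_419_844_643                      # jGCaMP8f_ANM478349_cell07.nwb
reads = [(offset, 64 * 1024)                    # 200 reads of 64 KiB each
         for offset in range(0, 200 << 20, 1 << 20)]

sparse_bytes = sum(length for _, length in reads)
print(f"sparse reads:  {sparse_bytes:,d} bytes")
print(f"full download: {FULL_SIZE:,d} bytes")
print(f"ratio: {FULL_SIZE / sparse_bytes:,.0f}x more data moved")
```

Even with a generous read pattern, the full download moves three orders of magnitude more data per file.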

I still think that we would be better off with a webdav-based FUSE solution rather than reverting back to the fsspec-based datalad-fuse (maybe with completely disabled caching, to avoid multithreading fiascos)... But it might also be worth checking whether there have been any related changes in fsspec since then.
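For reference, the primitive any sparse-access client builds on (fsspec's HTTP filesystem included) is a GET with a `Range` header, which davfs2 apparently never issues. A stdlib sketch (the URL is a placeholder, and the request is only constructed here, never sent):

```python
# Sketch of the primitive a sparse-access client relies on: an HTTP GET
# with a Range header, fetching only the requested byte span instead of
# the whole multi-GB file.
from urllib.request import Request

def ranged_request(url, offset, length):
    """Build a GET for bytes [offset, offset + length) of a remote file."""
    end = offset + length - 1                    # Range end is inclusive
    return Request(url, headers={"Range": f"bytes={offset}-{end}"})

req = ranged_request("https://example.org/asset.nwb", 1024, 4096)
print(req.get_header("Range"))  # bytes=1024-5119
```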

I will also think about this over the weekend. I would appreciate it if you also looked into alternatives, etc.