Open yarikoptic opened 3 weeks ago
meanwhile I am looking into upgrading davfs2 on drogon to 1.7.0 (backporting the package from Debian testing), so that before I complain we are at least using the most recent version
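For reference, a rough outline of such a backport (a sketch only; it assumes a deb-src entry for testing is configured on drogon, that the usual Debian build tooling is installed, and the amd64 filename is an assumption):

apt-get build-dep davfs2              # install build dependencies
apt-get source davfs2/testing         # fetch the 1.7.0 source package from testing
cd davfs2-1.7.0*/
dpkg-buildpackage -us -uc -b          # build unsigned binary packages
dpkg -i ../davfs2_1.7.0*_amd64.deb    # install the backported package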
@yarikoptic

> I think ro would be sufficient for our use case, wouldn't it? if so -- what should be adjusted - if you could, please do that.

The mount command will need to be adjusted in both the source code and the sudoers file to include -o ro.
ok, to sudoers I added a line with -t davfs -o ro, so both could be used now. Please do the necessary code adjustments and restart that script.
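For concreteness, the read-only mount and a matching sudoers entry would look roughly like this (a sketch; the exact mount binary location, sudoers user, and command line used by the script on drogon are assumptions):

# mount invocation with the read-only option added:
sudo mount -t davfs -o ro https://webdav.dandiarchive.org /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse

# corresponding sudoers rule; the ':' in the URL is escaped since it is a special character in sudoers syntax:
dandi ALL=(root) NOPASSWD: /bin/mount -t davfs -o ro https\://webdav.dandiarchive.org /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse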
@yarikoptic FYI, the current run failed at the mount stage with:
healthstatus: Mounting davfs2 mount ...
/sbin/mount.davfs: found PID file /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid.
Either /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse is used by another process,
or another mount process ended irregular
removed now
root@drogon:/mnt/backup# cat /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid
2958663
root@drogon:/mnt/backup# ps auxw | grep `!!`
ps auxw | grep `cat /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid`
root 141051 0.0 0.0 6332 2176 pts/22 S+ 12:17 0:00 grep 2958663
root@drogon:/mnt/backup# rm /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid
@yarikoptic The script is now running using a read-only mount.
coolio, and the davfs2 package was updated. Let's see where we get! ATM it looks healthy-ish, albeit slow:
dandi 2630713 0.0 0.0 9244 2584 ? Ss Aug07 0:00 SCREEN
dandi 2630714 0.0 0.0 11676 5504 pts/1 Ss Aug07 0:00 /bin/bash
dandi 145300 0.0 0.0 2580 1536 pts/1 S+ 12:34 0:00 /bin/sh tools/run_loop.sh
dandi 145302 0.0 0.0 15872 10752 pts/1 S+ 12:34 0:01 /usr/bin/perl /usr/bin/chronic ./run.sh --mode random-outdated-asset-first
dandi 145303 0.0 0.0 6932 3200 pts/1 S+ 12:34 0:00 /bin/bash ./run.sh --mode random-outdated-asset-first
dandi 145557 2.4 0.5 1454332 342136 pts/1 Sl+ 12:34 2:36 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/bin/dandisets-healthstatus check -m /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse -J 10 --mode random-outdated-asset-first
dandi 211730 0.0 0.1 413520 90132 pts/1 Sl+ 13:39 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000051/draft/pons8-yo_16xdownsampled.nwb
dandi 211732 0.0 0.1 413520 90208 pts/1 Sl+ 13:39 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000056/draft/sub-Mouse24/sub-Mouse24_ses-Mouse24-131216_behavior+ecephys.nwb
dandi 211734 0.0 0.1 413520 90080 pts/1 Sl+ 13:39 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000054/draft/sub-R6/sub-R6_ses-20200209T210000_obj-1ouyda4_behavior+ophys.nwb
dandi 211739 0.0 0.1 413524 90092 pts/1 Sl+ 13:39 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000055/draft/sub-09/sub-09_ses-5_behavior+ecephys.nwb
dandi 211747 0.0 0.1 413516 89644 pts/1 Sl+ 13:39 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000053/draft/sub-Ella/sub-Ella_ses-20190402_behavior.nwb
dandi 211756 0.0 0.1 412924 89376 pts/1 Sl+ 13:39 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000059/draft/sub-MS22/sub-MS22_ses-Peter-MS22-180712-102504-concat_desc-raw_ecephys.nwb
dandi 212244 0.0 0.1 412952 89376 pts/1 Sl+ 13:40 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000061/draft/sub-Rat10/sub-Rat10_ses-Rat10-20140704_ecephys+image.nwb
dandi 212246 0.0 0.1 412872 89668 pts/1 Sl+ 13:40 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000060/draft/sub-359855/sub-359855_ses-20161221_behavior+ecephys+ogen.nwb
dandi 212251 0.0 0.1 412920 89376 pts/1 Sl+ 13:40 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000064/draft/sub-001/sub-001.nwb
dandi 215474 0.1 0.1 412996 89376 pts/1 Sl+ 13:47 0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000065/draft/sub-Kibbles/sub-Kibbles_behavior+ecephys.nwb
davfs2 145565 19.0 0.0 54176 10548 ? Ss 12:34 20:32 /sbin/mount.davfs https://webdav.dandiarchive.org /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse -o ro
"slow" as above procetest processes seems to be already half an hour old, and it is only the mount.davfs
which I see CPU busy (at 10-20%). Oddly those test processes are in S not in D mode.
the davfs2 stalled already (only after a few hours) again... it is not yet 100% stalled, but it just became super slow even for a df or ls call
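In case it helps next time it wedges, a few things worth capturing while it is stalled (a sketch; <pid> stands for one of the stuck processes, and /var/cache/davfs2 is davfs2's default system cache location):

ps auxw | awk '$8 ~ /^D/'     # list processes in uninterruptible sleep
cat /proc/<pid>/stack         # kernel stack of a stuck process (as root)
grep davfs /proc/mounts       # confirm the mount is still registered
ls -la /var/cache/davfs2/     # see whether the cache is still being written to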
looking at posts like https://savannah.nongnu.org/support/?110422 suggests that there is actually NO sparse caching, and a full download of the file is expected! Did you check what happens to the file while benchmarking davfs2 a while back, @jwodder?
@yarikoptic No.
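One way to check it directly (a sketch; the cache path is davfs2's default and the test file is just one from the process listing above): read a tiny range from a large file on the mount and watch whether the davfs2 cache grows by roughly the whole file size.

du -sh /var/cache/davfs2    # cache size before
dd if=/mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000051/draft/pons8-yo_16xdownsampled.nwb of=/dev/null bs=4k count=1    # read only the first 4 KiB
du -sh /var/cache/davfs2    # after: growth of ~the full file size means davfs2 fetched the whole file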
then davfs2 is likely not an acceptable solution for us. Do you have ideas/recommendations on how we should proceed?
@yarikoptic My only other idea was to just download the files directly before operating on them, but you rejected that in discussion with Einar.
FWIW: I filed https://savannah.nongnu.org/support/index.php?111110 for now.
Also found this in the davfs2 TODO: https://github.com/thehyve/davfs2/blob/main/TODO#L35 (also at https://cvs.savannah.nongnu.org/viewvc/davfs2/davfs2/TODO?view=markup#l35):
- ranged requests for GET (partial download)
(edit: interestingly, most of it, including the partial download item, was removed in 150ce86f45a7cd67235f748a1d3511b3f357cd0a (tag: rel-1-5-0).) So maybe they just gave up on that TODO item, since I do not see any handling of Range in the code besides:
❯ git grep -i '\<range\>'
src/webdav.c: case 416: /* Requested Range Not Satisfiable */
Full downloads are pretty much prohibitive -- we are in effect observing the consequences with this davfs2 mount, which was spending (wasting) most of its time downloading instead of doing quick sparse downloads of only the necessary blocks.
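For comparison, this is what the missing behaviour looks like at the HTTP level: a ranged GET that fetches only the bytes that are actually needed (a sketch; the exact WebDAV URL layout is an assumption based on the mount paths above):

curl -s -o first4k.bin -H "Range: bytes=0-4095" https://webdav.dandiarchive.org/dandisets/000051/draft/pons8-yo_16xdownsampled.nwb
# a server that honours Range replies with "206 Partial Content" and sends only those 4 KiB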
I still think that we would be better off with a webdav-based FUSE solution rather than reverting back to the fsspec-based datalad-fuse (maybe with completely disabled caching to avoid the multithreading fiascos)... But it might also be worth checking whether there were any related changes in fsspec since then.
I will also think about this over the weekend. I would appreciate it if you also look into alternatives etc.
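If we do look at fsspec again, a quick way to see how far it has moved since the earlier datalad-fuse attempt (a sketch; the venv path is taken from the ps listing above):

/home/dandi/cronlib/dandisets-healthstatus/venv/bin/pip show fsspec    # version currently installed in the healthstatus venv
pip index versions fsspec    # releases available on PyPI (needs pip >= 21.2)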
I am not sure if this is the same issue as #80. I see some zombies, but the problem seems to be a process stuck in D state (so not killable at all?) which had been running for 2 weeks now. We also have a bunch of df processes stuck, and overall it all seems to be due to a stuck davfs mount. @jwodder, I see that we mount it with rw -- is there a reason? I think ro would be sufficient for our use case, wouldn't it? if so -- what should be adjusted - if you could, please do that.

meanwhile I killed that davfs2 process since the whole thing was stuck. So the run of healthstatus might report more errors etc. The script is now sleeping for its 600 seconds before the next round, when it would mount it again, I guess.