Replace datalad-fuse with davfs2

jwodder commented 3 months ago

Closes #76.

After this is merged, ~dandi/cronlib/dandisets-healthstatus on drogon will have to be updated to the new code (manually?).

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 33.33333% with 4 lines in your changes missing coverage. Please review.

Project coverage is 60.62%. Comparing base (c51d691) to head (2edf8e0). Report is 2 commits behind head on main.

Files	Patch %	Lines
code/src/healthstatus/mounts.py	33.33%	4 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #77 +/- ## ========================================== - Coverage 60.71% 60.62% -0.10% ========================================== Files 10 10 Lines 840 838 -2 Branches 193 192 -1 ========================================== - Hits 510 508 -2 Misses 310 310 Partials 20 20 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jwodder commented 3 months ago

@yarikoptic

ask_auth needs to be set to 0 in /etc/davfs2/davfs2.conf
It appears that dandi@drogon does not have permission to run the appropriate mount & umount commands via passwordless sudo. At the very least, sudo -ln complains that a password is required.
- Based on this comment, you set up sudo for jwodder, but that's not the user we're running dandisets-healthstatus under.

yarikoptic commented 3 months ago

done

done

dandi ALL=(ALL:ALL) NOPASSWD: /usr/bin/umount /tmp/dandisets-fuse
dandi ALL=(ALL:ALL) NOPASSWD: /usr/bin/mount -t davfs https\://webdav.dandiarchive.org /tmp/dandisets-fuse
dandi ALL=(ALL:ALL) NOPASSWD: /usr/bin/umount /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse
dandi ALL=(ALL:ALL) NOPASSWD: /usr/bin/mount -t davfs https\://webdav.dandiarchive.org /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse

jwodder commented 3 months ago

@yarikoptic The setup should be complete now. Should I just run it in one of my screen sessions or leave that up to you?

yarikoptic commented 3 months ago

please run under your screen so to check if all is good etc.

jwodder commented 3 months ago

@yarikoptic Problem: The code previously recorded the hash of the current commit of each Dandiset repository, but with davfs2, we don't have access to that information.

yarikoptic commented 3 months ago

that's interesting! we need some kind of a notion of a version, e.g. last-modified-date. Could you query from dandi-archive API per each dandiset?

we are going through drafts only avoiding releases, right?

yarikoptic commented 3 months ago

hold on -- you expose that in webdav:

does davfs2 expose it as mtime for the draft/ folder ? if so -- take that and be done?

jwodder commented 3 months ago

@yarikoptic The script seems to be running fine now, but now just traversing Dandiset file hierarchies seems slower than before.

yarikoptic commented 3 months ago

how much slower? my "diagnostic" is the fact that there is barely any python or MATLAB cpu busy tasks :-/

yarikoptic commented 3 months ago

it might relate to

https://github.com/dandi/dandidav/issues/118 as when I tail -f the current log file on drogon:/mnt/backup/dandi/heroku-logs/dandidav it "pauses" at heroku[web.1]: Error R14 (Memory quota exceeded) although it could be red-herring.

jwodder commented 3 months ago

@yarikoptic I had to restart the script in order to apply a fix. Looking at the script's output now, just the listing of the dandisets/ directory took 30 seconds to get from 000003 to 000026, at which point it got stuck for three minutes, then four more entries were read in 6 seconds, and it now seems to be alternating between stalling for several minutes and retrieving small numbers of entries. I do not have any timings for datalad-fuse to compare against.

Before you propose any explanations, let me point out that requests to dandidav to list the entries of dandisets/ are responded to with the complete directory listing, so this is presumably davfs2's fault for not caching appropriately. It seems v1.7.0 of davfs2 added caching, but it ended up making things slower.

jwodder commented 2 months ago

@yarikoptic I'm almost done rewriting the code to traverse Dandisets via the Archive API instead of mounted-WebDAV (as discussed in the meeting last week). However, there are issues with updating the test-files subcommand (for running a test on a local file and optionally saving the test results to the Dandiset's status.yaml). Do we want to keep this subcommand? What about keeping its ability to save to status.yaml? If yes to both, should the command assume that the given file paths are always in a WebDAV mount? If not, how should the Dandiset ID and asset path be determined?

yarikoptic commented 2 months ago

Do we have dandiset.yaml in the top of the dandiset? If yes -- just traversal up until hitting it and that would give you ID and top path to deduce asset path. Cache results of such helper function per each folder value so we do not re-traverse without need.

Can't check now since I believe davfs2 is simply hanging and thus two hanging jobs I have mentioned last week:

dandi    4133314  0.0  0.0      0     0 ?        D    Jun29   0:02 [python]
dandi    4188696  0.0  0.0 422664  4408 ?        D    Jun29   0:10 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000045/draft/sub-01be78e7-8741-4b40-bd64-79ed745431b5/sub-01be78e7-8741-4b40-bd64-79ed745431b5_ses-f6f55a42-6105-4cc9-9c8c-3d3264828b8b_behavior+image.nwb
dandi    4188715  0.0  0.0 414872  4420 ?        D    Jun29   0:02 /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000059/draft/sub-MS12/sub-MS12_ses-Peter-MS12-170719-095305-concat_desc-raw_ecephys.nwb
dandi    4189021  0.0  0.0      0     0 ?        Zl   Jun29   0:02 [python] <defunct>
dandi    4189095  0.0  0.0      0     0 ?        Zl   Jun29   0:02 [python] <defunct>

those should be killed, davfs2 restarted.

jwodder commented 2 months ago

@yarikoptic

Do we have dandiset.yaml in the top of the dandiset?

If test-files is run on a clone, copy, or WebDAV-mount of a Dandiset, yes. However, if it's run on a davfs2 mount, experience suggests that traversing upwards looking for a dandiset.yaml will be slow.

Moreover, if updating status.yaml, currently test-files needs to traverse the whole of the Dandiset it's operating on in order to get all asset paths in order to prune assets from status.yaml that have since been deleted. The options for handling this are:

Traverse the filesystem normally — will be slow if using a davfs2 mount
Get the list of assets from the Archive API — If the local Dandiset is not up to date with the archive (which can happen when not using a WebDAV mount), this can lead to problems, such as the very file that was tested being deleted from status.yaml.
Add a CLI option for selecting one of the above two behaviors
Remove test-files's behavior of pruning deleted assets

Which one should I go with?

those should be killed, davfs2 restarted.

I ran kill -9 on the pynwb_open_load_ns.py last week, but they're still stuck. I don't know what those zombie processes are. I sent you a message in Slack asking you to use sudo to kill the davsf2 process. I tried unmounting the davfs2 mount when I started writing this comment, but the process has just been hanging since.

yarikoptic commented 2 months ago

However, if it's run on a davfs2 mount, experience suggests that traversing upwards looking for a dandiset.yaml will be slow.

we do not even need to list the directory, just rely on os.path.lexists or alike while lru caching the call so that for a folder we never do it twice through runtime of that process. Moreover could be optimized that if any of the parents already known to have dandiset.yaml - no need to ask for any child folder. This way it would literally be once per dandiset traversal up.

jwodder commented 2 months ago

@yarikoptic Please address the rest of my comment as well.

yarikoptic commented 2 months ago

Re test-files -- couldn't we just completely disable dealing with the files which are not given to the test-files invocation, i.e. pruning them etc? Pruning deleted paths in general should probably be the job for "dandiset-wide" operation (even if we randomly select only some to go through).

jwodder commented 2 months ago

@yarikoptic Yes, that is option number 4.

yarikoptic commented 2 months ago

Cool, let's go with it

jwodder commented 2 months ago

@yarikoptic Another problem: When running test-files on a Dandiset, what should be done about updating the draft_modified timestamp in status.yaml? Should it be updated and, if so, to what value/how should that value be obtained? What if status.yaml does not exist yet or has not yet been updated to use draft_modified?

yarikoptic commented 2 months ago

For test-files -- i.e. running test on specific files only -- I do not think we should update draft_modified . It is mostly an interface to troubleshoot errors and timeouts while also "not wasting" doing so and possibly updating those files records. As such, can avoid concerning itself with any dandiset wide (removal of gone entries, draft_modifed, etc) operations.

dandi / dandisets-healthstatus

Replace datalad-fuse with davfs2 #77

Codecov Report