Add benchmarking of various mounting strategies

jwodder commented 8 months ago

Closes #66.

To do:

- [x] Implement mounting & unmounting with webdavfs
- [x] Implement mounting & unmounting with davfs2
- [x] Implement timing of the following tests:
    - [x] pynwb_open_load_ns
    - [x] matnwb_nwbRead
    - [x] DANDI_CACHE=ignore dandi ls (to load metadata) on a single local asset
- [x] Add a subcommand for doing all the benchmarking at once (mounting each mount and running & timing of tests)
    - [x] Emit a summary of timing results
    - [x] Add an option for only using specified mount types
    - [x] Add an option for whether to update the local clone of the dandisets repo for fusefs
- [x] Add a subcommand that just runs & times the tests against a path given on the command line
- [x] When mounting with fusefs, clone the dandisets repository first if it isn't present at the dataset path.
- PROBLEM: webdavfs doesn't support redirects: https://github.com/miquels/webdavfs/issues/30
    - [x] Comment out the webdavfs support for now
- [x] Document how to set up davfs2 for use by dandisets-healthstatus
- [x] Actually run benchmarks
    - Run on smaug
    - The assets to test should be one (or more?) sample assets of some "typical" size (a few GBs). sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb in 000016 is suggested as a possible candidate.
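
The "mount each mount, then run and time the tests" loop from the to-do list can be sketched roughly as follows. This is an illustrative outline only, not the PR's code: `run_benchmarks`, `fake_mount`, and the shape of the result tuples are hypothetical names chosen for the example, and a real mount would invoke the actual mount and unmount commands.

```python
import time
from contextlib import contextmanager


def run_benchmarks(mounts, tests):
    """Mount each mount type in turn and time every test against it.

    `mounts` maps a mount name to a context-manager factory that yields the
    mount point; `tests` maps a test name to a callable taking that path.
    Both shapes are assumptions made for this sketch.
    """
    results = []
    for mount_name, mount in mounts.items():
        with mount() as root:  # mounted on entry, unmounted on exit
            for test_name, test in tests.items():
                start = time.perf_counter()
                test(root)
                elapsed = time.perf_counter() - start
                results.append((mount_name, test_name, elapsed))
    return results


@contextmanager
def fake_mount():
    # Hypothetical stand-in: a real implementation would mount here
    # and unmount in a `finally` clause.
    yield "/tmp/dandisets-fuse"


if __name__ == "__main__":
    summary = run_benchmarks({"fake": fake_mount}, {"noop": lambda root: None})
    for mount_name, test_name, elapsed in summary:
        print(f"{mount_name}\t{test_name}\t{elapsed:.6f}")
```

Keeping the mount as a context manager means the unmount still happens if a test raises.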

codecov[bot] commented 8 months ago

Codecov Report

Attention: Patch coverage is 64.25703% with 89 lines in your changes missing coverage. Please review.

Project coverage is 60.71%. Comparing base (d71a25d) to head (e331c4c). Report is 78 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| code/src/healthstatus/mounts.py | 57.27% | 46 Missing and 1 partial :warning: |
| code/src/healthstatus/tests.py | 69.35% | 19 Missing :warning: |
| code/src/healthstatus/__main__.py | 61.53% | 10 Missing :warning: |
| code/src/healthstatus/checker.py | 64.00% | 8 Missing and 1 partial :warning: |
| code/src/healthstatus/core.py | 87.50% | 3 Missing :warning: |
| code/src/healthstatus/util.py | 0.00% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #67      +/-   ##
==========================================
+ Coverage   60.43%   60.71%   +0.27%
==========================================
  Files           9       10       +1
  Lines         685      840     +155
  Branches      169      193      +24
==========================================
+ Hits          414      510      +96
- Misses        251      310      +59
  Partials       20       20
```


satra commented 8 months ago

since s3 can have different latencies at different times of the day, let's also make sure we have some estimate of s3 latency during each benchmark if these benchmarks take some time to run. if they run in mins, i would be less worried about latencies, and in such a scenario we should just get multiple estimates to create some error bars.
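
One way to get the "multiple estimates" and error bars mentioned here is to repeat each timed operation and report the mean and sample standard deviation. The sketch below is an assumption about how that could look, not part of the PR; `time_repeated` and `summarize` are hypothetical helper names.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Call `func` several times and return the individual wall-clock timings."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (mean, sample standard deviation) for a list of timings."""
    mean = statistics.mean(timings)
    # stdev needs at least two samples; report 0.0 for a single run
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return mean, stdev


if __name__ == "__main__":
    mean, stdev = summarize(time_repeated(lambda: sum(range(100_000))))
    print(f"{mean:.6f} s +/- {stdev:.6f} s")
```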

jwodder commented 8 months ago

@satra

make sure we have some estimate of s3 latency

How?

satra commented 8 months ago

this is old, but something like this: https://github.com/dvassallo/s3-benchmark

jwodder commented 8 months ago

@satra This seems like something that should be done separately from dandisets-healthstatus. Trying to integrate it into this PR doesn't seem sensible.

jwodder commented 8 months ago

@yarikoptic Problem: dandisets-healthstatus requires Pydantic 2.0, yet this PR adds an extra dependency on dandi (and dandidav, which requires dandi), which still requires Pydantic 1.x.

yarikoptic commented 8 months ago

since it is all in motion, I think it would be ok to point to that branch you have for dandi-cli with pydantic 2.0 compat

yarikoptic commented 6 months ago

Please provide results of running such benchmarking across possible solutions e.g. on typhon. (should be less busy ATM)

yarikoptic commented 6 months ago

scratch that about typhon, I forgot that we rely on having dandisets around. Please do it on drogon.

jwodder commented 6 months ago

@yarikoptic You initially said here to run the benchmarks on smaug.

yarikoptic commented 6 months ago

if the benchmarks rely on a full clone of the dandisets/ hierarchy, it is probably best to just run on drogon. If you want to replicate the hierarchy, then indeed you can do it on smaug or typhon. Choose the host you deem most appropriate for this.

jwodder commented 6 months ago

@yarikoptic I need permission to sudo-run the following commands on smaug:

    /usr/bin/mount -t webdavfs -o allow_other https://webdav.dandiarchive.org /tmp/dandisets-fuse
    /usr/bin/mount -t davfs https://webdav.dandiarchive.org /tmp/dandisets-fuse

Note that the colons in the URLs need to be escaped when adding them to the sudoers file.

Also, follow_redirect in /etc/davfs2/davfs2.conf needs to be set to 1.
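
To illustrate the escaping note: the sudoers manual lists `:`, `,`, `=`, and `\` as characters that must be backslash-escaped inside command arguments. A small helper along these lines could produce the escaped form of the commands above. It is a sketch under that assumption (`escape_sudoers_arg` is a hypothetical name), and it does not handle sudoers wildcards or any other syntax.

```python
def escape_sudoers_arg(arg: str) -> str:
    """Backslash-escape the characters sudoers treats specially in command arguments."""
    # Escape the backslash first so the backslashes added below are not doubled.
    for ch in ("\\", ":", ",", "="):
        arg = arg.replace(ch, "\\" + ch)
    return arg


if __name__ == "__main__":
    command = [
        "/usr/bin/mount", "-t", "davfs",
        "https://webdav.dandiarchive.org", "/tmp/dandisets-fuse",
    ]
    print(" ".join(escape_sudoers_arg(part) for part in command))
```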

yarikoptic commented 6 months ago

done

jwodder commented 6 months ago

@yarikoptic matlab needs to be installed on smaug so that I can benchmark the associated test.

jwodder commented 6 months ago

@yarikoptic Ping.

yarikoptic commented 6 months ago

done now -- the same 2022b version is installed systemwide

jwodder commented 6 months ago

@yarikoptic When I try to run a matlab test on smaug, it fails with:

    License checkout failed.
    License Manager Error -1
    The license file cannot be found.

    Troubleshoot this issue by visiting: 
    https://www.mathworks.com/support/lme/R2022b/1

    Diagnostic Information:
    Feature: MATLAB 
    License path: /home/jwodder/.matlab/R2022b_licenses:/usr/local/MATLAB/R2022b/licenses/license.dat:/usr/local/MATLAB/R2022b/licenses 
    Licensing error: -1,359. System Error: 2

Note that there is no /usr/local/MATLAB/R2022b/licenses folder on the server.

yarikoptic commented 6 months ago

could you please give me the full matlab invocation so I can make sure it works correctly? on smaug, do you run it under your account or some other (like datalad etc)?

jwodder commented 6 months ago

@yarikoptic

    matlab -nodesktop -batch 'nwb = nwbRead('"'"'/tmp/dandisets-fuse/000016/sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb'"'"')'

where /tmp/dandisets-fuse is a FUSE mount and there is a copy of matnwb in matnwb/ in the current directory (and the envvar MATLABPATH points to this matnwb/). The command is run under my account.
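
When the same invocation is launched from Python, passing an argument list avoids the shell's `'"'"'` quoting dance entirely. The sketch below only builds the command and environment described above; `matnwb_command` and `matnwb_env` are hypothetical helper names, and the quote-doubling follows MATLAB's rule for single-quoted character vectors.

```python
import os
import subprocess


def matnwb_command(asset_path: str) -> list[str]:
    """Build the argv for reading one NWB file with matnwb in batch mode."""
    # Inside a MATLAB single-quoted string, a literal quote is written as ''.
    quoted = asset_path.replace("'", "''")
    return ["matlab", "-nodesktop", "-batch", f"nwb = nwbRead('{quoted}')"]


def matnwb_env(matnwb_dir: str) -> dict[str, str]:
    """Copy the current environment and point MATLABPATH at the matnwb checkout."""
    return {**os.environ, "MATLABPATH": matnwb_dir}


def run_matnwb(asset_path: str, matnwb_dir: str) -> None:
    # No shell is involved, so the path needs no shell quoting at all.
    subprocess.run(matnwb_command(asset_path), env=matnwb_env(matnwb_dir), check=True)
```

For example, `run_matnwb("/tmp/dandisets-fuse/000016/sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb", "matnwb")` would correspond to the command quoted above, assuming `matlab` is on `PATH`.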

yarikoptic commented 6 months ago

the command didn't run under my account on drogon, but it worked (errored, but got past the license check) under dandi, so the license is user-specific somewhere... strace pointed to /home/dandi/.matlab/R2022b_licenses/ ... but it can't just be copied, since "Your username does not match the username in the license file." .. I started matlab's initiator script under VNC on smaug under my login but provided jwodder as the target login, then changed permissions for the license... it now works for the jwodder account (no one else, bleh)

jwodder commented 6 months ago

@yarikoptic The benchmarking is failing because the matlab test on FUSE is exceeding the 1-hour timeout. I tried increasing the timeout to 2 hours, but it exceeded that as well. Should I try increasing the timeout to something incredibly high or take another approach?

yarikoptic commented 6 months ago

how long would it run on that file if downloaded in full? if it is just generally very slow (half an hour) -- might be something to relay to matnwb.

jwodder commented 6 months ago

@yarikoptic 42 seconds

yarikoptic commented 6 months ago

hm. Any ideas on why the fuse solution takes that long? how long does it take with datalad-fuse?

jwodder commented 6 months ago

@yarikoptic I don't know why it's so slow with FUSE, and I don't know how long it would take with FUSE, as the benchmark code kills the process at the 2-hour timeout.
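
The behaviour described here, where a timed-out run yields no duration at all, falls out naturally from `subprocess.run` with a `timeout`: the child is killed and `TimeoutExpired` is raised. A minimal sketch, assuming a hypothetical helper `timed_run` rather than the PR's actual code:

```python
import subprocess
import sys
import time


def timed_run(cmd, timeout):
    """Run `cmd` and return its wall-clock duration, or None if it timed out.

    On timeout, subprocess.run kills the child process, so all that is
    known is that the command needs more than `timeout` seconds.
    """
    start = time.perf_counter()
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return time.perf_counter() - start


if __name__ == "__main__":
    quick = timed_run([sys.executable, "-c", "pass"], timeout=30)
    slow = timed_run([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5)
    print("quick:", quick, "slow:", slow)
```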

yarikoptic commented 6 months ago

please make the timeout 5 hours and run against both fuse solutions -- datalad-fuse and dandidav + davfs2

yarikoptic commented 6 months ago

ideally: profile datalad-fuse while running the test to see where it spends time.

jwodder commented 6 months ago

@yarikoptic The matnwb test on datalad-fuse exceeded the five-hour time limit as well.

How exactly should I profile it? Just use py-spy?

yarikoptic commented 6 months ago

First - py-spy would not hurt indeed.

Then I would have probably added log lines at DEBUG level within datalad-fuse to see what is actually taking time there if py-spy was not conclusive.
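
The "log lines at DEBUG level" idea could look roughly like the following in any Python code base. This is a generic sketch, not datalad-fuse code: the logger name `fuse-timing` and the `log_duration` helper are hypothetical.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, chosen only for this example
lgr = logging.getLogger("fuse-timing")


@contextmanager
def log_duration(label):
    """Log at DEBUG level how long the wrapped block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        lgr.debug("%s took %.3f s", label, time.perf_counter() - start)


if __name__ == "__main__":
    # asctime in the format gives every line a timestamp
    logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(name)s %(message)s")
    with log_duration("read block"):
        time.sleep(0.1)
```

Wrapping suspected hot spots (opens, reads, remote fetches) this way shows where time accumulates when a sampling profiler is inconclusive.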

jwodder commented 4 months ago

@yarikoptic Is there a way to get datalad's logs to include timestamps?

yarikoptic commented 4 months ago

yes, there are also a number of other possibly helpful options (available through env vars or even git config, since they are defined in common_cfg) for augmenting logging behavior:

❯ pwd
/home/yoh/proj/datalad/datalad-maint
❯ grep DATALAD_LOG CONTRIBUTING.md
- *DATALAD_LOG_LEVEL*:
- *DATALAD_LOG_NAME*:
- *DATALAD_LOG_OUTPUTS*:
- *DATALAD_LOG_PID*
- *DATALAD_LOG_TARGET*
- *DATALAD_LOG_TIMESTAMP*:
- *DATALAD_LOG_TRACEBACK*:
- *DATALAD_LOG_VMEM*:

jwodder commented 4 months ago

Disregard

@yarikoptic When I try running `datalad -l debug fusefs ...` on smaug with `DATALAD_LOG_TIMESTAMP=1` set, it crashes with:

```
Traceback (most recent call last):
  File "/bin/datalad", line 33, in <module>
    sys.exit(load_entry_point('datalad==0.19.5', 'console_scripts', 'datalad')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/bin/datalad", line 25, in importlib_load_entry_point
    return next(matches).load()
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/metadata/__init__.py", line 202, in load
    module = import_module(match.group('module'))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib/python3/dist-packages/datalad/__init__.py", line 112, in <module>
    cfg = ConfigManager()
          ^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 399, in __init__
    self.reload(force=True)
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 460, in reload
    self._stores[store_id] = self._reload(runargs)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 488, in _reload
    stdout, stderr = self._run(
                     ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 869, in _run
    out = self._runner.run(self._config_cmd + args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/runner/runner.py", line 242, in run
    raise CommandError(
datalad.runner.exception.CommandError: CommandError: 'git --git-dir=/dev/null config -z -l --show-origin' failed with exitcode 1 [err: '/usr/lib/git-annex.linux/git: 2: readlink: not found
/usr/lib/git-annex.linux/git: 6: dirname: not found
** cannot find base directory (I seem to be /usr/lib/git-annex.linux/git)']
```

I don't know why the `datalad` executable at `/bin/datalad` would be executed; when `datalad fusefs` is run, the first item in `PATH` is the `bin` directory in a virtualenv that contains `datalad`, so I would expect that to be run instead.

EDIT: I realized that I wasn't setting the environment for the datalad fusefs command correctly, so PATH et alii were being wiped out.

jwodder commented 4 months ago

@yarikoptic I believe sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb in 000016 was a poor choice of asset to test, as it has seemingly always timed out in the normal fusefs tests, and it continues to time out when testing out benchmarking. (Thus, as the timing-out is not specific to the benchmarking, if you want me to investigate it, you should file a separate issue.) Please choose another asset to test the benchmarking on, one that isn't currently marked as timing out.

yarikoptic commented 4 months ago

Let's try on sub-mouse1-fni16/sub-mouse1-fni16_ses-170808184141.nwb in the same dandiset.. if I read yaml correctly it is ok for pynwb and errors out on matnwb (but does not timeout). In general -- feel welcome to choose any asset you deem appropriate and not too "easy" (fast)

jwodder commented 4 months ago

@yarikoptic I finally got a run that didn't time out by using a 44 MB asset from Dandiset 000005. Here's the output, converted to a table:

| Mount Type | Dandiset | Asset | Test | Time (s) |
|---|---|---|---|---|
| fusefs | 000005 | sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb | pynwb_open_load_ns | 11.972222546115518 |
| fusefs | 000005 | sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb | matnwb_nwbRead | 447.08360775373876 |
| fusefs | 000005 | sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb | dandi_ls | 21.88728654384613 |
| davfs2 | 000005 | sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb | pynwb_open_load_ns | 4.709459913894534 |
| davfs2 | 000005 | sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb | matnwb_nwbRead | 18.62446365132928 |
| davfs2 | 000005 | sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb | dandi_ls | 3.7525609582662582 |
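
To make the comparison explicit, the per-test ratio of fusefs time to davfs2 time can be computed directly from the numbers in the table. The snippet below only restates those figures; the `speedups` helper is a hypothetical name, and a single run per cell gives no indication of variance.

```python
# Timings in seconds, copied from the table above (single run each)
timings = {
    "pynwb_open_load_ns": {"fusefs": 11.972222546115518, "davfs2": 4.709459913894534},
    "matnwb_nwbRead": {"fusefs": 447.08360775373876, "davfs2": 18.62446365132928},
    "dandi_ls": {"fusefs": 21.88728654384613, "davfs2": 3.7525609582662582},
}


def speedups(table):
    """Return how many times slower fusefs was than davfs2 for each test."""
    return {test: row["fusefs"] / row["davfs2"] for test, row in table.items()}


if __name__ == "__main__":
    for test, ratio in speedups(timings).items():
        print(f"{test}: fusefs took {ratio:.1f}x as long as davfs2")
```

On this one asset, the ratios come out to roughly 2.5x, 24x, and 5.8x respectively.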

yarikoptic commented 4 months ago

So davfs2 is much more promising. What are the timings for datalad-fuse for the same file? (since that is what we use ATM)

jwodder commented 4 months ago

@yarikoptic Those are the "fusefs" entries.
