dandi / dandisets-healthstatus

Healthchecks of dandisets and support libraries (pynwb and matnwb)

Implement MatNWB test #3

Closed · jwodder closed this 1 year ago

jwodder commented 1 year ago

Closes #2.

yarikoptic commented 1 year ago

I assume you have tried it -- does it pass on any dandiset files ATM?

jwodder commented 1 year ago

@yarikoptic I have not tried it, as I was waiting to complete a run of the pynwb tests first.

yarikoptic commented 1 year ago

ok, let's then just merge and fix up what is needed

jwodder commented 1 year ago

@yarikoptic After adding the MatNWB test, the script has been running for over 20 hours and is still on the first Dandiset (000003, which previously took only a couple of minutes).

yarikoptic commented 1 year ago

heh, that is indeed non-scalable. But it is unlikely to be due to locking -- how long does it take for a single pynwb and matnwb test to run on some sample file in that dandiset, with full content available locally or via fuse? I want to assess the impact of each factor (library, fuse) to see what needs to be improved.
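
(For concreteness, such a per-file measurement could look something like the following; the pynwb timing script and all paths here are placeholders, not commands from this thread.)

```shell
# time a single pynwb read of one asset with the content fully present locally
/usr/bin/time python pynwb_open_load.py dandisets/000003/sub-X/file.nwb

# time the matching matnwb read (savedir keeps generated class files out of the dataset)
/usr/bin/time matlab -nodesktop -batch \
    "tic; nwb = nwbRead('dandisets/000003/sub-X/file.nwb', 'savedir', '/tmp/'); toc"

# then repeat both against the datalad-fuse mount of the same file,
# to separate the library cost from the fuse cost
```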

jwodder commented 1 year ago

@yarikoptic Do you want me to perform those measurements while the healthcheck script is running and Drogon's load is above 60, or should I kill the script first?

yarikoptic commented 1 year ago

yeah, good point -- I killed /mnt/backup/dandi/dandisets/tools/backups2datalad-populate-zarrs-cron (I think its processes did complete), and now only the populate on regular assets is running -- you might want to wait for it or not. Indeed, better to kill the currently running healthcheck, since we won't live long enough to see it finish.

FWIW, I also used py-spy to record a sample period within that datalad fuse-fs, and both times it looked like the following (and those were seconds!!):

[flamegraph image: fuse-flame2]
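
(Something along these lines reproduces such a recording; the duration and output name are illustrative, and 3433101 is the fuse process whose file descriptors are examined below.)

```shell
# sample the running datalad fuse process and write a flamegraph SVG
py-spy record --pid 3433101 --duration 30 --output fuse-flame.svg
```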

which makes it odd that every one of those steps takes so long, e.g. release, and even read, given that the file (whatever it was) had been accessed before (during the pynwb run) and should thus be cached locally. Could the number of files in the local cache / opened by the process (>100) have an effect on that?

FWIW, one note is that most of the cache files are opened twice by the fuse process:

```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/000108$ ls -l /proc/3433101/fd | awk '/0003.*cache/{print $11}'  | sort | uniq -c | sort -n | tail
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/c70bb5cb067563c34bc64f9fb30ac1e7539261f5fa63ebfcccdd6caeed0b6b6e
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cd6360468b592ab3e0ad3decda0ead2f86d95256959f8d5c254e97df5ec4ffa4
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/d9c1bf061935ba6755dd6022ce75ed4d9439870643d3b2de35e014f31957149d
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/df19691e335b9eb36a8a08c80c45660534735da087819cba85ec431a47b75a2b
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/e59d4b30a58794e5e78fa62b727003b05da5030a49eb313904dd72b6d4848ddb
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/ed9a71ec7b699e115aa6edcb876fbdca4a7f743ae9be5d7c64e0d69b1c196e12
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/f05497f2f9a0336caed9fdc44aa77ddaa72ba204fa56d8778dfe5aa41a27aceb
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/f3a675d01534a76ac95a69070f0cf16cae144fca478a6ff63cbb69fd95227b57
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/f6cd7c1c35eb3e7fed5342373d3e9852389b51a64e78ce1adf20970098461b39
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/fe5d2fbc846ac6ea56d052905cb6851c9c176e44ca829ea9c2d73a960d835283
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/000108$ ls -l /proc/3433101/fd | awk '/0003.*cache/{print $11}'  | sort | uniq -c | sort -n | head
      1 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/6ba1b2e907320757a6c11f0053e5b8de7ba6ef121ecb6f5ef3377cf78aa695fe
      1 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/74c5dcd30d9268279a91d99b6c59203fe1ae9b54cdb5895d5f43cfa636971a31
      1 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/ec8a3fd2ea860860c9624310cda941c538436d73cfe6ec4b7d8c48fbe0f416a8
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/0152ee14905402995c9f82ed31dbc134e5a4f15ce2930172817232ac521d4a9a
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/05f594410a7327f244e7653ea4f6b398c0b51cacdb9a5f74c53ed75b4cd3811c
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/0a1fd7f90f55831317cdbdb3ec36de43791888d7ed8a0c7017a8e613e19cb7fb
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/0abd98522a8c2b281cdb993c823ab46cbeab7ade4c9957c0a72271f19a3eda65
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/0e721289b622d0a9d39aad55d7812b7ea8178096e0f1afa5516cf6b48caf7f6f
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/119ac0b0d1c1057a6ebef6a36f9d5af0207a9caa305fac25c60d02479b0f4c9c
      2 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/14162645a31ccefe2ad1515b8781470e1a0868c0fd92483fd055247868ec8379
```
jwodder commented 1 year ago

@yarikoptic Running the tests directly on a download of sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb in Dandiset 000003 (size: 11,694,713,998 bytes, chosen simply because it was the first asset) took 3.284s for the PyNWB test and 51.135s for the MatNWB test. Running the test again immediately afterwards took 2.567s for PyNWB and 16.746s for MatNWB.

yarikoptic commented 1 year ago

ok, and what about from fuse-mounted location for that file?

edit: so -- the MatNWB test is inherently >10 times slower than the Python one. If it took 3 days for the Python run to get through all dandisets, it would take a month for MATLAB :-/

yarikoptic commented 1 year ago

btw, where do you have that sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb on drogon?

jwodder commented 1 year ago

@yarikoptic Running the tests on that file via FUSE took 3.521s for PyNWB and 42.548s for MatNWB. Running the tests on a different FUSE-mounted file that hadn't been downloaded to the dataset (size: 15,090,018,364 bytes) took 6.229s for PyNWB and ... the MatNWB test is still running after more than ten minutes.

The file is in the Dandisets clone, at /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb, fetched with datalad get.
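
(For reference, that fetch amounts to something like:)

```shell
cd /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003
datalad get sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb
```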

jwodder commented 1 year ago

@yarikoptic Final time for the second MatNWB test was 19m25.628s.

yarikoptic commented 1 year ago

> @yarikoptic Final time for the second MatNWB test was 19m25.628s.

and upon a rerun, which would not need to fetch the data -- same-ish, under-a-minute timing?

overall suggesting that

lawrence-mbf commented 1 year ago

FYI, I'm running into issues installing h5py but after I figure that out I will do an A2A read on that file using pynwb for comparison. So far with MatNWB I'm getting ~4-5 second reads from MATLAB startup (so no JIT optimizations). Without A2A I wouldn't read too deeply into this, however.

yarikoptic commented 1 year ago

ignorant me: what is A2A? ;-) Let us know if you need any help with the h5py/pynwb installation. The easiest should be a conda distribution. If you have any python3, we could give you a one-liner using https://github.com/datalad/datalad-installer/blob/master/src/datalad_installer.py which must ( ;-) ) work!

lawrence-mbf commented 1 year ago

A2A being an apples-to-apples comparison. It's possible that my workstation is simply absurdly fast and pynwb will run in 0.3 seconds, so it wouldn't be useful to compare with the other times posted in this thread.

lawrence-mbf commented 1 year ago

So yeah, it takes 0.11 seconds for the equivalent Python script to run. That fits with the ~10x slowdown observed here.

I believe the major difference is that MatNWB does actually write a lot more to disk since it's generating the class files (generateSpec is >50% of the processing time). Running with the ignorecache flag lowers my processing time from 4-5 seconds to ~2 seconds.

If file writing is removed, the slowest part of MatNWB is the type validation which requires occasionally sampling from the file.

jwodder commented 1 year ago

> @yarikoptic Final time for the second MatNWB test was 19m25.628s.
>
> and upon a rerun, which would not need to fetch the data -- same-ish, under-a-minute timing?

No. I ran the tests twice again, and the MatNWB times were 11m23.130s and 10m44.171s.

yarikoptic commented 1 year ago

> If file writing is removed,

How can we do that?

lawrence-mbf commented 1 year ago

@yarikoptic

> If file writing is removed,
>
> How can we do that?

If you use the ignorecache argument (i.e. nwb = nwbRead('file.nwb', 'ignorecache')), MatNWB will not generate class files from the embedded spec in the NWB file. Of course this will render the file virtually unreadable if the file contains extensions of any sort or a version of the schema that's incompatible with the currently generated one.
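
(A minimal contrast of the two modes, in the same batch-invocation style used for the timings below; the file path is illustrative.)

```shell
# default: generate class files from the embedded spec, writing them under savedir
matlab -nodesktop -batch "nwb = nwbRead('file.nwb', 'savedir', '/tmp/');"

# 'ignorecache': skip class-file generation entirely, so nothing is written to disk
matlab -nodesktop -batch "nwb = nwbRead('file.nwb', 'ignorecache');"
```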

yarikoptic commented 1 year ago

> Of course this will render the file virtually unreadable if the file contains extensions of any sort or a version of the schema that's incompatible with the currently generated one.

@jwodder -- let's keep the current "writing" version but use /mnt/fast/dandi/dandisets-healthstatus as the location where that writing happens -- maybe the main storage spindles are indeed slowing it down a little.

> It's possible that my workstation is simply absurdly fast and pynwb will run in 0.3 seconds so it wouldn't be useful comparing with the other times posted in this thread

> So yeah, it takes 0.11 seconds for the equivalent python script to run.

@lawrence-mbf you were timing on sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb in Dandiset 000003 (size: 11,694,713,998 bytes) correct? what was your script that it was that fast? ;-)

a slightly modified script to print more timing:

```shell
(pynwb-recent) dandi@drogon:/mnt/backup/dandi/dandisets-healthstatus/dandisets/000003$ cat /tmp//pynwb_open_load_ns.py
import sys
import pynwb
from time import time

t0 = time()
with pynwb.NWBHDF5IO(sys.argv[1], "r", load_namespaces=True) as reader:
    nwbfile = reader.read()
    print(f"read time {time()-t0}")
    assert repr(nwbfile)
    assert str(nwbfile)
print(f"full time {time()-t0}")
```

showed

So I timed the MATLAB loading cycle:

```shell
(dev3) yoh@typhon:/tmp$ MATLABPATH=/home/yoh/proj/dandi/matnwb /usr/bin/time matlab -nodesktop -batch "tic; nwb = nwbRead('/home/yoh/proj/dandi/dandisets/000003/sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb', 'savedir', '/tmp/'); toc"
Elapsed time is 4.351224 seconds.
10.36user 1.11system 0:10.14elapsed 113%CPU (0avgtext+0avgdata 1038872maxresident)k
0inputs+1392outputs (9major+226900minor)pagefaults 0swaps

(dev3) yoh@typhon:/tmp$ MATLABPATH=/home/yoh/proj/dandi/matnwb /usr/bin/time matlab -nodesktop -batch "tic; nwb = nwbRead('/home/yoh/proj/dandi/dandisets/000003/sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb', 'ignorecache'); toc"
Elapsed time is 1.980811 seconds.
6.67user 1.21system 0:07.80elapsed 100%CPU (0avgtext+0avgdata 928380maxresident)k
0inputs+112outputs (14major+180020minor)pagefaults 0swaps
```

which confirms your timings! And on drogon it is indeed quite a bit slower:

```shell
(pynwb-recent) dandi@drogon:~/proj/dandi$ MATLABPATH=/home/dandi/proj/dandi/matnwb /usr/bin/time matlab -nodesktop -batch "tic; nwb = nwbRead('/mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb', 'savedir', '/tmp/'); toc"
Elapsed time is 14.932822 seconds.
16.91user 0.88system 0:22.98elapsed 77%CPU (0avgtext+0avgdata 951240maxresident)k
```

cold, with warm runs then getting faster, but still twice as slow as on typhon:

```shell
(pynwb-recent) dandi@drogon:~/proj/dandi$ MATLABPATH=/home/dandi/proj/dandi/matnwb /usr/bin/time matlab -nodesktop -batch "tic; nwb = nwbRead('/mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb', 'savedir', '/tmp/'); toc"
Elapsed time is 8.116742 seconds.
16.68user 0.82system 0:16.02elapsed 109%CPU (0avgtext+0avgdata 940284maxresident)k
0inputs+1392outputs (8major+166313minor)pagefaults 0swaps
```

with ignorecache it takes half of that time:

```shell
(pynwb-recent) dandi@drogon:~/proj/dandi$ MATLABPATH=/home/dandi/proj/dandi/matnwb /usr/bin/time matlab -nodesktop -batch "tic; nwb = nwbRead('/mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb', 'ignorecache'); toc"
Elapsed time is 3.457994 seconds.
10.20user 0.74system 0:10.90elapsed 100%CPU (0avgtext+0avgdata 879004maxresident)k
```
jwodder commented 1 year ago

@yarikoptic

> @jwodder -- let's keep the current "writing" version but use /mnt/fast/dandi/dandisets-healthstatus as the location where that writing happens -- maybe the main storage spindles are indeed slowing it down a little.

Do you just mean that the savedir for the MatNWB tests should be changed to that folder instead of using a temporary directory?
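
(If so, the change would presumably amount to something like the following; the subdirectory name is hypothetical.)

```shell
# create a scratch area on the fast mount for MatNWB's generated class files
mkdir -p /mnt/fast/dandi/dandisets-healthstatus/matnwb-savedir
matlab -nodesktop -batch \
    "nwb = nwbRead('file.nwb', 'savedir', '/mnt/fast/dandi/dandisets-healthstatus/matnwb-savedir');"
```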

lawrence-mbf commented 1 year ago

@yarikoptic

> you were timing on sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb in Dandiset 000003 (size: 11,694,713,998 bytes) correct? what was your script that it was that fast? ;-)

Assuming you meant pynwb, it's similar to your script but with time.perf_counter instead. Not sure if that's a valid benchmark timer.
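
(A sketch of what such a perf_counter variant might look like -- a hypothetical reconstruction, not lawrence-mbf's actual script:)

```shell
$ cat pynwb_open_load_perf.py
import sys
from time import perf_counter

import pynwb

# measure only the read itself, using the monotonic high-resolution timer
t0 = perf_counter()
with pynwb.NWBHDF5IO(sys.argv[1], "r", load_namespaces=True) as reader:
    nwbfile = reader.read()
print(f"read time {perf_counter() - t0}")
```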