Initial design/implementation

yarikoptic commented 1 year ago

on drogon have deployment of https://github.com/datalad/dandisets
on each cron run
- update full hierarchy: run datalad update -d dandisets --follow parentds --how=ff-only -J5 -r -R 1 (with -R 1 we should not get to upgrade zarrs subdatasets for now)
- install/upgrade virtualenv to most recent version of pynwb/hdmf and matnwb (TODO 4 Yarik: provide matlab on drogon@dandi)
- use datalad-fuse over https://github.com/datalad/dandisets superdataset
- in parallel or async fashion operate across dandisets and their assets to check if all .nwb files could be opened, so for each dandiset record status.yaml with following structure

last_run: DATETIME
dandiset_version: STR_HEXSHA
nassets: INT  # total number of assets
versions: DICT(name: version) # of relevant versions, e.g. {'pynwb': 1.0.0, 'hdmf': 2.0.0}
tests:
    - name: STR
      assets_ok: LIST(STR)  # which assets were ok
      assets_nok: LIST(STR)  # which assets have failed

target two tests to start with
- pynwb_open: operate on .nwb file and open using pynwb
- matnwb_nwbRead: operate on .nwb file and open using matnwb
all those .yaml files per each dandiset should be stored in this repo and be similarish to https://github.com/dandi/dandi-api-webshots
- subfolder per dandiset (we might want to get more information per each dandiset)
- in each dandiset folder have subfolder logs with {last_run}_{test.name}_errors.log with output from errored out runs. That subfolder should be git ignored for now
- README.md should have summary table with one row per dandiset with one column per test with number of ok/nok assets per that test

jwodder commented 1 year ago

@yarikoptic

on drogon have deployment of https://github.com/datalad/dandisets

Are you saying that you want to use the pre-existing deployment at /mnt/backup/dandi/dandisets/ or that you want a new clone set up somewhere?

What user should this be set up under? How often should the cronjob run?

Also, I'm not sure if using fuse is going to help much, as the NWB files will still be cached locally.

jwodder commented 1 year ago

@yarikoptic I do not appear to have write permission to this repository, and I cannot fork it because it is empty.

yarikoptic commented 1 year ago

@yarikoptic

on drogon have deployment of https://github.com/datalad/dandisets

Are you saying that you want to use the pre-existing deployment at /mnt/backup/dandi/dandisets/ or that you want a new clone set up somewhere?

new clone, e.g. /mnt/backup/dandi/dandisets-clone/

What user should this be set up under? How often should the cronjob run?

dandi user, e.g. daily

Also, I'm not sure if using fuse is going to help much, as the NWB files will still be cached locally.

through fuse we would cache only needed blocks, not entire nwb. From prior experience -- it would take only a few MBs of cache to open multi GB file. And we want to abstract away ROS3 and/or fsspec -- just have rudimentary open operations, thus -- datalad-fuse sounds like a good fit here.

@yarikoptic I do not appear to have write permission to this repository, and I cannot fork it because it is empty.

you are the boss now! ( I will rename into https://github.com/dandi/dandisets-healthstatus for consistency with older https://github.com/dandi/nwb-healthstatus )

jwodder commented 1 year ago

@yarikoptic Exactly what code should be used for opening NWB files with pynwb and matnwb? For pynwb, should it just be this?

with pywnb.NWBHDF5IO(path, "r", load_namespaces=True) as fp:
    pass

Should load_namespaces=True even be there?

jwodder commented 1 year ago

@yarikoptic Should nassets in status.yaml only count NWB assets or all assets?

yarikoptic commented 1 year ago

@yarikoptic Exactly what code should be used for opening NWB files with pynwb and matnwb? For pynwb, should it just be this?
with pywnb.NWBHDF5IO(path, "r", load_namespaces=True) as fp:
    pass
Should load_namespaces=True even be there?

For starters let's stick to minimal default set of options, as none, thus

with pywnb.NWBHDF5IO(path, "r") as reader:
     nwbfile = reader.read()
assert repr(nwbfile)
assert str(nwbfile)

which mimics our test_pynwb_io but without load_namespaces=True. But WDYT @bendichter -- should it be like this or we should start with load_namespaces=True?

jwodder commented 1 year ago

@yarikoptic

in each dandiset folder have subfolder logs with {last_run}_{test.name}_errors.log with output from errored out runs. That subfolder should be git ignored for now

Exactly what do you mean by "errored out runs"? Failed tests? Update processes that exited nonzero?

yarikoptic commented 1 year ago

@yarikoptic Should nassets in status.yaml only count NWB assets or all assets?

all. eventually I hope we would add tests for zarrs as well.

Exactly what do you mean by "errored out runs"? Failed tests? Update processes that exited nonzero?

note: each execution of the test should be a separate python or matlab run (due to all the pynwb versions schema treatments etc). So it would be for failed tests. for failed updates, after which there is no point to run tests BTW if it fails -- use {last_run}_updates_errors.log.

bendichter commented 1 year ago

I think we should test with load_namespaces=True. Otherwise, this will fail for any NWB file that contains an extension.

jwodder commented 1 year ago

@yarikoptic So you want each test to run in a separate process? Are you saying that you want the output from these processes to just be run together in a single logfile per update per Dandiset?

yarikoptic commented 1 year ago

@yarikoptic So you want each test to run in a separate process?

yes

Are you saying that you want the output from these processes to just be run together in a single logfile per update per Dandiset?

yes

yarikoptic commented 1 year ago

I think we should test with load_namespaces=True. Otherwise, this will fail for any NWB file that contains an extension.

eh, ok. @jwodder let's then start with pynwb_open_load_ns (not just plain pynwb_open) which would have load_namespaces=True ;-)

lawrence-mbf commented 1 year ago

@jwodder I can answer questions you have for MatNWB.

To answer your question for

Exactly what code should be used for opening NWB files with ... matnwb?

Using

nwb = nwbRead('filename.nwb', 'savedir', '.');

should be enough to test. savedir sets the location to save the generated namespace modules and should be cleared before reading every nwb file to cleanly separate generated class environments.

jwodder commented 1 year ago

@lawrence-mbf I assume that nwbRead is imported with from matnwb import nwbRead? What is the meaning of the third argument to that function?

lawrence-mbf commented 1 year ago

@jwodder

I assume that "nwbRead" is imported with "from matnwb import nwbRead"?

There is no such syntax in MATLAB so it will really depend on what FFI you're going to end up using here in your codebase. In MATLAB all you would have to do is have the nwbRead.m script file in your MATLAB path.

What is the meaning of the third argument to that function?

savedir is a keyword argument so it's implied to be paired with the third argument which is just indicating the current MATLAB directory.

jwodder commented 1 year ago

@yarikoptic I was under the impression that there was a Python API for matnwb. How exactly should this test be run?

lawrence-mbf commented 1 year ago

@jwodder I believe the simplest way is to run bash commands through python. Another alternative is the MATLAB engine API for Python

yarikoptic commented 1 year ago

@yarikoptic I was under the impression that there was a Python API for matnwb. How exactly should this test be run?

I thought just via external command execution like matlab -nojvm -nodesktop -noflash testcode.m

jwodder commented 1 year ago

@yarikoptic

What exactly should the contents of testcode.m be? Just @lawrence-mbf's suggestion of nwb = nwbRead('filename.nwb', 'savedir', '.'); (but with '.' replaced with a per-asset temporary directory)?
Can testcode.m be replaced with - to read from standard input?
How exactly should the cronjob update to the latest version of matnwb?

yarikoptic commented 1 year ago

What exactly should the contents of testcode.m be? Just @lawrence-mbf's suggestion of nwb = nwbRead('filename.nwb', 'savedir', '.'); (but with '.' replaced with a per-asset temporary directory)?

Can testcode.m be replaced with - to read from standard input?

How exactly should the cronjob update to the latest version of matnwb?

@lawrence-mbf could you please provide @jwodder with a sample complete usable testcode.m and the way to invoke it with matlab without requiring GUI or interaction, and the preferred way to update matnwb to the most recent released version?

lawrence-mbf commented 1 year ago

@jwodder

In fact, you don't even need to run a script per se to test reading. In bash you could just run:

matlab -batch "nwb = nwbRead('file.nwb', 'savedir' '.');"

though you could replace everything in quotes with just "testcode" if you wish to place it in a function.

matlab -batch "testcode('filename.nwb');"

and testcode.m containing:

function testcode(filename)
nwb = nwbRead(filename, 'savedir', '.');
end

MATLAB will even throw an error code if it errors out.

As for getting the most recently release version, I believe the easiest would be to query Github using this API: https://docs.github.com/en/rest/releases/releases#get-the-latest-release

And grabbing the files that way.

jwodder commented 1 year ago

@lawrence-mbf What if I want to pass the NWB path and savedir path as extra arguments on the command line rather than having to mess with interpolation into a MATLAB code string? How would I do that?

And grabbing the files that way.

And then what?

jwodder commented 1 year ago

@yarikoptic Should datalad-fuse mount the FUSE directory in a specific path? Is it safe to mount FUSE at a temporary directory under /tmp?

lawrence-mbf commented 1 year ago

@jwodder

What if I want to pass the NWB path and savedir path as extra arguments on the command line ...

Then I suppose you would have to use a config file that you should export in the same directory as the script. Something like this testcode.zip

Inside the above zip file is a testcode.m file which should be run without arguments (see below). In the same directory should be the testcode_config file which is just an ASCII file with two lines of data (separated by newlines) in order of: 1) the path to the file 2) the savedir directory.

There's an example testcode_config provided. Hopefully that works around the need to interpolate strings.

And then what?

Assuming you have MATLAB installed you would: 1) unzip the files in some directory 2) add the directory to the MATLABPATH environment variable 2) call the command line command matlab -batch "testcode" -sd <testcode-dir>.

yarikoptic commented 1 year ago

@yarikoptic Should datalad-fuse mount the FUSE directory in a specific path?

May be let's use some "project" top level directory like /mnt/backup/dandi/dandisets-healthstatus/ where we would have dandisets (the aforementioned clone of all dandistes), and dandisets-fuse mount point to be reused.

Is it safe to mount FUSE at a temporary directory under /tmp?

it might work but probably best to keep things near.

BTW -- matlab is now available on drogon

(base) dandi@drogon:~$ matlab -nojvm -nodesktop
MATLAB is selecting SOFTWARE OPENGL rendering.

                                                     < M A T L A B (R) >
                                           Copyright 1984-2022 The MathWorks, Inc.
                                      R2022b Update 1 (9.13.0.2080170) 64-bit (glnxa64)
                                                     September 28, 2022

For online documentation, see https://www.mathworks.com/support
For product information, visit www.mathworks.com.

>>

jwodder commented 1 year ago

I can't get matnwb to work. I've downloaded & unpacked the source code for the latest release in matnwb-2.5.0.0. If I try running:

export MATLABPATH="$PWD"/matnwb-2.5.0.0
matlab -nojvm -nodesktop -r 'matnwb_nwbRead sub-RAT123.nwb foo'

where matnwb_nwbRead.m contains:

function []=matnwb_nwbRead(a,b)
nwb = nwbRead(a, 'savedir', b);
end

then I get:

Unable to resolve the name 'types.core.Subject'.

Error in io.parseGroup (line 85)
    parsed = eval([Type.typename '(kwargs{:})']);

Error in io.parseGroup (line 38)
    subg = io.parseGroup(filename, group, Blacklist);

Error in io.parseGroup (line 38)
    subg = io.parseGroup(filename, group, Blacklist);

Error in nwbRead (line 59)
nwb = io.parseGroup(filename, h5info(filename), Blacklist);

Error in matnwb_nwbRead (line 2)
nwb = nwbRead(a, 'savedir', b);

If I try running the commands listed here, then at the generateCore() step I get:

Unable to resolve the name
'com.mathworks.jmi.ClassLoaderManager.getClassLoaderManager'.

Error in javaclasspath>local_javapath (line 137)
jloader = com.mathworks.jmi.ClassLoaderManager.getClassLoaderManager;

Error in javaclasspath (line 65)
      p = local_javapath('-dynamic');

Error in javaaddpath (line 71)
  javaclasspath( p, javaclasspath );

Error in spec.loadSchemaObject (line 9)
    javaaddpath(java_loc);

Error in spec.generate (line 6)
Schema = spec.loadSchemaObject();

Error in generateExtension (line 47)
    Namespaces = spec.generate(namespaceText, localpath);

Error in generateCore (line 51)
    generateExtension(commonPath, varargin{:});

jwodder commented 1 year ago

@yarikoptic

update full hierarchy: run datalad update -d dandisets --follow parentds --how=ff-only -J5 -r -R 1

That command fails with "unknown argument: -J5".

lawrence-mbf commented 1 year ago

@jwodder Please remove -nojvm, MatNWB does leverage the JVM for YAML parsing.

jwodder commented 1 year ago

@yarikoptic Exactly what command do I run to create the clone of all Dandisets without cloning any Zarrs? Note that just datalad clone ... is insufficient, as it leaves the subdatasets empty, and thus they remain empty in the FUSE mount.

yarikoptic commented 1 year ago

should be able to use datalad install -r -R 1 BUT I think we better

stick to that datalad clone which doesn't do recursive
after datalad update do datalad get -d . -r -R 1 -J5 -n to ensure that all dandisets are installed, since update would not install new subdatasets anyways IIRC -- would only update the ones already installed.

yarikoptic commented 1 year ago

update full hierarchy: run datalad update -d dandisets --follow parentds --how=ff-only -J5 -r -R 1

That command fails with "unknown argument: -J5".

sorry about that -- apparently not yet implemented :-/ https://github.com/datalad/datalad/issues/6351 so no -J for update

jwodder commented 1 year ago

@yarikoptic datalad-fuse keeps crashing with errors of the form:

Uncaught exception from FUSE operation open, returning errno.EINVAL.
Traceback (most recent call last):
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fuse.py", line 734, in _wrapper
    return func(*args, **kwargs) or 0
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fuse.py", line 834, in open
    fi.fh = self.operations('open', path.decode(self.encoding),
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/datalad_fuse/fuse_.py", line 75, in __call__
    return super(DataLadFUSE, self).__call__(op, self.root + path, *args)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fuse.py", line 1076, in __call__
    return getattr(self, op)(*args)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/datalad_fuse/fuse_.py", line 203, in open
    fsspec_file = self._adapter.open(path)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/datalad_fuse/fsspec.py", line 243, in open
    return dsap.open(relpath, mode=mode, encoding=encoding, errors=errors)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/datalad_fuse/fsspec.py", line 169, in open
    return self.fs.open(url, mode, **kwargs)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 406, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/spec.py", line 1034, in open
    f = self._open(
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 406, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 345, in _open
    self.save_cache()
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 406, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 161, in save_cache
    cached_files = pickle.load(f)
EOFError: Ran out of input

yarikoptic commented 1 year ago

please file/clarify with https://github.com/fsspec/filesystem_spec/issues , smells like a bug on their end. Is that file entirely empty or has smth? is it open by some process still writing to it etc.

jwodder commented 1 year ago

@yarikoptic Issue filed: https://github.com/fsspec/filesystem_spec/issues/1107

yarikoptic commented 1 year ago

THANKS! You wrote

I suspect the cause is that the cached filesystem is not safe for concurrent access, with the result that multiple actions on the filesystem are causing the cache file to be read by one procedure while another writes to it.

I see that we are already doing some thread locking in https://github.com/datalad/datalad-fuse/blob/HEAD/datalad_fuse/fuse_.py#L214 but I guess, as a workaround (unless fixed in fsspec quickly) we might need to add that locking in the else: part there at https://github.com/datalad/datalad-fuse/blob/HEAD/datalad_fuse/fuse_.py#L221 ?

jwodder commented 1 year ago

@yarikoptic I don't think putting a lock just there would be enough. We'd likely have to wrap a lock around every operation that goes through fsspec.

yarikoptic commented 1 year ago

ok, let's give fsspec folks a day but then we would need to try it out to progress forward. It seems that there is not that many of such code points anyways.

jwodder commented 1 year ago

@yarikoptic I just finished running the healthchecks using https://github.com/datalad/datalad-fuse/pull/83 and https://github.com/fsspec/filesystem_spec/pull/1111. The good news is that there was only one error, caused by an HTTP problem, and it didn't seem to affect anything. Also, all the tests passed (and I know the test-handling is correct because some datalad-fuse issues caused some test failures in a previous run). The bad news is that the script took over three days to run, largely because iterating over a datalad-fuse directory involves the kernel making a stat(2) call on each file, which means a getattr call, which means making an HTTP request for each file.

yarikoptic commented 1 year ago

This is indeed good news and awesome! Thank you @jwodder !

https://github.com/fsspec/filesystem_spec/pull/1111 -- lets finalize it! (docstring + take out of the draft)
https://github.com/datalad/datalad-fuse/pull/83 -- same there, left a comment
how much data was cached locally?
3 day time run is indeed "suboptimal", and it is great that you pin pointed the main slow down. Filed a dedicated https://github.com/datalad/datalad-fuse/issues/84 for it.

Let's consider this issue resolved, thanks again @jwodder, and I filed dedicated #2 and #4 for the next steps on this one.

jwodder commented 1 year ago

@yarikoptic

how much data was cached locally?

The total size of all */.git/datalad/cache directories in the dandi/dandisets clone is 949,177,524 kB.

yarikoptic commented 1 year ago

@lawrence-mbf is there some more lightweight way to do basic "open nwb testing" with matnwb? In our preliminary testing in https://github.com/dandi/dandisets-healthstatus/pull/3#issuecomment-1324029002 nwbRead might be too heavy to allow for scalable testing.

lawrence-mbf commented 1 year ago

Not currently. It might be worth profiling to see what the bottleneck here is. I'm not familiar with what FUSE is but if it's using ROS3 it's most likely that driver.

yarikoptic commented 1 year ago

that timing, if I got it right, was on a fully downloaded file... a sizeable though (11GB). It is just that it indeed takes that while!

lawrence-mbf commented 1 year ago

I'll take a look if something can be done.

dandi / dandisets-healthstatus

Initial design/implementation #1