[Feature]: Add `read_nwb` to simplify reading nwbfiles

h-mayorquin commented 3 weeks ago

What would you like to see added to PyNWB?

This is a comment that I have gotten from various users: reading and nwbfile is not as easy as it could be.

At the moment, when I google "how to read and nwbfile", the results show a suggestion to use NWBHDF5IO. This is also what we have on the tutorial.

While the IO approach is flexible, powerful, and follows good practices (i.e., it can be used as a context manager) it also suffers from what I think are several problems that detract from usability: 1) It has too many options: path, file, region, mode, manager, etc. 2) It is hard to remember the name to import. 3) It does not read zarr. 4) It requires two method calls, first instantiating the IO and then using read on the IO.

What solution would you like?

I think that we could provide users with a simple entry point that covers most cases and leave the IO approach as the power tool behind the scenes:

from pynwb import read_nwb

nwbfile = read_nwb(file_path)

I think the name is easier to remember and intuitive about what it does.

For simplicity, this function should:

Work only paths
Always on read mode
Open both zarr and hdf5.
load_namespaces=True
If we are ambitious we can have a default streaming choice if a link to dandi s3 asset is passed.

Downsides?

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

[X] I agree to follow this project's Code of Conduct
[X] Have you checked the Contributing document?
[X] Have you ensured this change was not already requested?

stephprince commented 3 weeks ago

I like this idea! For me the main downside would be any potential confusion for users about having multiple methods to read a file and when to use read_nwb vs. NWBHDF5IO, but I think this could be clarified in the docs.

@oruebel @rly @bendichter thoughts?

oruebel commented 3 weeks ago

potential confusion for users about having multiple methods to read a file and when to use read_nwb vs. NWBHDF5IO

I would make this part of io/utils.py to also make it clear on import that this is a convenience function. In addition, I think we want to be clear about the scoping of the function . Ie., folks will ask for the function to do everything, i.e, be able to open any NWB file and be able to configurabl all possible options. I think we want to limit scope the function to be: 1) read only and 2) have only the filename as a parameter and not support additional configurations. I.e., have it as a convenience function for opening a file for just read. As such NWBHDF5IO would still be used in most of our tutorials, since many tutorials show how to write data and read_io would be used mainly in read tutorials. Also, just scoping for read will be challenging enough since the function will need to work with:

both HDF5 and Zarr and return ~~either NWBHDF5IO or NWBZarrIO~~ an NWBFile object, which uses either NWBHDF5IO or NWBZarrIO for read (i.e., NWBFile.get_read_io() may be either NWBHDF5IO or NWBZarrIO)
S3 paths to open a files on DANDI
know when to load namespaces (or default to always load namespaces?)
...

stephprince commented 3 weeks ago

Here an example implementation of a read_nwb method that Cody made for NWBInspector that might be useful as a reference. And if we do incorporate streaming, here is a related proposal on a default function for remote read: https://github.com/NeurodataWithoutBorders/pynwb/issues/1739

Also, just scoping for read will be challenging enough since the function will need to work with: both HDF5 and Zarr and return either NWBHDF5IO or NWBZarrIO

If I’m understanding the initial proposal, with this convenience method we would not return an IO object but just the NWBFile object?

oruebel commented 3 weeks ago

If I’m understanding the initial proposal, with this convenience method we would not return an IO object but just the NWBFile object?

Sorry, you are correct, read_nwb should always return the NWBFile (not the IO object). I amended my comment above accordingly. However, the I/O object is still accessible via the NWBFile object via NWBFile.get_read_io(), i.e.,:

nwbfile = read_nwb('myfile.nwb')   # read HDF5
nwbfile.get_read_io()   # this returns NWBHDF5IO

nwbfile = read_nwb('myfile.zarr.nwb')  # read Zarr.
nwbfile.get_read_io()   # this returns NWBZarrIO

magland commented 2 weeks ago

I like this idea, but I would hope that it would also support passing in an h5py object instead of just a file path or an s3 path. That would allow streaming. For what I do, I am almost always streaming from a remote location. Passing in the h5py object provides a lot of flexibility on how this would be done (streaming method, authentication, etc).

h-mayorquin commented 2 weeks ago

I like this idea, but I would hope that it would also support passing in an h5py object instead of just a file path or an s3 path. That would allow streaming. For what I do, I am almost always streaming from a remote location. Passing in the h5py object provides a lot of flexibility on how this would be done (streaming method, authentication, etc).

My view is that the intended audience of this function is users without a lot of experience. I would prefer to keep the signature and the input types very simple and defer more complex use cases like yours to their respective backend IOs.

I am coming at this from the angle that if the format is successful reading will be way more common than writing, so I just want to have a top level method that is easier to remember, use and covers 95 % of the cases for reading files.

h-mayorquin commented 2 weeks ago

To make this more concrete, this is what I am envisioning as the signature, input types and docstring:

from pathlib import Path
from pynwb import NWBFile

def read_nwb(path: str | Path) -> NWBFile:
    """Read an NWB file from a local path or remote URL.

    Provides a simple, high-level interface for reading NWB files in the most 
    common use cases. Automatically handles both HDF5 and Zarr formats.
    For advanced use cases (parallel I/O, custom namespaces), use NWBHDF5IO or NWBZarrIO.

    Parameters
    ----------
    path : str or pathlib.Path
        Path to the NWB file. Can be either a local filesystem path to an HDF5 (.nwb) 
        or Zarr (.zarr) file, or a remote URL (e.g., DANDI S3 asset URL).

    Returns
    -------
    pynwb.NWBFile
        The loaded NWB file object containing all datasets and metadata.

    See Also
    --------
    pynwb.NWBHDF5IO : Core I/O class for HDF5 files with advanced options.
    hdmf_zarr.nwb.NWBZarrIO : Core I/O class for Zarr files with advanced options.

    Notes
    -----
    This function uses the following defaults:
    * Always opens in read-only mode
    * Automatically loads namespaces
    * Detects file format based on extension
    * Automatically handles local and remote paths

    Advanced features requiring direct use of IO classes include:
    * Custom namespace extensions
    * Parallel I/O with MPI
    * Custom build managers
    * Write or append modes
    * Pre-opened HDF5 file objects or Zarr stores
    * Remote file access configuration

    Examples
    --------
    Read a local NWB file:

    >>> from pynwb import read_nwb
    >>> nwbfile = read_nwb("path/to/file.nwb")

    Read from a remote DANDI asset:

    >>> nwbfile = read_nwb("s3://dandiarchive/.../file.nwb")
    """
    pass  # Implementation details would go here

magland commented 2 weeks ago

I see your point, but I feel like allowing the argument to be an h5py object as well wouldn't complicate things very much. This would really simplify all my neurosift scripts, so I hope this could be considered.

Regarding s3 url, I think it's important to also support https://api.dandiarchive.org/... URLs Would be nice to allow any https://... so it's not limited to dandi.

oruebel commented 2 weeks ago

This would really simplify all my neurosift scripts, so I hope this could be considered.

@magland could you add a brief example to illustrate how read_nwb would simplify your scripts vs. using NWBHDF5IO? If you are passing in an open h5py object, then I think either should look very similar, but I'm probably missing something.

magland commented 2 weeks ago

@oruebel

f = lindi.LindiH5pyFile.from_lindi_file(url)
io = pynwb.NWBHDF5IO(file=f, mode='r')
nwbfile = io.read()

would become

f = lindi.LindiH5pyFile.from_lindi_file(url)
nwbfile = read_nwb(f)

So it saves one line... but it becomes a lot simpler to read.

oruebel commented 2 weeks ago

So it saves one line... but it becomes a lot simpler to read.

Thanks @magland for the clarification.

I think allowing a file as input instead of path would be OK, since it doesn't add any parameters to the interface. We may want to restrict it to h5py.File objects that are in read-only mode (or at least warn if they are not), since otherwise you might get a write-able file instead of a read-only file.

Just a thought, now that the io object is being stored on the NWBFile object. We could add static read_nwb methods on each backend, i.e., nwbfile = NWBHDF5IO.read_nwb(). This would help simplify the logic of the utility method read_nwb, since it would just call NWBHDF5IO.read_nwb() or NWBZarrIO.read_nwb() and then each backend could add additional logic. E.g., NWBHDF5IO.read_nwb() could accept and h5py.File instead of a path while keeping the interface of read_nwb simple to allow only path. I.e., in this case the lindi example would change to:

f = lindi.LindiH5pyFile.from_lindi_file(url)
nwbfile = NWBHDF5IO.read_nwb(f)

@h-mayorquin @magland what do you think? Would adding NWBHDF5IO.read_nwb() and NWBZarrIO.read_nwb() in addition to read_nwb make things simpler or worse?

magland commented 2 weeks ago

@oruebel That makes sense to me. I think if we had NWBHDF5IO.read_nwb(f) that supported h5py objects, then I wouldn't need the read_nwb utility, which could be reserved for the simplest case.

h-mayorquin commented 2 weeks ago

This would help simplify the logic of the utility method read_nwb, since it would just call NWBHDF5IO.read_nwb() or NWBZarrIO.read_nwb() and then each backend could add additional logic. E.g., NWBHDF5IO.read_nwb() could accept and h5py.File instead of a path while keeping the interface of read_nwb simple to allow only path.

This is great. Another advantage is that the reading code is in its "natural" place: the backend IO object. read_nwb is then just a router over those new methods.

oruebel commented 2 weeks ago

Sounds like we have a plan:

PyNWB

Add NWBHDF5IO.read_nwb() which should accept: 1) local path, 2) S3 path, or 3) an open h5py file.
Add pynwb.read_nwb() which accepts a file path and calls the appropriate backend-specific read_nwb method
Update the tutorial on reading NWB files https://pynwb.readthedocs.io/en/latest/tutorials/general/plot_read_basics.html#sphx-glr-tutorials-general-plot-read-basics-py

HDMF_ZARR

Add NWBZarrIO.read_nwb() which should accept: 1) local path, 2) S3 path, or 3) an open Zarr store.

HDMF / HDMF_ZARR / PyNWB

We may also need to refine the logic in HDF5IO.can_read() and ZarrIO.can_read() (or overwrite the methods in NWBHDF5IO and NWBZarrIO) to handle S3 paths correctly.

oruebel commented 2 weeks ago

I created https://github.com/hdmf-dev/hdmf-zarr/issues/225 for this and can take a stab at the changes needed in HDFM_ZARR.

@h-mayorquin would you be interested in taking a stab at the necessary changes in PyNWB?

h-mayorquin commented 2 weeks ago

Yes, I can take a look at it!

oruebel commented 1 week ago

@h-mayorquin the PR for adding NWBZarrIO.read_nwb for the Zarr backend is here https://github.com/hdmf-dev/hdmf-zarr/pull/226

h-mayorquin commented 1 week ago

Here is what I think is the analogous PR on pynwb (first part): https://github.com/NeurodataWithoutBorders/pynwb/pull/1979

oruebel commented 1 week ago

@h-mayorquin one item to add to the ToDo list is that we should also update the tutorial on reading NWB files https://pynwb.readthedocs.io/en/latest/tutorials/general/plot_read_basics.html#sphx-glr-tutorials-general-plot-read-basics-py to describe read_nwb for opening a file for reading.

h-mayorquin commented 1 week ago

Sounds good, I can take care of that in the PR for pynwb.read_nwb().

oruebel commented 6 days ago

@h-mayorquin https://github.com/hdmf-dev/hdmf-zarr/pull/226 for adding NWBZarrIO.read_nwb has been merged

h-mayorquin commented 6 days ago

OK, now that the PyNWB part has been merged https://github.com/NeurodataWithoutBorders/pynwb/pull/1979 I think I can work on the second part:

Add pynwb.read_nwb() which accepts a file path and calls the appropriate backend-specific read_nwb method

And updating the tutorial.

h-mayorquin commented 2 days ago

I added pynwb.read_nwb for local files:

https://github.com/NeurodataWithoutBorders/pynwb/pull/1994

I would wait for one of our meetings to discuss support for remote_paths in pynwb._read_nwb.

NeurodataWithoutBorders / pynwb