NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

[Feature]: Chunk all TimeSeries data and timestamps by default #1945

Open rly opened 2 months ago

rly commented 2 months ago

What would you like to see added to PyNWB?

Chunking generally improves read/write performance and is more cloud-friendly (and LINDI-friendly). (Related to https://github.com/NeurodataWithoutBorders/lindi/pull/84.)

I suggest that TimeSeries.__init__ wrap data and timestamps in an H5DataIO or ZarrDataIO, depending on the backend, with chunks=True, if the input data/timestamps are not already wrapped. We could add chunk_data=True and chunk_timestamps=True flags that users can set to False to turn this behavior off. A challenge will be figuring out the backend within TimeSeries...
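For concreteness, here is a minimal sketch of the manual wrapping this would automate, assuming the HDF5 backend (the array shapes, names, and values are just illustrative):

```python
import numpy as np
from pynwb import TimeSeries, H5DataIO

# illustrative data: 100k frames x 64 channels, irregularly sampled
data = np.random.rand(100_000, 64)
timestamps = np.arange(100_000) / 30_000.0

ts = TimeSeries(
    name="raw",
    data=H5DataIO(data, chunks=True),             # let h5py pick a default chunk shape
    timestamps=H5DataIO(timestamps, chunks=True),
    unit="volts",
)
```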

We could use the h5py defaults for now, and more targeted defaults for ElectricalSeries data / TwoPhotonSeries data later.

I believe all Zarr data are already chunked by default.

Is your feature request related to a problem?

Contiguous (unchunked) HDF5 datasets read slowly along their non-contiguous dimensions and are difficult to stream or to use with Zarr/LINDI.

What solution would you like?

Chunking of time series data by default.

Do you have any interest in helping implement the feature?

Yes.


bendichter commented 2 months ago

This is tougher than it sounds. We tried the default compression settings on a few large TimeSeries datasets early on and found that they produced chunks that were very long in time and narrow in channels, which was problematic for a number of use cases, including visualization with Neurosift. Those defaults just don't work well for us, and we need to be more thoughtful about our chunk shapes. We have implemented code that automatically wraps all TimeSeries with DataIOs here:

https://neuroconv.readthedocs.io/en/main/api/tools.nwb_helpers.html#neuroconv.tools.nwb_helpers.configure_and_write_nwbfile

There, the chunk shapes are determined with these kinds of considerations in mind. I'd be fine with thinking about migrating some of this into HDMF, especially since these tools are useful on their own outside of the rest of NeuroConv. Thoughts on this, @CodyCBakerPhD?

oruebel commented 2 months ago

I suggest that TimeSeries.__init__ wraps data and timestamps with a H5DataIO or ZarrDataIO

TimeSeries.__init__ doesn't know the backend, so wrapping there is challenging unless the user explicitly provides the specific DataIO class they want to use. But that doesn't seem much more convenient than wrapping the data directly. If you want a default chunking option, then I think having logic in HDF5IO.write that decides which datasets to chunk may be easier to implement and would cover all types (not just TimeSeries). However, as Ben mentioned, it's easy to get such logic wrong because the optimal settings really depend on what the user wants to do with the data. I'd lean towards leaving the choice of chunking to the user.
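To illustrate the point, today the user has to pick the backend-specific DataIO class up front, roughly like this (a sketch only; shapes and chunk sizes are arbitrary):

```python
import numpy as np
from pynwb import TimeSeries, H5DataIO
from hdmf_zarr import ZarrDataIO

data = np.random.rand(1_000_000, 64)

# Writing to HDF5: chunks=True lets h5py choose a default chunk shape.
data_h5 = H5DataIO(data, chunks=True)

# Writing to Zarr: the chunk shape is given explicitly (Zarr also auto-chunks if omitted).
data_zarr = ZarrDataIO(data, chunks=(100_000, 64))

# The TimeSeries itself cannot tell which backend will eventually be used for writing.
ts = TimeSeries(name="raw", data=data_h5, unit="volts", rate=30_000.0)
```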

sneakers-the-rat commented 2 months ago

came searching to see if y'all had talked about this, as i am working on default chunking and compression. curious what the blockers would be to applying some sensible defaults for both chunking and compression? most neuro data is extremely compressible, and the format gives pretty decent hints about the sizes and shapes to expect. i think your average neuroscientist is probably not aware of and likely doesn't care about chunking/compression, but they probably do care if it takes an order of magnitude more time and space to use their data.

seems benchmarkable/optimizable? like write a simple chunk size guesser that e.g. targets 512 KiB chunks, then measure compression ratio and I/O speed? would be happy to help if this is something we're interested in :)
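for example, a naive guesser could keep the full channel dimension and grow the time dimension until it hits a byte budget (the 512 KiB target and the (time, channels) layout here are just assumptions for illustration):

```python
import numpy as np

def guess_chunk_shape(shape, dtype, target_bytes=512 * 1024):
    """Guess a (n_frames, *other_dims) chunk shape close to target_bytes."""
    itemsize = np.dtype(dtype).itemsize
    bytes_per_frame = itemsize * int(np.prod(shape[1:], dtype=int))
    n_frames = max(1, min(shape[0], target_bytes // bytes_per_frame))
    return (int(n_frames),) + tuple(shape[1:])

print(guess_chunk_shape((30_000_000, 384), np.int16))  # (682, 384), ~512 KiB per chunk
```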

bendichter commented 2 months ago

@sneakers-the-rat yes, we have done a lot of research on this. Here's the summary

compression ratio: we rarely ever see an order of magnitude. It's usually a savings of 20-50%, which is great, but I don't want to over-promise.

compressor: Of the HDF5-compatible compressors, zstd is a bit better than gzip all-around (read speed, write speed, and compression ratio). However, it does not ship with HDF5 by default and requires a bit of extra installation. You can do better with other compressors that you can use in Zarr but not easily in HDF5.

size: The official HDF Group recommendation used to be 10 KiB, which works well on disk but not for streaming applications. 10 MiB is much better if you want to stream chunks from the cloud.

shape: This one is tricky. In h5py, setting chunks=True creates chunks whose default shape is similar (in the geometric sense) to the dataset's shape. You would think this would be nice and versatile, but it actually isn't. The first problem with this approach is that it uses the old 10 KiB recommendation. Even if you change that to 10 MiB, the shapes don't really work. For time series, you end up with long strands in time, so if you want to view all channels for a short window of time you have a very inefficient data access pattern. You instead want to group channels by, e.g., 64. Considerations like this are why it's hard to find a one-size-fits-all solution and why you need to at least consider specific neurodata types.
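As a rough illustration of that heuristic (the numbers are assumptions, not a recommendation): ~10 MiB chunks of int16 data with channels grouped in blocks of 64 works out to roughly 80k frames per chunk:

```python
import numpy as np

target_bytes = 10 * 1024**2                      # ~10 MiB per chunk
channels_per_chunk = 64                          # channel grouping mentioned above
itemsize = np.dtype(np.int16).itemsize           # 2 bytes per sample
frames_per_chunk = target_bytes // (channels_per_chunk * itemsize)
print((frames_per_chunk, channels_per_chunk))    # (81920, 64)
```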

We have implemented all of this in NeuroConv, and all of it is automatically applied by configure_and_write_nwbfile. See this tutorial: https://neuroconv.readthedocs.io/en/main/user_guide/backend_configuration.html
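For reference, the pattern from that tutorial looks roughly like this (a sketch based on the linked docs; the exact API may have changed since):

```python
from datetime import datetime, timezone
from neuroconv.tools.nwb_helpers import configure_backend, get_default_backend_configuration
from pynwb import NWBFile, NWBHDF5IO

nwbfile = NWBFile(
    session_description="demo", identifier="demo", session_start_time=datetime.now(timezone.utc)
)
# ... add TimeSeries etc. to nwbfile here ...

# Build per-dataset chunking/compression settings, tweak them if desired, then apply.
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")
configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)

with NWBHDF5IO("output.nwb", mode="w") as io:
    io.write(nwbfile)
```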

sneakers-the-rat commented 2 months ago

thx for the info :) yes, you're right. i was testing with gzip earlier and got ~50% across a half dozen files (small sample); i was remembering the results i got w/ lossless and lossy video codecs on video data, my bad.

makes sense! sry for butting in, onwards to the glorious future where we untether from hdf5 <3