rly opened this issue 3 months ago
This is tougher than it sounds. We tried the default compression settings for a few large TimeSeries datasets early on and found that they produced chunks that were very long in time and narrow in channels, which was problematic for a number of use cases, including visualization with Neurosift. These default settings just don't work well for us, and we need to be more thoughtful about our chunk shapes. We have implemented code that automatically wraps all TimeSeries with DataIOs here:
The chunk shapes are determined with these kinds of considerations in mind. I'd be fine with thinking about migrating some of this into HDMF, especially since these tools are useful on their own outside of the rest of NeuroConv. Thoughts on this, @CodyCBakerPhD?
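For readers unfamiliar with the problem, here is a minimal sketch (not NeuroConv's code) of how h5py's automatically chosen chunk shape can be inspected and overridden; the file name, array shape, and explicit chunk shape are made up for illustration.

```python
# Minimal sketch (not NeuroConv's actual code): compare h5py's auto-chosen
# chunk shape with an explicitly chosen one for a tall, narrow TimeSeries-like array.
import h5py
import numpy as np

data = np.zeros((1_000_000, 32), dtype=np.int16)  # hypothetical: 1M frames x 32 channels

with h5py.File("chunk_demo.h5", "w") as f:
    auto = f.create_dataset("auto", data=data, chunks=True, compression="gzip")
    explicit = f.create_dataset(
        "explicit",
        data=data,
        chunks=(81_920, 32),  # ~5 MiB of int16, covering all channels in each chunk
        compression="gzip",
    )
    print("auto-chosen chunk shape:    ", auto.chunks)
    print("explicitly set chunk shape: ", explicit.chunks)
```

With `chunks=True`, h5py picks the shape itself; passing an explicit tuple is the hook a wrapper like NeuroConv's can use to enforce shapes better suited to time-windowed reads.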
> I suggest that `TimeSeries.__init__` wraps data and timestamps with an `H5DataIO` or `ZarrDataIO`
`TimeSeries.__init__` doesn't know the backend, so wrapping there is challenging unless a user explicitly provides the specific `DataIO` class they want to use. But that doesn't seem much more convenient than wrapping the data directly. If you want a default chunking option, then I think having logic in `HDF5IO.write` to decide which datasets to chunk may be easier to implement and would cover all types (not just `TimeSeries`). However, as Ben mentioned, it's easy to get such logic wrong because the optimal settings really depend on what the user wants to do with the data. I'd lean towards leaving the choice of chunking to the user.
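For context, "wrapping the data directly" with today's API looks roughly like the sketch below; the series name, unit, rate, and array shape are placeholders.

```python
# Sketch of what a user can already do today: explicitly wrap the data in an
# H5DataIO so chunking/compression settings travel with the dataset at write time.
import numpy as np
from hdmf.backends.hdf5.h5_utils import H5DataIO
from pynwb import TimeSeries

raw = np.zeros((100_000, 64), dtype=np.int16)  # placeholder: 100k frames x 64 channels

ts = TimeSeries(
    name="example_series",
    data=H5DataIO(raw, chunks=True, compression="gzip"),  # let h5py pick the chunk shape
    unit="volts",
    starting_time=0.0,
    rate=30_000.0,
)
```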
Came searching to see if y'all had talked about this, as I am working on default chunking and compression. Curious what the blockers would be to applying some sensible defaults for both? Most neuro data is extremely compressible, and the format gives pretty decent hints about the sizes and shapes to expect. I think your average neuroscientist is probably not aware of, and likely doesn't care about, chunking/compression, but they probably do care if it takes an order of magnitude more time and space to use their data.

Seems benchmarkable/optimizable? Like, write a simple chunk-size guesser that targets e.g. 512 KiB chunks and measure compression ratio and IO speed? Would be happy to help if this is something we're interested in :)
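A toy version of such a benchmark might look like the sketch below (hypothetical; the candidate chunk shapes and synthetic signal are made up, and real acquisition data typically compresses better than this).

```python
# Toy benchmark sketch (hypothetical, not an existing NWB tool): write the same
# array with a few candidate ~512 KiB chunk shapes, then report the gzip
# compression ratio and the time to read a short window across all channels.
import time

import h5py
import numpy as np

rng = np.random.default_rng(0)
# Synthetic random-walk "signal" so gzip has some structure to exploit.
data = np.cumsum(rng.normal(0, 1, size=(500_000, 64)), axis=0).astype(np.int16)

candidates = [(262_144, 1), (4_096, 64), (32_768, 8)]  # each ~512 KiB of int16

with h5py.File("chunk_benchmark.h5", "w") as f:
    for chunks in candidates:
        dset = f.create_dataset(
            f"chunks_{chunks[0]}x{chunks[1]}", data=data, chunks=chunks, compression="gzip"
        )
        ratio = data.nbytes / dset.id.get_storage_size()
        start = time.perf_counter()
        _ = dset[:10_000, :]  # all channels, short time window
        elapsed = time.perf_counter() - start
        print(f"chunks={chunks}: ratio ~{ratio:.2f}, window read {elapsed * 1000:.1f} ms")
```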
@sneakers-the-rat yes, we have done a lot of research on this. Here's the summary:

- Compression ratio: we rarely see an order of magnitude. It's usually a savings of 20-50%, which is great, but I don't want to over-promise.
- Compressor: of the HDF5-compatible compressors, zstd is a bit better than gzip all-around (read speed, write speed, and compression ratio). However, it does not come with HDF5 by default and requires a bit of extra installation. You can do better with other compressors that you can use in Zarr but not easily in HDF5.
- Size: the official HDF Group recommendation used to be 10 KiB, which works well for on-disk access but not for streaming applications. 10 MiB is much better if you want to stream chunks from the cloud.
- Shape: this one is tricky. In h5py, setting `chunks=True` creates chunks whose shape is similar (in the geometric sense) to their dataset, and you would think that would be nice and versatile, but it isn't. The first problem with this approach is that it uses the old 10 KiB recommendation. Even if you change that to 10 MiB, the shapes still don't really work: for time series you end up with long strands in time, so if you want to view all channels for a short window of time you have a very inefficient data access pattern. You instead want to group channels by, e.g., 64. Considerations like these are why it's hard to find a one-size-fits-all setting and you need to at least consider specific neurodata types.
We have implemented all of this in NeuroConv, and all of it is automatically applied by `configure_and_write_nwbfile`. See this tutorial: https://neuroconv.readthedocs.io/en/main/user_guide/backend_configuration.html
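To make the size and shape points concrete, here is an illustrative heuristic in the spirit described above, not NeuroConv's actual implementation: group channels in blocks of 64 and size the time dimension so each chunk is roughly 10 MiB. The function name and example shape are hypothetical.

```python
# Illustrative heuristic only (not NeuroConv's implementation): pick a chunk
# shape for a (frames x channels) array using channel groups of 64 and a
# target of roughly 10 MiB per chunk.
import numpy as np

def guess_timeseries_chunks(shape, dtype, target_bytes=10 * 1024**2, channel_group=64):
    n_frames, n_channels = shape
    channels_per_chunk = min(channel_group, n_channels)
    frames_per_chunk = target_bytes // (channels_per_chunk * np.dtype(dtype).itemsize)
    return (max(1, min(frames_per_chunk, n_frames)), channels_per_chunk)

# Hypothetical example: an hour of 384-channel int16 data sampled at 30 kHz.
print(guess_timeseries_chunks((30_000 * 3_600, 384), np.int16))  # -> (81920, 64)
```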
thx for the info :) yes, you're right; I was testing with gzip earlier and got ~50% across a half dozen files (small sample). I was remembering the results I got with lossless and lossy video codecs on video data, my bad.
makes sense! sorry for butting in; onwards to the glorious future where we untether from HDF5 <3
What would you like to see added to PyNWB?
Chunking generally improves read/write performance and is more cloud-friendly (and LINDI-friendly). (Related to https://github.com/NeurodataWithoutBorders/lindi/pull/84.)
I suggest that `TimeSeries.__init__` wraps data and timestamps with an `H5DataIO` or `ZarrDataIO`, depending on backend, with `chunks=True`, if the input data/timestamps are not already wrapped. We can add flags, say `chunk_data=True` and `chunk_timestamps=True`, that users can flip to turn off this behavior. A challenge will be figuring out the backend within `TimeSeries`... We could use the h5py defaults for now, and more targeted defaults for `ElectricalSeries` data / `TwoPhotonSeries` data later. I believe all Zarr data are already chunked by default.
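To picture the proposal, here is a hedged sketch of a standalone helper that mimics the suggested default; the helper name and `chunk_data` handling are hypothetical, and nothing like this exists in PyNWB yet.

```python
# Hypothetical helper mirroring the proposal (this does not exist in PyNWB):
# wrap data for chunked HDF5 storage unless it is already wrapped in a DataIO
# or the (proposed) chunk_data flag is turned off.
from hdmf.backends.hdf5.h5_utils import H5DataIO
from hdmf.data_utils import DataIO

def with_default_chunking(data, chunk_data=True):
    """Return data wrapped in H5DataIO(chunks=True), or unchanged if already wrapped."""
    if not chunk_data or isinstance(data, DataIO):
        return data
    return H5DataIO(data, chunks=True)
```

In the proposal, this check would live inside `TimeSeries.__init__` (or possibly `HDF5IO.write`), gated by the `chunk_data` / `chunk_timestamps` flags, with a `ZarrDataIO` variant for the Zarr backend.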
Is your feature request related to a problem?
Contiguous HDF5 datasets have slow performance in the non-contiguous dimensions and are difficult to stream or use with Zarr/LINDI.
What solution would you like?
Chunking of time series data by default.
Do you have any interest in helping implement the feature?
Yes.