NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

[Bug]: Unbounded stream to dataset with H5DataIO timestamps fails #1929

Closed: cboulay closed this issue 1 month ago

cboulay commented 3 months ago

What happened?

I am streaming multiple unbounded data streams to NWB (i.e., recording live data), with one stream -> dataset per file; I intend to combine the files after recording. Most of the documentation around iterative writing outlines how to wrap a DataChunkIterator around a generator, but I find this cumbersome for unbounded data because I need to run the generator in another thread, pulling from a queue that the main thread supplies with data (sketched below).
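
For reference, a minimal sketch of that pattern (queue_generator, sample_queue, and SENTINEL are illustrative names, not pynwb API):

import queue
from hdmf.data_utils import DataChunkIterator

sample_queue = queue.Queue()
SENTINEL = None  # hypothetical end-of-stream marker

def queue_generator():
    # Block on the queue and yield chunks until the sentinel arrives.
    while True:
        chunk = sample_queue.get()
        if chunk is SENTINEL:
            return
        yield chunk

# io.write() consumes this iterator as it writes, so the write has to run
# in a separate thread while the main thread calls sample_queue.put(chunk).
data_iter = DataChunkIterator(data=queue_generator(), maxshape=(None, 16, 2))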

The alternative approach (allocating an expandable H5DataIO up front and resizing its dataset after the initial write, as in the reproduction below) is more flexible and seems to fit my use case better. However, the alternative approach does not work for timestamps, even though a DataChunkIterator does work for timestamps.

Unfortunately, I don't fully trust the reported sample rate coming from my data sources, so I would prefer to store explicit timestamps rather than set only the start time and rate.

If using a DataChunkIterator is preferable to using H5DataIO, are there other patterns I should try? Or do I bite the bullet and use the multi-threaded approach?

Steps to Reproduce

import datetime
import tempfile
from pathlib import Path

import numpy as np
from hdmf.backends.hdf5.h5_utils import H5DataIO
from pynwb import NWBHDF5IO, NWBFile, TimeSeries

def test_nwb_stream_timestamps():
    # srate = 10.0
    data_shape = 20, 16, 2
    # Float dtype so that fillvalue=np.nan below is valid for the dataset.
    test_data = np.arange(np.prod(data_shape), dtype=np.float64).reshape(data_shape)
    test_timestamps = np.arange(data_shape[0], dtype=np.float64)

    # Expandable data dataset: starts empty, unbounded along the first axis.
    dataio = H5DataIO(
        shape=(0,) + data_shape[1:],
        dtype=test_data.dtype,
        maxshape=(None,) + data_shape[1:],
        fillvalue=np.nan,
    )
    # Matching expandable timestamps dataset.
    timestampio = H5DataIO(
        shape=(0,),
        dtype=np.float64,
        maxshape=(None,),
        fillvalue=np.nan,
    )
    # Next line fails
    data_series = TimeSeries(
        name="TODO: name from params",
        data=dataio,
        timestamps=timestampio,
        # rate=srate,  # Comment out the timestamps line and uncomment this to avoid the error
        unit="n/a",
    )

    nwbfile = NWBFile(
        session_description="demonstrate streaming timestamps",
        identifier="abcd1234",
        session_start_time=datetime.datetime.now(datetime.timezone.utc),
    )
    nwbfile.add_acquisition(data_series)

    # Write the (still empty) datasets to file; appends happen after write.
    file_path = Path(tempfile.gettempdir()) / "test_nwb_stream_timestamps.nwb"
    io = NWBHDF5IO(file_path, "w")
    io.write(nwbfile)

    # Append a chunk by resizing each dataset and filling the new rows.
    dataio.dataset.resize(len(dataio.dataset) + len(test_data), axis=0)
    dataio.dataset[-len(test_data):] = test_data
    timestampio.dataset.resize(len(timestampio.dataset) + len(test_timestamps), axis=0)
    timestampio.dataset[-len(test_timestamps):] = test_timestamps

    io.close()

Traceback

>       data_series = TimeSeries(
            name="TODO: name from params",
            data=dataio,
            timestamps=timestampio,
            # rate=srate,
            unit="n/a",  # TODO
        )

test_nwb.py:43: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../.venv/lib/python3.9/site-packages/hdmf/utils.py:667: in func_call
    pargs = _check_args(args, kwargs)
../../../.venv/lib/python3.9/site-packages/hdmf/utils.py:639: in _check_args
    parsed = __parse_args(
../../../.venv/lib/python3.9/site-packages/hdmf/utils.py:351: in __parse_args
    valshape = get_data_shape(argval)
../../../.venv/lib/python3.9/site-packages/hdmf/utils.py:916: in get_data_shape
    if hasattr(data, 'maxshape'):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <hdmf.backends.hdf5.h5_utils.H5DataIO object at 0x16db5fa00>
attr = 'maxshape'

    def __getattr__(self, attr):
        """Delegate attribute lookup to data object"""
        if attr == '__array_struct__' and not self.valid:
            # np.array() checks __array__ or __array_struct__ attribute dep. on numpy version
            raise InvalidDataIOError("Cannot convert data to array. Data is not valid.")
        if not self.valid:
>           raise InvalidDataIOError("Cannot get attribute '%s' of data. Data is not valid." % attr)
E           hdmf.data_utils.InvalidDataIOError: Cannot get attribute 'maxshape' of data. Data is not valid.

../../../.venv/lib/python3.9/site-packages/hdmf/data_utils.py:1104: InvalidDataIOError

Operating System

macOS

Python Executable

Python

Python Version

3.9

Package Versions

env.txt


cboulay commented 3 months ago

I may have jumped to a conclusion about the error. Even if I pass manual timestamps with np.arange(test_data.shape[0]), I still get the same error: it is the data argument that is missing maxshape. The error only appears when timestamps are provided because the TimeSeries __init__ call to self._check_time_series_dimension() only looks for maxshape when timestamps is not None.

But I did provide the maxshape argument to my H5DataIO object... so where did it go? It is stored in the object's io_settings dict, but H5DataIO itself exposes no maxshape attribute or property, so the attribute lookup falls through to __getattr__ and raises the InvalidDataIOError seen in the traceback.

I added the following to H5DataIO in hdmf/backends/hdf5/h5_utils.py

    @property
    def maxshape(self):
        # Surface the maxshape passed at construction, which otherwise
        # lives only in the object's io_settings.
        return self.io_settings["maxshape"]

And with that, my test code works.
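
As a quick sanity check (assuming the patched hdmf):

timestampio = H5DataIO(shape=(0,), dtype=np.float64, maxshape=(None,), fillvalue=np.nan)
assert timestampio.maxshape == (None,)  # resolved via the new property, not __getattr__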

Edit: the examples in #1011 worked because DataChunkIterator exposes a maxshape property. So I'm guessing the "alternative approach" in the docs never worked with manual timestamps because of this missing-maxshape problem; a minimal illustration of the difference follows.
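
To illustrate the asymmetry (a minimal sketch; the toy array and maxshape values are illustrative):

import numpy as np
from hdmf.data_utils import DataChunkIterator

# DataChunkIterator defines maxshape as a property, so get_data_shape()
# in hdmf.utils can read it via hasattr(data, 'maxshape')...
dci = DataChunkIterator(data=np.zeros((1, 2)), maxshape=(None, 2))
print(dci.maxshape)  # (None, 2)
# ...whereas H5DataIO (before the fix) only kept maxshape in io_settings,
# so the same hasattr() check tripped __getattr__ and raised InvalidDataIOError.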

stephprince commented 2 months ago

Hi @cboulay, thanks for the detailed code example and bug description!

Using the "alternative approach" to create the H5DataIO objects for data and timestamps seems reasonable for your use case.

Your fix to add a maxshape property to the H5DataIO object seems like a good way to address this issue. I believe this issue should mostly be on the hdmf side, so thank you for already opening the relevant issue there. Would you be interested in filing a pull request in the hdmf repository to fix this bug?

cboulay commented 2 months ago

PR made: https://github.com/hdmf-dev/hdmf/pull/1149

cboulay commented 1 month ago

The upstream PR was merged.