NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

Cannot iteratively write timestamps with DataChunkIterator #1011

Closed · rlmv closed 5 years ago

rlmv commented 5 years ago

1) Bug

I am unable to write the timestamps for a TimeSeries using a DataChunkIterator (or any other AbstractDataChunkIterator), which the documentation suggests should be possible. The same error is raised if I wrap the iterator in an H5DataIO.

For more context, I'm trying to create an NWB file from some (potentially large) existing data. I would like to stream both the data and the timestamps so that I don't have to load each array into memory before writing the file.

Steps to Reproduce

from datetime import datetime
from hdmf.data_utils import DataChunkIterator
from pynwb import NWBFile, TimeSeries, NWBHDF5IO
import numpy as np

nwbfile = NWBFile('Test', '123', datetime.now())

def timestamp_iterator():
    for i in range(10):
        yield float(i)

timestamps = DataChunkIterator(timestamp_iterator())
data = np.arange(0, 10, 1)

nwbfile.add_acquisition(
    TimeSeries(
        name='Test',
        data=data,
        timestamps=timestamps,
        unit='uV'))

with NWBHDF5IO('out.nwb', 'w') as io:
    io.write(nwbfile)

Stacktrace

When running this under pytest, the underlying error wrapped by the raise_from is TypeError('float() argument must be a string or a number',), but it is not clear where that error comes from.

Traceback (most recent call last):
  File "timeseries_exporter/timeseries_exporter/error.py", line 24, in <module>
    io.write(nwbfile)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/backends/hdf5/h5tools.py", line 219, in write
    call_docval_func(super(HDF5IO, self).write, kwargs)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/utils.py", line 281, in call_docval_func
    return func(*fargs, **fkwargs)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/backends/io.py", line 41, in write
    f_builder = self.__manager.build(container, source=self.__source)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/build/map.py", line 164, in build
    result = self.__type_map.build(container, self, source=source, spec_ext=spec_ext)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/build/map.py", line 1652, in build
    builder = attr_map.build(container, manager, builder=builder, source=source, spec_ext=spec_ext)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/build/map.py", line 788, in build
    self.__add_groups(builder, self.__spec.groups, container, manager, source)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/build/map.py", line 1016, in __add_groups
    self.__add_containers(sub_builder, spec, item, build_manager, source, container)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/build/map.py", line 1044, in __add_containers
    rendered_obj = build_manager.build(value, source=source, spec_ext=spec)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/build/map.py", line 164, in build
    result = self.__type_map.build(container, self, source=source, spec_ext=spec_ext)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/build/map.py", line 1652, in build
    builder = attr_map.build(container, manager, builder=builder, source=source, spec_ext=spec_ext)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/build/map.py", line 787, in build
    self.__add_datasets(builder, self.__spec.datasets, container, manager, source)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/hdmf/build/map.py", line 988, in __add_datasets
    raise_from(Exception(msg), ex)
  File "/Users/bomarchman/.pyenv/versions/timeseries-processor/lib/python2.7/site-packages/six.py", line 737, in raise_from
    raise value
Exception: could not convert 'timestamps' for TimeSeries 'Test'


bendichter commented 5 years ago

Hey @rlmv thanks for raising this. I'm looking into it now.

bendichter commented 5 years ago

Reproduced. Using a DataChunkIterator for data works:

from datetime import datetime
from hdmf.data_utils import DataChunkIterator
from hdmf.backends.hdf5 import H5DataIO
from pynwb import NWBFile, TimeSeries, NWBHDF5IO
import numpy as np

nwbfile = NWBFile('Test', '123', datetime.now())

data = H5DataIO(DataChunkIterator(np.arange(0, 10, 1)))

timestamps = np.arange(0, 10, 1)

nwbfile.add_acquisition(
    TimeSeries(
        name='Test',
        data=data,
        timestamps=timestamps,
        unit='uV'))

with NWBHDF5IO('out.nwb', 'w') as io:
    io.write(nwbfile)

but using DataChunkIterator for timestamps does not:

from datetime import datetime
from hdmf.data_utils import DataChunkIterator
from hdmf.backends.hdf5 import H5DataIO
from pynwb import NWBFile, TimeSeries, NWBHDF5IO
import numpy as np

nwbfile = NWBFile('Test', '123', datetime.now())

timestamps = H5DataIO(DataChunkIterator(np.arange(0, 10, 1)))

data = np.arange(0, 10, 1)

nwbfile.add_acquisition(
    TimeSeries(
        name='Test',
        data=data,
        timestamps=timestamps,
        unit='uV'))

with NWBHDF5IO('out.nwb', 'w') as io:
    io.write(nwbfile)

/Users/bendichter/anaconda3/envs/dev_pynwb/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/Users/bendichter/dev/pynwb/src/pynwb/file.py:621: UserWarning: Date is missing timezone information. Updating to local timezone.
  warn("Date is missing timezone information. Updating to local timezone.")
Traceback (most recent call last):
  File "/Users/bendichter/dev/hdmf/src/hdmf/build/map.py", line 984, in __add_datasets
    data, dtype = self.convert_dtype(spec, attr_value)
  File "/Users/bendichter/dev/hdmf/src/hdmf/build/map.py", line 418, in convert_dtype
    ret, ret_dtype = cls.__check_edgecases(spec, value)
  File "/Users/bendichter/dev/hdmf/src/hdmf/build/map.py", line 469, in __check_edgecases
    return value, cls.convert_dtype(spec, value.data)[1]
  File "/Users/bendichter/dev/hdmf/src/hdmf/build/map.py", line 450, in convert_dtype
    ret = dtype_func(value)
TypeError: float() argument must be a string or a number, not 'DataChunkIterator'

bendichter commented 5 years ago

@rlmv This is a bug and I'll see if I can fix it. I also wonder if you might be better off using starting_time and rate instead of timestamps. That's what we generally recommend if the sampling rate is constant.
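
For reference, here is a minimal sketch of the starting_time/rate form (the rate value is a placeholder, not something derived from your data):

from datetime import datetime
import numpy as np
from pynwb import NWBFile, TimeSeries

nwbfile = NWBFile('Test', '123', datetime.now())

nwbfile.add_acquisition(
    TimeSeries(
        name='Test',
        data=np.arange(0, 10, 1),
        starting_time=0.0,  # seconds
        rate=1.0,           # Hz; placeholder value for a constant sampling rate
        unit='uV'))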

bendichter commented 5 years ago

Iterative write works for data but not for timestamps because of a difference in their spec definitions in TimeSeries (https://github.com/NeurodataWithoutBorders/pynwb/blob/92a0463108c6010811f377c3a02da5d95c959094/src/pynwb/data/nwb.base.yaml#L126-L255): data has no dtype, while timestamps has dtype: float64. If you remove dtype: float64, or change it to dtype: numeric, the above code works. That's not the right solution, though; the right fix is to make iterative write work with datasets that have specific dtype definitions.
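
To illustrate where the TypeError in the traceback above comes from, a minimal sketch, assuming convert_dtype ends up applying the spec's scalar dtype constructor (here float) directly to the value it is given:

import numpy as np
from hdmf.data_utils import DataChunkIterator

dci = DataChunkIterator(np.arange(10.))
try:
    float(dci)  # roughly what the dtype: float64 conversion attempts
except TypeError as e:
    print(e)  # float() argument must be a string or a number, not 'DataChunkIterator'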

@oruebel, do you have any advice how to proceed here?

oruebel commented 5 years ago

I'll have a look. I think the fix is probably either to add a case to convert_dtype so that it understands AbstractDataChunkIterator, or to add it to ObjectMapper.__no_convert. I'm testing right now and will get back to you soon.

oruebel commented 5 years ago

Adding the following to the ObjectMapper class in HDMF fixes the error on write. However, that is only part of the fix; we also need to make sure that the correct dtype is used on write. I'm looking at that now.

diff --git a/src/hdmf/build/map.py b/src/hdmf/build/map.py
index d50cdbe..c364599 100644
--- a/src/hdmf/build/map.py
+++ b/src/hdmf/build/map.py
@@ -439,6 +439,9 @@ class ObjectMapper(with_metaclass(ExtenderMeta, object)):
                 ret.append(tmp)
             ret = type(value)(ret)
             ret_dtype = tmp_dtype
+        elif isinstance(value, AbstractDataChunkIterator):
+            ret = value
+            ret_dtype = cls.__resolve_dtype(value.dtype, spec_dtype)
         else:
             if spec_dtype in (_unicode, _ascii):
                 ret_dtype = 'ascii'
rlmv commented 5 years ago

@bendichter @oruebel Thanks for the quick response! I looked at using rate instead of timestamps, but while the data in the series does have a constant sampling rate, in some cases there are chunks without samples. In that case, do you still recommend setting a rate and writing NaN values for the missing data points? Or is it more conventional to write timestamps just for the readings that we do have?

oruebel commented 5 years ago

I just committed the following changes to HDMF which I believe should fix the issue:

https://github.com/hdmf-dev/hdmf/commit/7c26e63947d659899191460b4e2b6b05055e5e19 fixes the problem of determining the correct dtype based on the spec or AbstractDataChunkIterator

https://github.com/hdmf-dev/hdmf/commit/255ee0c512826b4b92468609fe286abe1cc1812c fixes a bug in H5DataIO to make sure the dtype from the builder is used on iterative data write.

I also added dtype and astype functions to the DataChunk class, but that's just a bonus, not something that is strictly needed for this fix: https://github.com/hdmf-dev/hdmf/commit/de550a804fbb173331cae30b0a42f619c4ccd50d
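
A quick sketch of how those DataChunk helpers might be used, assuming they mirror numpy's dtype/astype semantics:

import numpy as np
from hdmf.data_utils import DataChunk

chunk = DataChunk(data=np.arange(3), selection=slice(0, 3))
print(chunk.dtype)                    # dtype of the wrapped array, e.g. int64
print(chunk.astype('float64').dtype)  # float64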

I accidentally pushed the changes directly to the HDMF dev branch (i.e., there is no PR for these fixes). We have since updated the protection settings for the dev branch so that PRs are enforced for administrators as well ;-)

oruebel commented 5 years ago

I'm closing the issue for now, since this is fixed in the latest HDMF. Please reopen the issue if you still see the problem.

in some cases there are chunks without samples.

I believe in that case using timestamps is appropriate.
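
For what it's worth, a minimal sketch of that approach (the chunk layout and the present_timestamps helper here are hypothetical): yield a timestamp only for the samples that actually exist, so gaps never appear in the timestamps dataset.

import numpy as np
from hdmf.data_utils import DataChunkIterator

def present_timestamps(chunks, rate=1.0):
    # chunks: iterable of (start_index, samples) pairs; gaps simply have no entry
    for start, samples in chunks:
        for offset in range(len(samples)):
            yield (start + offset) / rate

chunks = [(0, np.zeros(5)), (8, np.zeros(3))]  # samples 5-7 are missing
timestamps = DataChunkIterator(data=present_timestamps(chunks), dtype=np.dtype('float64'))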