Hey @rlmv thanks for raising this. I'm looking into it now.
Reproduced. Using `DataChunkIterator` for `data` works:
```python
from datetime import datetime
from hdmf.data_utils import DataChunkIterator
from hdmf.backends.hdf5 import H5DataIO
from pynwb import NWBFile, TimeSeries, NWBHDF5IO
import numpy as np

nwbfile = NWBFile('Test', '123', datetime.now())
data = H5DataIO(DataChunkIterator(np.arange(0, 10, 1)))
timestamps = np.arange(0, 10, 1)
nwbfile.add_acquisition(
    TimeSeries(
        name='Test',
        data=data,
        timestamps=timestamps,
        unit='uV'))
with NWBHDF5IO('out.nwb', 'w') as io:
    io.write(nwbfile)
```
but using `DataChunkIterator` for `timestamps` does not:
```python
from datetime import datetime
from hdmf.data_utils import DataChunkIterator
from hdmf.backends.hdf5 import H5DataIO
from pynwb import NWBFile, TimeSeries, NWBHDF5IO
import numpy as np

nwbfile = NWBFile('Test', '123', datetime.now())
timestamps = H5DataIO(DataChunkIterator(np.arange(0, 10, 1)))
data = np.arange(0, 10, 1)
nwbfile.add_acquisition(
    TimeSeries(
        name='Test',
        data=data,
        timestamps=timestamps,
        unit='uV'))
with NWBHDF5IO('out.nwb', 'w') as io:
    io.write(nwbfile)
```
```
/Users/bendichter/anaconda3/envs/dev_pynwb/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/Users/bendichter/dev/pynwb/src/pynwb/file.py:621: UserWarning: Date is missing timezone information. Updating to local timezone.
  warn("Date is missing timezone information. Updating to local timezone.")
Traceback (most recent call last):
  File "/Users/bendichter/dev/hdmf/src/hdmf/build/map.py", line 984, in __add_datasets
    data, dtype = self.convert_dtype(spec, attr_value)
  File "/Users/bendichter/dev/hdmf/src/hdmf/build/map.py", line 418, in convert_dtype
    ret, ret_dtype = cls.__check_edgecases(spec, value)
  File "/Users/bendichter/dev/hdmf/src/hdmf/build/map.py", line 469, in __check_edgecases
    return value, cls.convert_dtype(spec, value.data)[1]
  File "/Users/bendichter/dev/hdmf/src/hdmf/build/map.py", line 450, in convert_dtype
    ret = dtype_func(value)
TypeError: float() argument must be a string or a number, not 'DataChunkIterator'
```
@rlmv This is a bug and I'll see if I can fix it. I also wonder if you might be better off using `starting_time` and `rate` instead of `timestamps`. That's what we generally recommend when the sampling rate is constant.
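For reference, a minimal sketch of that variant, reusing `nwbfile` and `data` from the first snippet above (the `starting_time` and `rate` values are illustrative):

```python
# With a constant sampling rate, starting_time + rate replace the
# explicit timestamps array.
nwbfile.add_acquisition(
    TimeSeries(
        name='Test',
        data=data,
        starting_time=0.0,  # seconds
        rate=1.0,           # Hz
        unit='uV'))
```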
Iterative write works for `data` and not for `timestamps` because of the differences between their spec definitions in `TimeSeries`: https://github.com/NeurodataWithoutBorders/pynwb/blob/92a0463108c6010811f377c3a02da5d95c959094/src/pynwb/data/nwb.base.yaml#L126-L255. `data` has no `dtype`, while `timestamps` has `dtype: float64`. If you remove `dtype: float64`, or change it to `dtype: numeric`, the above code works. That's not the right solution, though; the right solution is to make iterative write work with datasets that have specific `dtype` definitions.
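The traceback bottoms out in `dtype_func(value)` with `dtype_func` being `float`, so the root cause reduces to the following (a standalone illustration, not a fix):

```python
from hdmf.data_utils import DataChunkIterator
import numpy as np

it = DataChunkIterator(np.arange(0, 10, 1))
# The float64 spec dtype makes the mapper try to coerce the value with
# float(), which an iterator cannot satisfy:
float(it)  # TypeError: float() argument must be a string or a number, not 'DataChunkIterator'
```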
@oruebel, do you have any advice on how to proceed here?
I'll have a look. I think the fix is probably to either add a case to `convert_dtype` to understand `AbstractDataChunkIterator` or to add it to the `ObjectMapper.__no_convert`. I'm testing right now and will get back to you soon.
Adding the following to the `ObjectMapper` class in HDMF fixes the error on write. However, that is only part of the fix; we also need to make sure that the correct type is used on write. I'm looking at that now.
```diff
diff --git a/src/hdmf/build/map.py b/src/hdmf/build/map.py
index d50cdbe..c364599 100644
--- a/src/hdmf/build/map.py
+++ b/src/hdmf/build/map.py
@@ -439,6 +439,9 @@ class ObjectMapper(with_metaclass(ExtenderMeta, object)):
                 ret.append(tmp)
             ret = type(value)(ret)
             ret_dtype = tmp_dtype
+        elif isinstance(value, AbstractDataChunkIterator):
+            ret = value
+            ret_dtype = cls.__resolve_dtype(value.dtype, spec_dtype)
         else:
             if spec_dtype in (_unicode, _ascii):
                 ret_dtype = 'ascii'
```
@bendichter @oruebel Thanks for the quick response! I looked at using `rate` instead of `timestamps` but, while the data in the series does have a constant sampling rate, in some cases there are chunks without samples. In that case, do you still recommend setting a rate, with `NaN` values for the missing data points? Or is it more conventional to write timestamps just for the readings that we do have?
I just committed the following changes to HDMF, which I believe should fix the issue:

- https://github.com/hdmf-dev/hdmf/commit/7c26e63947d659899191460b4e2b6b05055e5e19 fixes the problem of determining the correct dtype based on the spec or the `AbstractDataChunkIterator`.
- https://github.com/hdmf-dev/hdmf/commit/255ee0c512826b4b92468609fe286abe1cc1812c fixes a bug in `H5DataIO` to make sure the dtype from the builder is used on iterative data write.
- https://github.com/hdmf-dev/hdmf/commit/de550a804fbb173331cae30b0a42f619c4ccd50d adds `dtype` and `astype` functions to the `DataChunk` class; that's just a bonus, not something strictly needed for this fix (see the sketch below).
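A quick sketch of how those bonus helpers might be used (assuming the interface described in that commit; the `data`/`selection` values here are made up):

```python
from hdmf.data_utils import DataChunk
import numpy as np

# Wrap an array together with the region of the target dataset it covers.
chunk = DataChunk(data=np.arange(10), selection=np.s_[0:10])

print(chunk.dtype)                   # dtype of the wrapped array, e.g. int64
chunk_f64 = chunk.astype('float64')  # new DataChunk with converted data
print(chunk_f64.dtype)               # float64
```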
I accidentally pushed the changes directly to the HDMF dev branch (i.e., there is no PR for these fixes). We updated the protection settings for the dev branch to also enforce this for administrators ;-)
I'm closing the issue for now, since this is fixed in the latest HDMF. Please reopen the issue if you still see the problem.
> in some cases there are chunks without samples.

I believe in that case using timestamps is appropriate.
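For example (values made up), explicit timestamps encode such gaps directly, without `NaN` padding:

```python
import numpy as np

# 10 Hz sampling with a recording gap between 0.2 s and 1.0 s; only the
# samples that actually exist get a timestamp.
timestamps = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])
data = np.arange(len(timestamps))
```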
1) Bug
I am unable to write the timestamps for a `TimeSeries` using `DataChunkIterator` and `AbstractDataChunkIterator`, which the documentation suggests should be possible. The same error is raised if I wrap the iterator in an `H5DataIO`.

For more context, I'm trying to create an NWB file from some (potentially large) existing data. I would like to stream both the data and timestamps in order not to have to load each array into memory before writing the file; a sketch of the intended pattern follows below.
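A minimal sketch of that streaming pattern (the generator here is a stand-in for a real data source; `maxshape=(None,)` declares the final length as unknown up front):

```python
import numpy as np
from hdmf.data_utils import DataChunkIterator

def sample_generator():
    """Stand-in for reading samples from disk one at a time."""
    for i in range(10):
        yield float(i)

# Wrap the generator so the full array is never materialized in memory.
data = DataChunkIterator(data=sample_generator(),
                         maxshape=(None,),
                         dtype=np.dtype('float64'))
```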
Steps to Reproduce
Stacktrace
When running this under `pytest`, the error that is wrapped by the `raise_from` is `TypeError('float() argument must be a string or a number',)`, but it is not very clear where that error is coming from.

Environment
Checklist