pauladkisson opened this issue 3 months ago

What would you like to see added to NeuroConv?

`DatasetIOConfiguration.from_neurodata_object` currently raises a `NotImplementedError` for compound objects with dtype `object`. This makes it very difficult to use the `KiloSortSortingInterface`, which includes a compound object for unit quality ('good' or 'noise'). Some chunking/compression should be selected by default, maybe with a warning that it might be sub-optimal for compound objects.

Do you have any interest in helping implement the feature?

Yes.

> Some chunking/compression should be selected by default, maybe with a warning that it might be sub-optimal for compound objects.
Why? How are you thinking about this? Is the cost of leaving them uncompressed so high? They usually appear in dynamic tables, where I think the effect is not that large.
> This makes it very difficult to use the `KiloSortSortingInterface`, which includes a compound object for unit quality ('good' or 'noise').
Question about this for my own learning: shouldn't those be strings and not objects?
https://hdmf-schema-language.readthedocs.io/en/latest/description.html#dtype
> Why? How are you thinking about this? Is the cost of leaving them uncompressed so high? They usually appear in dynamic tables, where I think the effect is not that large.
The cost of leaving them uncompressed isn't the problem. The problem is that `get_default_backend_configuration` runs by default in `run_conversion`, which throws an error. Plus, it is actually not so easy to specify no compression, since you have to iterate through the neurodata objects yourself to avoid the error.
To be honest, it might be worthwhile to be able to specify no compression at a higher level, in addition to addressing this specific error.
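For reference, the closest thing to "no compression" right now seems to be building the backend configuration yourself and clearing each dataset's compression method, roughly like the sketch below (helper names assumed from `neuroconv.tools.nwb_helpers`; and of course this still fails if building the default configuration hits the object-dtype error in the first place):

```python
import numpy as np
from pynwb import TimeSeries
from pynwb.testing.mock.file import mock_NWBFile

from neuroconv.tools.nwb_helpers import configure_backend, get_default_backend_configuration

# Stand-in file with one plain numeric dataset; a real conversion's in-memory NWBFile would go here
nwbfile = mock_NWBFile()
nwbfile.add_acquisition(TimeSeries(name="ts", data=np.arange(1000.0), unit="n.a.", rate=1.0))

# Build the default per-dataset configuration, then switch every dataset to no compression
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")
for dataset_configuration in backend_configuration.dataset_configurations.values():
    dataset_configuration.compression_method = None  # None is meant to disable compression
configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)
```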
> Question about this for my own learning: shouldn't those be strings and not objects?
pandas defaults to `object` for strings for backwards compatibility (https://pandas.pydata.org/docs/user_guide/text.html), and spikeinterface's `BasePhyKilosortSortingExtractor` uses `pd.read_csv` to get the unit info: https://github.com/SpikeInterface/spikeinterface/blob/161efc4ac4fd710fcf17cc836da82ac5c3e63c5a/src/spikeinterface/extractors/phykilosortextractors.py#L75
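Just to make that concrete, here is a minimal sketch of what the extractor ends up with (the column names mirror Phy's cluster_group.tsv; the exact file layout isn't the point):

```python
import io

import pandas as pd

# Tiny stand-in for Phy's cluster_group.tsv
tsv = io.StringIO("cluster_id\tgroup\n0\tgood\n1\tnoise\n")
unit_info = pd.read_csv(tsv, sep="\t")

print(unit_info["group"].dtype)             # object (pandas' default dtype for strings)
print(unit_info["group"].to_numpy().dtype)  # object -> this is what ends up as the unit property
```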
Regarding the Phy/Kilosort case, this is easy and we can modify it on the fly here or in spikeinterface.
But in general, maybe we should allow compression settings for strings. The problem would be differentiating which arrays of dtype `object` are really strings and which are not. Do we have examples of `object`-dtype arrays that are not strings? I think things like groups, but we would need to verify.
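One rough way to tell them apart, just as a sketch (the helper name is only for illustration):

```python
import numpy as np


def is_string_object_array(array: np.ndarray) -> bool:
    """Object-dtype array whose elements are all plain Python strings."""
    return array.dtype == object and all(isinstance(element, str) for element in array.ravel())


print(is_string_object_array(np.asarray(["good", "noise"], dtype=object)))     # True
print(is_string_object_array(np.asarray([{"x": 1}, {"y": 2}], dtype=object)))  # False
```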
OK, the Phy/Kilosort problem is fixed, merged, and should be included in the upcoming release of SpikeInterface: https://github.com/SpikeInterface/spikeinterface/pull/3365
This issue can be separated into three main cases:

1) Arrays of strings that have `object` as their dtype. This is the case that started this issue. Here I think we can just add a check that all the elements are strings and then use the same logic as here: https://github.com/catalystneuro/neuroconv/blob/675ec48966eab421d25c15cd2ce6041ae6e8b34c/src/neuroconv/tools/nwb_helpers/_configuration_models/_base_dataset_io.py#L271-L278 (we are already doing that for arrays of numpy type "U", so it should be unified; see the sketch after this list).
2) Arrays that contain a collection of heterogeneous basic types: `np.array(["string", 3.0, 1])`. I think the solution of @pauladkisson in #1042 works for this: no compression, a single chunk as big as the data (for lack of a better option), plus a warning.
3) Arrays that contain Python objects: `np.array([dict(x=1), dict(y=2), dict(z=3)])`. We can't even write those, so maybe we just pass them through without a configuration and let pynwb handle the failure?
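A possible shape for case 1, just as a sketch (not tied to the exact code at the link above):

```python
import numpy as np

# Object-dtype array of strings, like the Kilosort/Phy quality column
data = np.asarray(["good", "noise", "good"], dtype=object)

if data.dtype == object and all(isinstance(element, str) for element in data.ravel()):
    # Rebuilding the array from the Python strings yields a fixed-width unicode dtype
    # (<U5 here), which has a real itemsize and can therefore go through the same
    # chunk/buffer-shape estimation that is already applied to "U" arrays.
    as_unicode = np.asarray([str(element) for element in data.ravel()]).reshape(data.shape)
    print(as_unicode.dtype)           # <U5
    print(as_unicode.dtype.itemsize)  # 20 (5 characters x 4 bytes each)
```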
Examples @CodyCBakerPhD.

Arrays with different basic types:

```python
import numpy as np
from pynwb import TimeSeries, NWBHDF5IO
from pynwb.testing.mock.file import mock_NWBFile
from hdmf.common import DynamicTable

nwbfile = mock_NWBFile()

dynamic_table = DynamicTable(name="table", description="desc")
dynamic_table.add_row()
dynamic_table.add_row()
dynamic_table.add_row()

# Mixing str, float, and int collapses to a fixed-width unicode string dtype
data = np.asarray(["this", 35.0, 1])
dynamic_table.add_column(name="stuff", description="description", data=data)
time_series = TimeSeries(name="ts", data=data.copy(), unit="n.a.", rate=1.0)

nwbfile.add_acquisition(dynamic_table)
nwbfile.add_acquisition(time_series)

# This writes and reads back fine
with NWBHDF5IO("test.nwb", mode="w") as io:
    io.write(nwbfile)

io = NWBHDF5IO("test.nwb", mode="r")
nwbfile_read = io.read()
```
Arrays with Python objects (this does not work):

```python
import numpy as np
from pynwb import NWBHDF5IO
from pynwb.testing.mock.file import mock_NWBFile
from hdmf.common import DynamicTable


class MyObject:
    def __init__(self, name, value):
        self.name = name
        self.value = value


obj1 = MyObject("Object1", 10)
obj2 = MyObject("Object2", 20)
obj3 = MyObject("Object3", 30)

# NumPy arrays containing arbitrary Python objects (neither variant can be written)
object_array = np.array([obj1, obj2, obj3], dtype=object)
object_array = np.array([dict(x=1), dict(y=2), dict(z=3)])

nwbfile = mock_NWBFile()

dynamic_table = DynamicTable(name="table", description="desc")
dynamic_table.add_row()
dynamic_table.add_row()
dynamic_table.add_row()
dynamic_table.add_column(name="stuff", description="description", data=object_array.copy())
# A TimeSeries with the same data fails in the same way:
# time_series = TimeSeries(name="data", data=object_array.copy(), unit="n.a.", rate=1.0)

nwbfile.add_acquisition(dynamic_table)

# Write the NWBFile to a file; this raises because HDF5 has no mapping for arbitrary Python objects
with NWBHDF5IO("test.nwb", mode="w") as io:
    io.write(nwbfile)
```
Sounds good!
I can take 1 & 2 in #1042. Case 3 should probably be a separate PR?
Actually, numpy strings have itemsize 0, so the estimate methods fail.
> we are already doing that for arrays of numpy type "U" so it should be unified
How did you deal with the itemsize problem in those cases?
Can you show me the code that is failing? Maybe I misunderstood, but I just ran:

```python
import numpy as np

x = np.asarray(["a", "b", "c"])
x.dtype.itemsize, x.dtype, np.__version__
```

```
(4, dtype('<U1'), '1.26.4')
```
Yeah, for 3 I would just ask the pynwb team; probably we don't need to do anything about it...
I see now. I was specifying `dtype="U"`, which doesn't have an itemsize, but if all the strings are below a certain length numpy will use a fixed size (`<U1`, `<U2`, `<U3`, etc.).
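For anyone following along, the distinction is just this (a minimal illustration):

```python
import numpy as np

print(np.dtype("U").itemsize)                      # 0 -> the flexible "U" dtype has no fixed width
print(np.asarray(["a", "b", "c"]).dtype)           # <U1, inferred from the longest element
print(np.asarray(["a", "b", "c"]).dtype.itemsize)  # 4 (1 character x 4 bytes)
```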
Ah, that's a catch.