catalystneuro / neuroconv

Create NWB files by converting and combining neural data in proprietary formats and adding essential metadata.
https://neuroconv.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[Feature]: Add Compression support for compound objects #1041

Open pauladkisson opened 3 months ago

pauladkisson commented 3 months ago

What would you like to see added to NeuroConv?

DatasetIOConfiguration.from_neurodata_object currently raises a NotImplementedError for compound objects with dtype 'object'.

This makes it very difficult to use the KiloSortSortingInterface, which includes a compound object for unit quality ('good' or 'noise').
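
For concreteness, a minimal sketch that reproduces the failure (a sketch only; the import path for get_default_backend_configuration follows the neuroconv docs, and the table mimics the KiloSort unit-quality column):

import numpy as np
from hdmf.common import DynamicTable
from pynwb.testing.mock.file import mock_NWBFile

from neuroconv.tools.nwb_helpers import get_default_backend_configuration

nwbfile = mock_NWBFile()

# Strings loaded through pandas end up with dtype object, like KiloSort's 'good'/'noise' labels
table = DynamicTable(name="table", description="desc")
table.add_row()
table.add_row()
table.add_column(name="quality", description="unit quality", data=np.asarray(["good", "noise"], dtype=object))
nwbfile.add_acquisition(table)

# Raises NotImplementedError inside DatasetIOConfiguration.from_neurodata_object
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")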

Some chunking/compression should be selected by default, maybe with a warning that it might be sub-optimal for compound objects.

Do you have any interest in helping implement the feature?

Yes.


h-mayorquin commented 3 months ago

> Some chunking/compression should be selected by default, maybe with a warning that it might be sub-optimal for compound objects.

Why? How are you thinking about this? Is the cost of leaving them uncompressed so high? They usually appear on dynamic tables where I think the effect is not that large.

> This makes it very difficult to use the KiloSortSortingInterface, which includes a compound object for unit quality ('good' or 'noise').

Question about this for my own learning: shouldn't those be strings and not objects?

https://hdmf-schema-language.readthedocs.io/en/latest/description.html#dtype

pauladkisson commented 3 months ago

> Why? How are you thinking about this? Is the cost of leaving them uncompressed so high? They usually appear on dynamic tables where I think the effect is not that large.

The cost of leaving them uncompressed isn't the problem. The problem is that get_default_backend_configuration runs by default in run_conversion, which throws an error. Plus, it is actually not so easy to specify no compression, since you have to iterate through the neurodata objects yourself to avoid the error.

To be honest, it might be worthwhile to be able to specify no-compression at a higher level in addition to addressing this specific error.
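
To illustrate, the natural route to "no compression" would be something like the sketch below (function and field names assumed from the backend-configuration workflow, not a verified API contract), but it doesn't help here because get_default_backend_configuration itself raises on the compound object before you ever get to edit the configurations:

from pynwb.testing.mock.file import mock_NWBFile

from neuroconv.tools.nwb_helpers import configure_backend, get_default_backend_configuration

nwbfile = mock_NWBFile()  # stand-in for an in-memory file containing the object-dtype column

# This call already throws the NotImplementedError for compound objects...
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")

# ...so you never reach the loop that would disable compression per dataset
for dataset_configuration in backend_configuration.dataset_configurations.values():
    dataset_configuration.compression_method = None  # assumed attribute name
configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)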

pauladkisson commented 3 months ago

> Question about this for my own learning: shouldn't those be strings and not objects?

pandas defaults to object for strings for backwards compatibility (https://pandas.pydata.org/docs/user_guide/text.html). And spikeinterface's BasePhyKilosortSortingExtractor uses pd.read_csv to get the unit info: https://github.com/SpikeInterface/spikeinterface/blob/161efc4ac4fd710fcf17cc836da82ac5c3e63c5a/src/spikeinterface/extractors/phykilosortextractors.py#L75
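
For example, even a purely string column comes back with object dtype (column names here mirror phy's cluster_group.tsv):

import io

import pandas as pd

tsv = io.StringIO("cluster_id\tgroup\n0\tgood\n1\tnoise\n")
unit_info = pd.read_csv(tsv, sep="\t")

print(unit_info["group"].dtype)             # object
print(unit_info["group"].to_numpy().dtype)  # object, not a fixed-width numpy string type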

h-mayorquin commented 3 months ago

Regarding Phy/Kilosort, this is an easy case that we can modify on the fly, either here or in spikeinterface.

But in general, maybe we should allow compression settings for strings. The problem would be differentiating which arrays of dtype object are really strings and which are not. Do we have examples of object-dtype arrays that are not strings? I think things like groups, but we would need to verify.
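
A minimal version of that check could look like this (a sketch; edge cases such as empty arrays and missing values are ignored):

import numpy as np

def is_string_object_array(array: np.ndarray) -> bool:
    """Treat an object-dtype array as string data only if every element is a str."""
    if array.dtype != object:
        return False
    return all(isinstance(element, str) for element in array.ravel())

assert is_string_object_array(np.asarray(["good", "noise"], dtype=object))
assert not is_string_object_array(np.array([dict(x=1), dict(y=2)], dtype=object))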

h-mayorquin commented 2 months ago

OK, the Phy/Kilosort problem is fixed, merged, and should be included in the upcoming release of SpikeInterface: https://github.com/SpikeInterface/spikeinterface/pull/3365

This issue can be separated into three main cases:

1) Arrays of strings that have object as their dtype. This is the case that started this issue. Here, I think we can just add a check that all the elements are strings and then use the same logic as here: https://github.com/catalystneuro/neuroconv/blob/675ec48966eab421d25c15cd2ce6041ae6e8b34c/src/neuroconv/tools/nwb_helpers/_configuration_models/_base_dataset_io.py#L271-L278 (we are already doing that for arrays of numpy type "U", so the two paths should be unified).

2) Arrays that contain a collection of heterogeneous basic types: np.array(["string", 3.0, 1]). I think the solution of @pauladkisson in #1042 works for this: no compression, a chunk as big as the data (for lack of a better option), and a warning (see the sketch after this list).

3) Arrays that contain Python objects: np.array([dict(x=1), dict(y=2), dict(z=3)]). We can't even write those, so maybe we just pass them through and let pynwb handle the failure?
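
For case 2, a sketch of that fallback (illustrative names only, not the actual #1042 implementation):

import warnings

import numpy as np

def fallback_dataset_settings(array: np.ndarray):
    """Illustrative fallback for compound/object dtypes: no compression, one full-size chunk."""
    warnings.warn(
        "Compound/object dtype detected: using a single full-shape chunk and no "
        "compression; this may be sub-optimal.",
        stacklevel=2,
    )
    chunk_shape = array.shape   # one chunk as big as the data, for lack of a better option
    buffer_shape = array.shape  # likewise for the iterator buffer
    compression_method = None
    return chunk_shape, buffer_shape, compression_method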

Examples (cc @CodyCBakerPhD). Arrays with mixed basic types:

import numpy as np
from hdmf.common import DynamicTable
from pynwb import NWBHDF5IO, TimeSeries
from pynwb.testing.mock.file import mock_NWBFile

nwbfile = mock_NWBFile()

dynamic_table = DynamicTable(name="table", description="desc")
dynamic_table.add_row()
dynamic_table.add_row()
dynamic_table.add_row()

# numpy coerces this mixed list to a fixed-width unicode string dtype, not object
data = np.asarray(["this", 35.0, 1])
dynamic_table.add_column(name="stuff", description="description", data=data)

time_series = TimeSeries(name="ts", data=data.copy(), unit="n.a.", rate=1.0)

nwbfile.add_acquisition(dynamic_table)
nwbfile.add_acquisition(time_series)

# Writing and reading back both succeed
with NWBHDF5IO("test.nwb", mode="w") as io:
    io.write(nwbfile)

with NWBHDF5IO("test.nwb", mode="r") as io:
    nwbfile_read = io.read()

Arrays with Python objects (this does not work):

import numpy as np
from hdmf.common import DynamicTable
from pynwb import NWBHDF5IO
from pynwb.testing.mock.file import mock_NWBFile


class MyObject:
    def __init__(self, name, value):
        self.name = name
        self.value = value


obj1 = MyObject("Object1", 10)
obj2 = MyObject("Object2", 20)
obj3 = MyObject("Object3", 30)

# NumPy arrays containing Python objects (neither variant can be written)
object_array = np.array([obj1, obj2, obj3], dtype=object)
object_array = np.array([dict(x=1), dict(y=2), dict(z=3)])

nwbfile = mock_NWBFile()

dynamic_table = DynamicTable(name="table", description="desc")
dynamic_table.add_row()
dynamic_table.add_row()
dynamic_table.add_row()

dynamic_table.add_column(name="stuff", description="description", data=object_array.copy())

# A TimeSeries built from object_array fails the same way:
# time_series = TimeSeries(name="data", data=object_array.copy(), unit="n.a.", rate=1.0)

nwbfile.add_acquisition(dynamic_table)

# Write the NWBFile to a file: this raises, since h5py cannot serialize arbitrary Python objects
with NWBHDF5IO("test.nwb", mode="w") as io:
    io.write(nwbfile)

pauladkisson commented 2 months ago

Sounds good!

I can take cases 1 and 2 in #1042.

Case 3 should probably be a separate PR?

pauladkisson commented 2 months ago

Actually, numpy strings have itemsize 0, so the estimate methods fail.

> we are already doing that for arrays of numpy type "U" so it should be unified

How did you deal with the itemsize problem in those cases?

h-mayorquin commented 2 months ago

Can you show me the code that is failing? Maybe I misunderstood, but I just ran:

import numpy as np

x = np.asarray(["a", "b", "c"])
x.dtype.itemsize, x.dtype, np.__version__
# (4, dtype('<U1'), '1.26.4')

Yeah, for 3 I would just ask the pynwb team; probably we don't need to do anything about it...

pauladkisson commented 2 months ago

I see now.

I was specifying dtype="U", which doesn't have a fixed itemsize, but if the dtype is inferred from the data, numpy uses a fixed width (<U1, <U2, <U3, etc.).
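
Concretely:

import numpy as np

# The flexible "U" dtype (no length) has itemsize 0, which breaks the size estimates
print(np.dtype("U").itemsize)  # 0

# Letting numpy infer the dtype from the data yields a fixed-width string type instead
x = np.asarray(["good", "noise"])
print(x.dtype, x.dtype.itemsize)  # <U5 20  (4 bytes per UCS-4 code point)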

h-mayorquin commented 2 months ago

Ah, that's a catch.