catalystneuro / neuroconv

Create NWB files by converting and combining neural data in proprietary formats and adding essential metadata.
https://neuroconv.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[Bug]: Recommended Chunk Shape doesn't take into account compound dtypes #1122

Open · pauladkisson opened this issue 3 weeks ago

pauladkisson commented 3 weeks ago

What happened?

Dev tests are failing: https://github.com/catalystneuro/neuroconv/actions/runs/11567195585/job/32197117750

Tracked this down to an update with hdmf that uncovered these lines: https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/backends/hdf5/h5tools.py#L1427-L1435

And our chunking recommendation based on the data shape here: https://github.com/catalystneuro/neuroconv/blob/main/src/neuroconv/tools/nwb_helpers/_configuration_models/_base_dataset_io.py#L263

Notice that in hdmf, if the data has a compound dtype, the shape is (len(data),), while in neuroconv the shape is always get_data_shape(data).

This raises an error when the two mismatch, which happens for a Caiman pixel_mask, a compound dtype in NWB.
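
To make the mismatch concrete, here is a minimal sketch (illustrative data, not taken from the failing test) comparing the two shape computations on pixel_mask-style rows:

    from hdmf.utils import get_data_shape

    # pixel_mask-style compound rows: (x, y, weight)
    data = [(0, 0, 1.0), (1, 0, 0.5), (2, 1, 0.25)]

    print(get_data_shape(data))  # (3, 3) -- the shape neuroconv chunks against
    print((len(data),))          # (3,)   -- the shape hdmf uses for compound dtypes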

The initial solution I came up with was to load the NWB schema to figure out whether a dataset is compound, but I was having some trouble finding the right code to load the schema...

Steps to Reproduce

n/a

Traceback

No response

Operating System

Linux

Python Executable

Conda

Python Version

3.9

Package Versions

No response

pauladkisson commented 2 weeks ago

Update: I figured out that I can use the builder's dtype (from io.get_builder(dataset)) to check for compound dtypes, BUT io.get_builder only works when the nwbfile has been read from disk -- it returns None when the nwbfile is in memory.
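
For reference, a minimal sketch of that builder-based check (the file path and object layout are hypothetical, and it assumes compound dtypes appear on the builder's dtype as a list of sub-dtype specs):

    from pynwb import NWBHDF5IO

    with NWBHDF5IO("example.nwb", "r") as io:  # hypothetical file
        nwbfile = io.read()
        # hypothetical layout; pixel_mask is a compound column of PlaneSegmentation
        plane_seg = nwbfile.processing["ophys"]["ImageSegmentation"]["PlaneSegmentation"]
        builder = io.get_builder(plane_seg.pixel_mask)
        # assumption: a compound dtype is represented as a list on the builder
        is_compound = isinstance(builder.dtype, list)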

pauladkisson commented 2 weeks ago

@rly, when you have a chance could you provide some guidance on this issue?

How can I get a builder from an in-memory nwbfile? Or, if that is too difficult, how can I get access to the schema for a given neurodata object?

rly commented 3 days ago

@pauladkisson I do not totally follow what happened here. If I understand correctly, get_data_shape changed in HDMF and this broke something in neuroconv. What was the behavior of get_data_shape for compound data types before the change? Is this a bug in HDMF?

An in-memory (not yet written) NWB file has no builders until you call write. You can use HDMF internal functions to get the spec if you really need to:

    import pynwb

    manager = pynwb.get_manager()
    # fetch the core-namespace spec for PlaneSegmentation, then its pixel_mask dataset spec
    spec = manager.type_map.namespace_catalog.get_spec("core", "PlaneSegmentation").get_dataset("pixel_mask")

pauladkisson commented 3 days ago

If I understand correctly, get_data_shape changed in HDMF and this broke something in neuroconv.

Well, it isn't get_data_shape per se but rather this block of code:

        # define the data shape
        if 'shape' in io_settings:
            data_shape = io_settings.pop('shape')
        elif hasattr(data, 'shape'):
            data_shape = data.shape
        elif isinstance(dtype, np.dtype) and len(dtype) > 1:  # check if compound dtype
            data_shape = (len(data),)
        else:
            data_shape = get_data_shape(data)

I'm not sure exactly how, but some recent change uncovered this block of code so that __list_fill__ runs instead of __scalar_fill__. I had tracked down the responsible PR previously, but unfortunately didn't write it down.

Is this a bug in HDMF?

I don't think so.
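
For context, one possible direction (a hedged sketch with a hypothetical function name, not an agreed-upon fix) would be for neuroconv to mirror hdmf's branching when computing the shape used for the chunk recommendation:

    import numpy as np
    from hdmf.utils import get_data_shape

    def shape_for_chunk_recommendation(data):
        # mirror hdmf's __list_fill__ logic: treat compound-dtype data as 1-D
        dtype = getattr(data, "dtype", None)
        if isinstance(dtype, np.dtype) and len(dtype) > 1:  # compound dtype
            return (len(data),)
        return get_data_shape(data)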

rly commented 2 days ago

Another approach that may be more robust: it does not require knowledge of the spec, but it does require knowledge of the pynwb object attributes:

    import pynwb
    from pynwb.testing.mock.ophys import mock_PlaneSegmentation

    manager = pynwb.get_manager()
    ps = mock_PlaneSegmentation()
    # map the in-memory container to its ObjectMapper, then pull the pixel_mask sub-spec
    spec = manager.type_map.get_map(ps).get_attr_spec("pixel_mask")
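
Building on that, a hedged check for compound dtypes could then look like this (an assumption about how hdmf represents compound dtypes on specs, not a verified API guarantee):

    # assumption: a compound dtype shows up on the spec as a list of sub-dtype specs
    is_compound = isinstance(spec.dtype, list)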