dandi / helpdesk

Repository to track help tickets from users.

Error while reading a dandiset using NWBHDF5IO that has ImagingVolume #126

Open craterkamath opened 6 months ago

craterkamath commented 6 months ago

Bug description

I'm trying to stream/download and read dandisets from dandihub, and the ones that contain an ImagingVolume, for example DANDI:000776, throw the error below:

Traceback (most recent call last):
  File "/home/vinayaka/anaconda3/envs/dandi/lib/python3.8/site-packages/hdmf/build/objectmapper.py", line 1258, in construct
    obj = self.__new_container__(cls, builder.source, parent, builder.attributes.get(self.__spec.id_key()),
  File "/home/vinayaka/anaconda3/envs/dandi/lib/python3.8/site-packages/hdmf/build/objectmapper.py", line 1271, in __new_container__
    obj.__init__(**kwargs)
  File "/home/vinayaka/anaconda3/envs/dandi/lib/python3.8/site-packages/hdmf/utils.py", line 664, in func_call
    return func(args[0], **pargs)
  File "/home/vinayaka/anaconda3/envs/dandi/lib/python3.8/site-packages/hdmf/build/classgenerator.py", line 339, in __init__
    setattr(self, f, arg_val)
  File "/home/vinayaka/anaconda3/envs/dandi/lib/python3.8/site-packages/hdmf/container.py", line 528, in container_setter
    ret[idx2](self, val)  # call the previous setter
  File "/home/vinayaka/anaconda3/envs/dandi/lib/python3.8/site-packages/hdmf/container.py", line 518, in container_setter
    raise ValueError(msg)
ValueError: Field 'order_optical_channels' on ImagingVolume must be named 'order_optical_channels'.

The error is not present when I try to read other datasets that do not have ImagingVolumes.

How to reproduce

I'm using the below code to read the dandiset

from dandi.dandiapi import DandiAPIClient
import pynwb
import h5py
from pynwb import NWBHDF5IO
import remfile

dandi_id = '000776'
with DandiAPIClient() as client:
    dandiset = client.get_dandiset(dandi_id, 'draft')
    for asset in dandiset.get_assets():
        s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
        file = remfile.File(s3_url)

        with h5py.File(file, 'r') as f:
            with NWBHDF5IO(file=f, mode='r', load_namespaces=True) as io:
                read_nwb = io.read()
                identifier = read_nwb.identifier
                seg = read_nwb.processing['NeuroPAL']['NeuroPALSegmentation']['NeuroPALNeurons'].voxel_mask[:]
                labels = read_nwb.processing['NeuroPAL']['NeuroPALSegmentation']['NeuroPALNeurons']['ID_labels'][:]
                channels = read_nwb.acquisition['NeuroPALImageRaw'].RGBW_channels[:] #get which channels of the image correspond to which RGBW pseudocolors
                image = read_nwb.acquisition['NeuroPALImageRaw'].data[:]
                scale = read_nwb.imaging_planes['NeuroPALImVol'].grid_spacing[:] # get the voxel spacing (scale) of the image
                imvol = read_nwb.imaging_planes['NeuroPALImVol']
                print(imvol)
        print(identifier)
        break

Your personal setup

OS:

My package versions are below:

dandi==0.59.0
dandischema==0.8.4
pynwb==2.5.0
hdmf==3.11.0

Python environment to reproduce:

aiohttp==3.9.1
aiosignal==1.3.1
appdirs==1.4.4
arrow==1.3.0
asciitree==0.3.3
async-timeout==4.0.3
attrs==23.2.0
bidsschematools==0.7.2
blessed==1.20.0
boto3==1.34.20
botocore==1.34.20
certifi @ file:///croot/certifi_1700501669400/work/certifi
cffi==1.16.0
charset-normalizer==3.3.2
ci-info==0.3.0
click==8.1.7
click-didyoumean==0.3.0
cryptography==41.0.7
dandi==0.59.0
dandischema==0.8.4
dnspython==2.4.2
email-validator==2.1.0.post1
etelemetry==0.3.1
fasteners==0.19
fqdn==1.5.1
frozenlist==1.4.1
fscacher==0.4.0
fsspec==2023.12.2
h5py==3.10.0
hdmf==3.11.0
humanize==4.9.0
idna==3.6
importlib-metadata==7.0.1
importlib-resources==6.1.1
interleave==0.2.1
isodate==0.6.1
isoduration==20.11.0
jaraco.classes==3.3.0
jeepney==0.8.0
jmespath==1.0.1
joblib==1.3.2
jsonpointer==2.4
jsonschema==4.21.0
jsonschema-specifications==2023.12.1
keyring==24.3.0
keyrings.alt==5.0.0
more-itertools==10.2.0
multidict==6.0.4
natsort==8.4.0
numcodecs==0.12.1
numpy==1.24.4
nwbinspector==0.4.31
packaging==23.2
pandas==2.0.3
pkgutil_resolve_name==1.3.10
platformdirs==4.1.0
pycparser==2.21
pycryptodomex==3.20.0
pydantic==1.10.13
pynwb==2.5.0
pyout==0.7.3
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
referencing==0.32.1
remfile==0.1.10
requests==2.31.0
rfc3339-validator==0.1.4
rfc3987==1.3.8
rpds-py==0.17.1
ruamel.yaml==0.18.5
ruamel.yaml.clib==0.2.8
s3fs==0.4.2
s3transfer==0.10.0
scipy==1.10.1
SecretStorage==3.3.3
semantic-version==2.10.0
six==1.16.0
tenacity==8.2.3
tqdm==4.66.1
types-python-dateutil==2.8.19.20240106
typing_extensions==4.9.0
tzdata==2023.4
uri-template==1.3.0
urllib3==1.26.18
wcwidth==0.2.13
webcolors==1.13
yarl==1.9.4
zarr==2.16.1
zarr-checksum==0.2.12
zipp==3.17.0

CodyCBakerPhD commented 6 months ago

attn: @rly

Perhaps an extension issue?

rly commented 6 months ago

Hi @craterkamath , the issue appears to be with the NWB file.

The spec for ImagingVolume says:

{
  "neurodata_type_def": "ImagingVolume",
  "neurodata_type_inc": "ImagingPlane",
  "doc": "An Imaging Volume and its Metadata",
  "groups": [
    {
      "doc": "An optical channel used to record from an imaging volume",
      "quantity": "*",
      "neurodata_type_inc": "OpticalChannelPlus"
    },
    {
      "doc": "Ordered list of names of the optical channels in the data",
      "name": "order_optical_channels",
      "neurodata_type_inc": "OpticalChannelReferences"
    }
  ]
}

which specifies that ImagingVolume has a subgroup of neurodata type OpticalChannelReferences with the required name "order_optical_channels". However, in these files, the ImagingVolume objects contain a link to a group of neurodata type OpticalChannelReferences named "OpticalChannelRefs" (it lives at the path /processing/NeuroPAL/OpticalChannelRefs). This means the file does not conform to the extension spec. Unfortunately, it is most likely an oversight in PyNWB that the file could be created this way, and a bug in the validator that this was not caught. I will create tickets on PyNWB for these issues.
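
For reference, here is one way to see the mismatch in the raw HDF5 structure with h5py (a minimal sketch; "local_copy.nwb" is a placeholder for a local download of one affected file, and the group path comes from the reproduction code above):

import h5py

with h5py.File("local_copy.nwb", "r") as f:
    imvol = f["/general/optophysiology/NeuroPALImVol"]
    for name in imvol:
        # getlink=True returns the link object itself instead of resolving it
        link = imvol.get(name, getlink=True)
        print(name, type(link).__name__)  # SoftLink = stored as a link; HardLink = a real subgroup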

@dysprague Did you create these files? I think the ones with ImagingVolume objects will have to be fixed to be valid NWB files and readable by the NWB APIs. Sorry. You can do that by one or both of the following:

  1. adjusting the script used to generate these files and re-generating the files from scratch, and/or
  2. performing "data surgery" to move and rename the group into the right place in these files, without re-generating them from scratch.

I know these are big files, so doing both may be best (option 1 fixes future runs of the script; option 2 fixes existing files quickly). I can help with either option.

attn @oruebel @bendichter in case you see this issue elsewhere

dysprague commented 5 months ago

Hi @rly, thanks for the help on this. What you're saying mostly makes sense to me, but I have a few questions.

I am able to open these files completely fine on my own laptop. There is also dandiset 000692, created by Kotaro Kimura using the same spec, which I am also able to open fine. The other thing that confuses me is that in the spec, the MultiChannelVolume object also has an 'order_optical_channels' subgroup, defined and set in exactly the same way as for ImagingVolume, so I'm not sure why the error is only thrown for the ImagingVolume object.

When creating the 'ImagingVolume' object, how would I add the OpticalChannelReferences object as a subgroup rather than as a link?
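
My rough understanding of the general HDMF rule, illustrated with core PyNWB types (I'm assuming the auto-generated extension classes behave the same way), is that a container is written as a subgroup of the first parent it is assigned to, and as a link everywhere else:

from pynwb.device import Device
from pynwb.ophys import ImagingPlane, OpticalChannel

device = Device(name="Microscope")
channel = OpticalChannel(name="ch0", description="green channel", emission_lambda=525.0)

# `channel` has no parent yet, so passing it to the constructor makes it a
# subgroup of the plane when the file is written.
plane = ImagingPlane(
    name="NeuroPALImVol",
    optical_channel=[channel],
    description="example plane",
    device=device,
    excitation_lambda=488.0,
    indicator="NeuroPAL",
    location="head",
)
# If `channel` had already been assigned to another parent container first,
# it would be written here as a link instead.

So is the fix to construct the OpticalChannelReferences with the required name 'order_optical_channels' and pass it only to the ImagingVolume constructor, without also adding it to the processing module?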

I can definitely update the code used to generate these files, but as you said these files are large, so it might be better to perform targeted data updates rather than fully regenerating the files. I would appreciate some help figuring out how to do that.

Thanks, Daniel

rly commented 5 months ago

To follow up, @dysprague and I connected over Slack. @dysprague adjusted the script and the ndx-multichannel-volume extension used to generate the files. I wrote a script to do the following data surgery steps for existing files (a rough sketch of the approach follows the list):

  1. Replace the "order_optical_channels" link in each ImagingVolume object with a subgroup that is a copy of the link target (at /processing/NeuroPAL/OpticalChannelRefs)
  2. Add the "ndx-multichannel-volume" version 0.1.12 schema
  3. Remove the "ndx-multichannel-volume" version 0.1.9 schema
  4. Remove the "order_optical_channels" group from all "MultiChannelVolume" objects
  5. Remove the "OpticalChannelRefs" group within "/processing"
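
A rough h5py sketch of those steps (the paths, and the assumption that MultiChannelVolume objects live under /acquisition, are based on this dandiset's layout; the real script works against the actual file structure):

import h5py

def fix_file(path):
    with h5py.File(path, "r+") as f:
        # Step 1: replace the "order_optical_channels" soft link on each
        # imaging volume with an in-place copy of its target group.
        for plane in f["/general/optophysiology"].values():
            link = plane.get("order_optical_channels", getlink=True)
            if isinstance(link, h5py.SoftLink):
                target = f[link.path]  # /processing/NeuroPAL/OpticalChannelRefs
                del plane["order_optical_channels"]
                f.copy(target, plane, name="order_optical_channels")

        # Steps 2-3: swap the cached extension schema. Writing the 0.1.12
        # schema is omitted here; it could be copied (also with f.copy) from
        # a file generated with the updated extension.
        cached = f["/specifications/ndx-multichannel-volume"]
        if "0.1.9" in cached:
            del cached["0.1.9"]

        # Step 4: drop the "order_optical_channels" group from
        # MultiChannelVolume objects (assumed to live under /acquisition).
        for obj in f["/acquisition"].values():
            if isinstance(obj, h5py.Group) and "order_optical_channels" in obj:
                del obj["order_optical_channels"]

        # Step 5: remove the now-redundant group under /processing.
        if "/processing/NeuroPAL/OpticalChannelRefs" in f:
            del f["/processing/NeuroPAL/OpticalChannelRefs"]

Note that HDF5 does not reclaim space when objects are deleted, so the files would typically be repacked afterwards (e.g., with h5repack).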

The next step is to run a script that checks each NWB file in dandiset 000776 for this issue (a sketch of that check is below) and then, for each affected file, downloads it, runs the above surgery script on it, and re-uploads it.
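
For illustration, a sketch of the check pass, reusing the streaming pattern from the top of this issue (the group path and link name come from the discussion above):

from dandi.dandiapi import DandiAPIClient
import h5py
import remfile

def has_bad_link(f):
    # True if any imaging volume stores "order_optical_channels" as a soft link
    planes = f.get("/general/optophysiology")
    if planes is None:
        return False
    for plane in planes.values():
        if isinstance(plane, h5py.Group):
            link = plane.get("order_optical_channels", getlink=True)
            if isinstance(link, h5py.SoftLink):
                return True
    return False

with DandiAPIClient() as client:
    dandiset = client.get_dandiset("000776", "draft")
    for asset in dandiset.get_assets():
        s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
        with h5py.File(remfile.File(s3_url), "r") as f:
            if has_bad_link(f):
                print(asset.path)  # flag this file for download + surgery + re-upload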

We will also want to adjust the NWB files in dandisets 000715, 000565, 000541, 000472, 000714, and possibly 000692.

This PR in HDMF https://github.com/hdmf-dev/hdmf/pull/1050 will catch these errors in the future during validation. I opened an issue in HDMF https://github.com/hdmf-dev/hdmf/issues/1051 to note that these name mismatch issues, or more generally, all validation issues, should raise an error on write.

craterkamath commented 5 months ago

Thanks @rly and @dysprague. Hoping to see the updated dataset on the DANDI Archive soon!