catalystneuro / neuroconv

Create NWB files by converting and combining neural data in proprietary formats and adding essential metadata.
https://neuroconv.readthedocs.io
BSD 3-Clause "New" or "Revised" License
50 stars 21 forks source link

[Bug]: Non-compliant JSON allowed through NWBMetadataEncoder #515

Closed garrettmflynn closed 2 months ago

garrettmflynn commented 1 year ago

What happened?

When trying to decode the metadata results received from the BrukerTif/NCCR32_2023_02_20_Into_the_void_t_series_baseline-000 dataset (ophys_testing_data/imaging_dataset) on the browser for the NWB GUIDE, there are breaking errors because NaN values are not part of the JSON specification but allowed with json.dumps and json.loads.

Steps to Reproduce

Run the following script on the DANDI Hub using an environment with neuroconv[ophys] installed.

from neuroconv.datainterfaces import BrukerTiffImagingInterface
from neuroconv import NWBConverter
from neuroconv.utils import NWBMetaDataEncoder
import json

class CustomNWBConverter(NWBConverter):
    data_interface_classes = {
        "Bruker": BrukerTiffImagingInterface
    }

converter = CustomNWBConverter({
    "Bruker": {
        "folder_path": "/shared/catalystneuro/ophys_testing_data/imaging_datasets/BrukerTif/NCCR32_2023_02_20_Into_the_void_t_series_baseline-000"
    }
})
metadata = converter.get_metadata()

jsonWithNaN = json.dumps(metadata, cls=NWBMetaDataEncoder)

The jsonWithNaN is a JSON object that will be sent to the browser with NaN values.

If you'd like to get an error on the Python side, you can simply run json.dumps with the allow_nan kwarg set to False:

jsonWithNaN = json.dumps(metadata, cls=NWBMetaDataEncoder, allow_nan=False)

Traceback

This is an error from the Chrome Developer Console as a result of trying to parse the invalid JSON string:

Uncaught (in promise) SyntaxError: Unexpected token N in JSON at position 401

The traceback for using allow_nan=False is as follows:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[30], line 1
----> 1 jsonWithNaN = json.dumps(metadata, cls=NWBMetaDataEncoder, allow_nan=False)

File /opt/conda/lib/python3.10/json/__init__.py:238, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    232 if cls is None:
    233     cls = JSONEncoder
    234 return cls(
    235     skipkeys=skipkeys, ensure_ascii=ensure_ascii,
    236     check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237     separators=separators, default=default, sort_keys=sort_keys,
--> 238     **kw).encode(obj)

File /opt/conda/lib/python3.10/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195         return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed.  The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File /opt/conda/lib/python3.10/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

ValueError: Out of range float values are not JSON compliant

Operating System

macOS

Python Executable

Conda

Python Version

3.7

Package Versions

No response

Code of Conduct

CodyCBakerPhD commented 1 year ago

Thanks for getting the issue started @garrettmflynn

For next time I'd also love to see the raw output of that JSON dump, which is what I'm actually interested in knowing 'why we need or have NaNs there already'

Here's the output from print(json.dumps(metadata, cls=NWBMetaDataEncoder, indent=4))

{
    "NWBFile": {
        "session_description": "Auto-generated by neuroconv",
        "identifier": "85131ea3-4d27-44dc-b83a-d55e784ffb1b",
        "session_start_time": "2023-02-20T15:58:25"
    },
    "Ophys": {
        "Device": [
            {
                "name": "BrukerFluorescenceMicroscope",
                "description": "Version 5.6.64.400"
            }
        ],
        "ImagingPlane": [
            {
                "name": "ImagingPlane",
                "description": "The plane imaged at 5e-06 meters depth.",
                "excitation_lambda": NaN,
                "indicator": "unknown",
                "location": "unknown",
                "device": "BrukerFluorescenceMicroscope",
                "optical_channel": [
                    {
                        "name": "Ch2",
                        "emission_lambda": NaN,
                        "description": "An optical channel of the microscope."
                    }
                ],
                "imaging_rate": 30.345939461428763,
                "grid_spacing": [
                    1.1078125e-06,
                    1.1078125e-06
                ]
            }
        ],
        "TwoPhotonSeries": [
            {
                "name": "TwoPhotonSeries",
                "description": "Imaging data acquired from the Bruker Two-Photon Microscope.",
                "unit": "px",
                "imaging_plane": "ImagingPlane",
                "dimension": [
                    512,
                    512
                ],
                "format": "tiff",
                "scan_line_rate": 15840.580398865815,
                "field_of_view": [
                    0.0005672,
                    0.0005672,
                    5e-06
                ]
            }
        ]
    }
}

And so the culprits are revealed to be the excitation and emission lambdas. This should be a problem for any ophys interface, was Bruker the first you've tested?

If so, then I'd hold off for a second on trying any other ophys interfaces until we can resolve this

garrettmflynn commented 1 year ago

Yep my bad, just slipped from my mind.

Bruker was just the first I've tested, and you're correct that it shows up for all other ophys interfaces.

Yeah it looks like it's part of the roiextractors.py tool.

CodyCBakerPhD commented 1 year ago

@garrettmflynn Realizing this will be a problem as we add more ophys interfaces in the coming weeks

See https://stackoverflow.com/a/6602204/13616096, they seem to indicate NaN, though not a part of JSON, can be interpreted by JavaScript just fine with the right approach, which would be great since then we can consider NeuroConv as producing 'extended JSON' rather than 'pure JSON'

garrettmflynn commented 1 year ago

Is there a reason that it is unreasonable for NaN to be converted into None by NeuroConv? While it is possible to run eval('(' + '{"test":NaN}' + ')') to parse out the JS object from a non-compliant JSON string, it's still definitely hacky / outdated—as well as requires some knowledge of this decision by any API consumers.

CodyCBakerPhD commented 1 year ago

Is there a reason that it is unreasonable for NaN to be converted into None by NeuroConv?

None isn't the issue, mapping onto null is - would have to schematically define it to be either float or null, if you know how to do that really quickly/easily feel free to open direct PR to edit the base schema

Because the type of the field is a float - it makes sense for it to be NaN, at least in Python land. There I can dump and load NaN with a single flag

Probably a divisive issue for crossing into JS land, where even the JSON.parse package doesn't do that?

garrettmflynn commented 1 year ago

So JSON.stringify({test: NaN}) will produce {"test":null}, which is the string I'd want from the API. Loading NaN directly from the string would require using eval, which just treats the string as JS code (in which NaN is acceptable).

Here is how you'd declare that joint type in the schema: https://stackoverflow.com/questions/22565005/json-schema-validate-a-number-or-null-value

In this case, however, would it be possible to keep the schema the same and simply undeclare (i.e remove from the object) the key/value pairs provided as a NaN in the metadata? Are these fields actually required to begin with?

CodyCBakerPhD commented 1 year ago

Are these fields actually required to begin with?

Yeah, which is why we include it here in the first place: https://nwb-schema.readthedocs.io/en/latest/format.html#sec-imagingplane-src

Here is how you'd declare that joint type in the schema: https://stackoverflow.com/questions/22565005/json-schema-validate-a-number-or-null-value

OK, so for NeuroConv fix we would need to find all the right places to inject that - which could take a bit of time

So the question is, for the time being, what is the easiest/quickest fix? How long would it take to implement the hacky solution on the GUIDE side?

garrettmflynn commented 1 year ago

A good 20min for implementation and full tests. I'll do that now.

Is this going to only crop up for resolved metadata or other requests as well?

CodyCBakerPhD commented 1 year ago

Is this going to only crop up for resolved metadata or other requests as well?

I don't know the difference 😆

You might see NaN for any numeric value output from API calls involving the .get_metadata method called on an Interface/Converter

But this is the number one case of it and I can't even think of any others off the top of my head

garrettmflynn commented 1 year ago

Okay we should be good for now then.

h-mayorquin commented 2 months ago

It seems that this was fixed with #393? @CodyCBakerPhD

CodyCBakerPhD commented 2 months ago

393 is the next step in cloud development

But I think this was resolved over on GUIDE; it was an issue as I recall with interoperability of JSON validation between Java/TypeScript and Python