NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io
Other
177 stars 84 forks source link

or just a question on difference with invocation of pynwb.validate #1321

Open yarikoptic opened 3 years ago

yarikoptic commented 3 years ago

@t-b asked and provided a sample file on dandi slack channel, which seems to validate "fine" if

$> python3 -m pynwb.validate 496035.05.02.nwb
Validating 496035.05.02.nwb against cached namespace information using namespace ndx-mies.
 - no errors found.

but errors are returned if

$> python3 -c 'import pynwb; print(pynwb.validate(pynwb.NWBHDF5IO("496035.05.02.nwb", "r")))'                      
[VectorIndex/description (general/intracellular_ephys/sweep_table/series_index): argument missing, VectorIndex (general/intracellular_ephys/sweep_table/series_index): incorrect type - expected 'uint8', got 'int32', VectorIndex/description (general/intracellular_ephys/sweep_table/series_index): argument missing, VectorIndex (general/intracellular_ephys/sweep_table/series_index): incorrect type - expected 'uint8', got 'int32']
and that is the ones we consider to be errors and report by `dandi validate` ```shell $> dandi validate 496035.05.02.nwb 496035.05.02.nwb: 5 error(s) VectorIndex/description (general/intracellular_ephys/sweep_table/series_index): argument missing VectorIndex (general/intracellular_ephys/sweep_table/series_index): incorrect type - expected 'uint8', got 'int32' VectorIndex/description (general/intracellular_ephys/sweep_table/series_index): argument missing VectorIndex (general/intracellular_ephys/sweep_table/series_index): incorrect type - expected 'uint8', got 'int32' Required field 'subject_id' has no value Summary: Validation errors in 1 out of 1 files ``` the last one is "dandi specific" since we do require subject_id value.

so why is such a difference and either we should somehow become more lenient in dandi?

edit 1: FWIW the same with load_namespaces=True ```shell $> python3 -c 'import pynwb; print(pynwb.validate(pynwb.NWBHDF5IO("496035.05.02.nwb", "r", load_namespaces=True)))' /home/yoh/proj/dandi/dandi-cli/venvs/dev3/lib/python3.8/site-packages/hdmf/spec/namespace.py:470: UserWarning: ignoring namespace 'hdmf-common' because it already exists warn("ignoring namespace '%s' because it already exists" % ns['name']) /home/yoh/proj/dandi/dandi-cli/venvs/dev3/lib/python3.8/site-packages/hdmf/spec/namespace.py:470: UserWarning: ignoring namespace 'core' because it already exists warn("ignoring namespace '%s' because it already exists" % ns['name']) [VectorIndex/description (general/intracellular_ephys/sweep_table/series_index): argument missing, VectorIndex (general/intracellular_ephys/sweep_table/series_index): incorrect type - expected 'uint8', got 'int32', VectorIndex/description (general/intracellular_ephys/sweep_table/series_index): argument missing, VectorIndex (general/intracellular_ephys/sweep_table/series_index): incorrect type - expected 'uint8', got 'int32'] ```
t-b commented 3 years ago

You can also just tell the validation routine to not use the cached spec and the validation error can be reproduced:

$ python -m pynwb.validate --no-cached-namespace 496035.05.02.nwb
Validating 496035.05.02.nwb against pynwb namespace information using namespace core.
 - found the following errors:
VectorIndex/description (general/intracellular_ephys/sweep_table/series_index): argument missing
VectorIndex (general/intracellular_ephys/sweep_table/series_index): incorrect type - expected 'uint8', got 'int32'
VectorIndex/description (general/intracellular_ephys/sweep_table/series_index): argument missing
VectorIndex (general/intracellular_ephys/sweep_table/series_index): incorrect type - expected 'uint8', got 'int32'
(base)
thomas@Win10-Thomas-PC MSYS /e
$ python -m pynwb.validate --cached-namespace 496035.05.02.nwb
Validating 496035.05.02.nwb against cached namespace information using namespace ndx-mies.
 - no errors found.
(base)
t-b commented 3 years ago

The series_index validation issue was introduced in https://github.com/hdmf-dev/hdmf-common-schema/commit/2851b46ae0025ceac9a0b35e933649cce90ab40f. Changing from a signed to unsigned integer is strictly speaking a breaking change.

I would say that the files MIES creates are still valid as they use a cached schema. But of course I can be persuaded otherwise.

t-b commented 3 years ago

Another odd thing is that the old type (int32) can hold all values of uint8 so even when validating against the stock pynwb schema this should not give a warning. AFAIR we allow bigger types instead of smaller ones.

t-b commented 3 years ago

@yarikoptic

The validation error

Failed to read metadata: Could not construct IntracellularElectrode object due to: IntracellularElectrode.__init__: incorrect type for 'description' (got 'bytes', expected 'str')

is from dandi or?

Can you explain what is wrong here?

The HDF5 file type of general/intracellular_ephys/electrode_0/description is text according to the schema 1. In the HDF5 file I'm storing that as

      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }

which is supported in all the other places where text is set as type.

t-b commented 3 years ago

Sample file: HardwareTests-V2-MIES-187d9143.zip

t-b commented 3 years ago

@yarikoptic I'm doing validation using pynwb with subprocess, see https://github.com/AllenInstitute/MIES/blob/bcb8f4ff185394729f99f5b31b3dfb9f18acbd59/tools/nwb-read-tests/nwbv2-read-test.py#L22. This at least avoids that pynwb loads the stock schema. The code might be brittle as I'm only using that in a docker container without real world users.

t-b commented 3 years ago

@yarikoptic @rly How do we proceed here? Should the fix be in pynwb or in dandi?

rly commented 3 years ago

@t-b @yarikoptic To summarize, if I understand correctly, the file is valid against the schema that is cached, but the schema was updated and the file is not valid against the latest schema. dandi validate uses the pynwb.validate(...) function which loads the schema from the active PyNWB installation, which may or may not be the latest schema or the one cached by the file. python -m pynwb.validate validates differently and provides more options than the pynwb.validate(...) function, including allowing validation against the cached schema (default).

In my opinion, DANDI should set a requirement for the minimum NWB version for validation, and it should validate against the cached schema if the version is >= the minimum requirement. It should not validate against the latest schema because older files are not guaranteed to be valid against the latest schema.

After upload of a given dandiset, DANDI should verify that all files in the dandiset use the same schema version of NWB and then display that version on the dandiset page so that users know whether the files can be read by tools that have a min version requirement.

Separately, PyNWB should include an option in the pynwb.validate(...) function where validation is done against the cached schema as opposed to the schema from the PyNWB installation. For the time being, DANDI should use the command-line interface python -m pynwb.validate to validate a file against the cached schema. If that is unworkable or inconvenient, I can provide sample code to run before pynwb.validate(...) to get it to validate against the cached schema.

rly commented 3 years ago
Failed to read metadata: Could not construct IntracellularElectrode object due to: IntracellularElectrode.__init__: incorrect type for 'description' (got 'bytes', expected 'str')

This looks like an issue with compatibility with h5py 3.0 where strings are read as bytes instead of str. Which version of h5py was installed?

t-b commented 3 years ago

This looks like an issue with compatibility with h5py 3.0 where strings are read as bytes instead of str. Which version of h5py was installed?

Yes this was a h5py issue indeed, going back to 2.10.0 solved it.