NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

validator erroneously catches non-vector experimenter and related_publications #1090

Open bendichter opened 4 years ago

bendichter commented 4 years ago

Description

We recently changed the schema so that experimenter and related_publications are vectors. Our PyNWB I/O maps were changed to accept either form, but our validator is based entirely on the schema, so it outputs validation errors for older files that store these fields as single strings instead of vectors of strings.
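
For context, a minimal sketch of the two forms (assuming the NWBFile constructor mirrors the I/O maps in accepting either; all argument values are placeholders):

from datetime import datetime, timezone
from pynwb import NWBFile

# Older files stored experimenter as a single string; the schema now
# specifies a vector of strings. The I/O layer accepts both forms.
old_style = NWBFile(session_description='demo', identifier='demo-old',
                    session_start_time=datetime.now(timezone.utc),
                    experimenter='Doe, Jane')
new_style = NWBFile(session_description='demo', identifier='demo-new',
                    session_start_time=datetime.now(timezone.utc),
                    experimenter=['Doe, Jane', 'Roe, Richard'])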

Steps to Reproduce

from pynwb import validate, NWBHDF5IO

with NWBHDF5IO('old_file.nwb', 'r') as io:
    errors = validate(io)


t-b commented 4 years ago

validator is based entirely on the schema and will output validation errors for older files that use strings instead of vectors of strings for these fields.

Yes, and isn't that how it is supposed to be? The schema is the reference.

bendichter commented 4 years ago

Yes, I was half explaining the divergent behavior for others and half wondering what everyone thinks about this. It seems non-ideal to throw validation errors for older files that were compliant with the spec when they were created and can be read perfectly well in PyNWB and MatNWB.

oruebel commented 4 years ago

We should be careful here not to mix backward compatibility of the API with backward compatibility of the schema. Whether the API can open a file and whether the file is compliant with the latest schema are two separate questions.

For the validator, I think we have to take the schema as ground truth. If we have a 2.1 file, then we should validate against the 2.1 schema. In that sense, we should compare the version of the file with the version of the schema we are validating against and issue a warning if they don't match, to let users know that validating against a different version may produce errors due to changes in the schema.
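
A sketch of that check (the helper below is hypothetical; PyNWB exposes no such function):

import warnings

def warn_on_version_mismatch(file_version, validator_version):
    # file_version: value of the file's nwb_version attribute
    # validator_version: version of the schema being validated against
    if file_version != validator_version:
        warnings.warn(
            "File was written for NWB schema %s but is being validated "
            "against %s; differences between the versions may produce "
            "spurious validation errors." % (file_version, validator_version))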

bendichter commented 4 years ago

@oruebel I agree these are two separate issues. While we aim for the API to support all NWB 2.0+ data, it could be important for a user to know precisely which version of the schema their file is valid against. I could see tool support for NWB where a tool specifies which versions of the NWB schema it supports (much as pip packages declare supported Python versions). In that case, I could imagine a user wanting to know the validity of a file against a number of schema versions, not just the latest one. What should the workflow for this be? Should the validator provide a way to specify a version of the schema to validate against?
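
As a purely hypothetical illustration of that workflow (the validator had no such version argument at the time; validate_against is a stand-in, not a real API):

SUPPORTED_SCHEMA_VERSIONS = ['2.0.2', '2.1.0']  # illustrative values

def supported_versions_valid_for(nwb_path, validate_against):
    # validate_against(path, version) is a hypothetical callable that
    # returns the list of validation errors (empty when valid).
    return [v for v in SUPPORTED_SCHEMA_VERSIONS
            if not validate_against(nwb_path, v)]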

t-b commented 4 years ago

@bendichter @oruebel

Just to help me understand our support: IIRC we agreed at the last hackathon at Janelia that we would strive to include the schema in the NWB file when writing. That got implemented in https://github.com/hdmf-dev/hdmf/pull/84.

So every NWB file created after that change should, by default, contain the schema it was written with. And the validation tool uses the shipped schema if present.

So if the schema changes in an incompatible way, do old NWB files with an included schema still validate against the latest pynwb/hdmf schema version?

And pynwb 2.x can read all NWB data of version 2.x?

oruebel commented 4 years ago

we strive to include the schema into the NWB file when writing.

Correct and MatNWB also implements this now.

So every NWB file created after that should have the schema it was written with by default.

Correct.

And the validation tool uses the shipped schema if present.

I'm not sure what the default behavior of the validator is right now, i.e., whether it uses the schema from the file or the one installed with PyNWB, but I agree that the default behavior should be to validate against the schema in the file.

And pynwb 2.x can read all NWB data of version 2.x?

Yes. One minor detail: PyNWB and the schema are versioned separately. I.e., PyNWB 1.x will be able to read any 2.x file as long as 2.x is less than or equal to the schema version that PyNWB ships with.

So if the schema changes in an incompatible way old nwb files with included schema still validate against the latest pynwb/hdmf/schema version or?

If the schema changes in an incompatible way, then (at least currently) the old file will validate against the schema version it was generated with, but not necessarily against the latest schema, since the incompatible changes would show up as errors.

bendichter commented 4 years ago

Yes, we cache the schema by default and recommend that users keep that default, but we do not require it, so unless we want to make it required we'll need to handle the case where the schema is not cached. We also need to account for files that were written before caching the schema was an option.
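
For reference, caching is controlled at write time via the cache_spec flag of NWBHDF5IO.write (a minimal sketch; the file name and metadata are placeholders):

from datetime import datetime, timezone
from pynwb import NWBFile, NWBHDF5IO

nwbfile = NWBFile(session_description='demo', identifier='demo-cache',
                  session_start_time=datetime.now(timezone.utc))
with NWBHDF5IO('new_file.nwb', 'w') as io:
    io.write(nwbfile, cache_spec=True)  # embed the schema in the file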

t-b commented 4 years ago

@oruebel

I'm not sure what the default behavior of the validator is right now, i.e., whether it uses the schema form the file or from the version installed with PyNWB, but I agree that the default behavior should be to validate against the schema in the file.

$ python -m pynwb.validate --help
usage: validate.py [-h] [-p NSPATH] [-n NS]
                   [--cached-namespace | --no-cached-namespace]
                   paths [paths ...]

Validate an NWB file

positional arguments:
  paths                 NWB file paths

optional arguments:
  -h, --help            show this help message and exit
  -p NSPATH, --nspath NSPATH
                        the path to the namespace YAML file
  -n NS, --ns NS        the namespace to validate against
  --cached-namespace    Use the cached namespace (default).
  --no-cached-namespace
                        Don't use the cached namespace.

use --nspath to validate against an extension. If --ns is not specified,
validate against all namespaces in namespace file.

I made that change with Andrew's help.
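
For example, validating a file against its cached namespace (illustrative file name):

$ python -m pynwb.validate --cached-namespace old_file.nwb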

oruebel commented 4 years ago

we'll need to handle the case that the schema is not cached

Currently, I believe the approach would be that the user checks out the appropriate version of the schema from the nwb-schema repo (if the schema is not cached) and uses that for validation (or installs the corresponding version of PyNWB). Alternatively, we'd need to ship all releases of the schema with PyNWB.
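
Roughly, that manual workflow might look like the following (the tag name is illustrative, and whether --nspath accepts the core namespace file this way is an untested assumption):

$ git clone https://github.com/NeurodataWithoutBorders/nwb-schema
$ git -C nwb-schema checkout 2.1.0
$ python -m pynwb.validate --nspath nwb-schema/core/nwb.namespace.yaml old_file.nwb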

t-b commented 4 years ago

@oruebel How does the user know which schema version to fetch?

oruebel commented 4 years ago

How does the user know which schema version to fetch?

That is in the nwb_version attribute of the file.

bendichter commented 4 years ago

We can access that using h5py or potentially h5dump, but I couldn't find a way to access nwb_version using PyNWB.
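
For example, with h5py (a minimal sketch; the file name is a placeholder):

import h5py

with h5py.File('old_file.nwb', 'r') as f:
    # nwb_version is stored as an attribute on the root group
    print(f.attrs['nwb_version'])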

t-b commented 4 years ago

@oruebel Good. So in principle we could add support for the validator to fetch the appropriate schema from GitHub and use that for validation. I'm not volunteering though ;)
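
A sketch of such a fetch, assuming schema releases are available as Git tags of the form shown (this is not an existing PyNWB feature):

import urllib.request

def fetch_schema_release(version, dest):
    # Download the tagged source archive of the nwb-schema repo.
    url = ('https://github.com/NeurodataWithoutBorders/nwb-schema/'
           'archive/refs/tags/%s.tar.gz' % version)
    urllib.request.urlretrieve(url, dest)

fetch_schema_release('2.1.0', 'nwb-schema-2.1.0.tar.gz')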

oruebel commented 4 years ago

So in principle we could add support for the validator to fetch the appropriate schema from github

Yes, I believe that is true, although it may be simpler to ship the different versions of the schema as package data with PyNWB directly rather than fetching them from a remote. @ajtritt @rly

yarikoptic commented 4 years ago

Yes, I believe that is true. Although it may be simpler to ship the different versions of the schema as package data with PyNWB directly

FWIW, that was my unverbalized idea in #1091 for how to deal with multiple versions of the schema. Since they are largely identical text files, adding them directly into git should compress nicely. I don't think you would want to breed a submodule per version.

oruebel commented 4 years ago

I don't think you would want to breed submodules per each version

The submodule is used for active development, i.e., to develop the schema and API in conjunction. Prior versions of the schema that have been released do not change as part of active development. Another question is: what do we need prior versions of the schema for? Currently the focus of this issue is squarely on validation (which is probably the main use case).

But I think we are jumping ahead here a bit. First we need to resolve:

  1. How do we want to deal with different NWB versions on validation? I believe the answer here is that we should, by default, validate against the version of the schema the file was created for.
  2. Do we need to ship prior versions of the schema with the API? Since this is only for validation, I don't think it is strictly needed. We could just ask the user to download the appropriate version of the schema if necessary, or, going back to @t-b's comment, we could fetch them on the fly from GitHub if needed. One should only need to do this in rare cases anyway, since the majority of files should ship with the schema included.

rly commented 4 years ago

I am in favor of shipping prior versions of the schema with the API, even though they are only useful for validation. Doing so avoids the need for a validation tool to connect to the internet and for us to write tooling to handle download and file management. The cost is that the package would be slightly larger: the current schema is 82 KB (nwb.core) + 5 KB (hdmf.common).
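
A minimal sketch of what shipping the schemas as package data could look like with setuptools (the paths and layout are illustrative, not PyNWB's actual packaging):

from setuptools import setup

setup(
    name='pynwb',
    packages=['pynwb'],
    # One directory of YAML specs per released schema version (hypothetical layout).
    package_data={'pynwb': ['nwb-schema/*/core/*.yaml']},
)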

oruebel commented 4 years ago

I am in favor of shipping prior versions

I think that's fine. It would be good to have that happen as part of either the setup or packaging process rather than as direct copies in the git repo.

yarikoptic commented 4 years ago

My 1c as an outsider: I would vote for the opposite (just keeping copies of the different versions in git), since git would compress them efficiently and the setup/packaging would not need to be unnecessarily complicated and thus potentially more fragile.

yarikoptic commented 4 years ago

FWIW, for dandi validate I worked around this as in https://github.com/dandi/dandi-cli/commit/9f2178f7005484da32412e638b75d33ac23dc4c1#diff-7e2d602e456be540d8ee58ead9ac917dR101

t-b commented 4 years ago

EDIT: wrong issue.