bendichter opened this issue 4 years ago
validator is based entirely on the schema and will output validation errors for older files that use strings instead of vectors of strings for these fields.
Yes, and that is how it is supposed to be, right? The schema is the reference.
Yes, I was half explaining the divergent behavior for others and half wondering what everyone thought about this. It seems non-ideal to throw validation errors for older files that were according to spec when they were created and can be read perfectly well in pynwb and matnwb.
We should be careful here not to mix backward compatibility of the API with backward compatibility of the schema. Testing whether the API can open a file or whether it is compliant with the latest schema are two separate questions.
For the validator, I think we have to take the schema as ground truth for validation. If we have a 2.1 file, then we should validate against the 2.1 schema. In that sense, we should check the version of the file and the schema we are validating against and issue a warning if they don't match, to let the users know that validation against different versions may result in warnings due to changes in the schema.
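The mismatch warning proposed above could look something like the following sketch. The function name `check_version_match` is hypothetical, not part of the pynwb API; it only illustrates the check being described.

```python
import warnings

def check_version_match(file_version: str, schema_version: str) -> bool:
    """Warn if the file's NWB version differs from the schema version
    used for validation. Returns True when the versions match."""
    if file_version != schema_version:
        warnings.warn(
            f"File was written with NWB schema {file_version} but is being "
            f"validated against schema {schema_version}; differences between "
            f"the versions may produce spurious validation errors."
        )
        return False
    return True
```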
@oruebel I agree these are two separate issues. While we aim for the API to support all NWB 2.0+ data, it could be important for a user to know precisely which version of the schema their file is valid against. I could see tool support for NWB where a tool would specify what versions of the NWB schema it supports (like pip versions). In that case, I could imagine a user wanting to know the validity of a file against a number of schemas, not just the latest one. What should be the workflow for this? Should the validator provide a way to specify a version of the schema to validate against?
@bendichter @oruebel
Just to help me understand our support: IIRC we agreed at the last hackathon at Janelia that we strive to include the schema in the NWB file when writing. And that got implemented in https://github.com/hdmf-dev/hdmf/pull/84.
So every NWB file created after that should have the schema it was written with by default. And the validation tool uses the shipped schema if present.
So if the schema changes in an incompatible way, do old NWB files with an included schema still validate against the latest pynwb/hdmf/schema version?
And pynwb 2.x can read all NWB data of version 2.x?
we strive to include the schema in the NWB file when writing.
Correct and MatNWB also implements this now.
So every NWB file created after that should have the schema it was written with by default.
Correct.
And the validation tool uses the shipped schema if present.
I'm not sure what the default behavior of the validator is right now, i.e., whether it uses the schema from the file or from the version installed with PyNWB, but I agree that the default behavior should be to validate against the schema in the file.
And pynwb 2.x can read all NWB data of version 2.x?
Yes. One minor detail: PyNWB and the schema are versioned separately. I.e., PyNWB 1.x will be able to read any NWB 2.x file as long as 2.x is less than or equal to the schema version that PyNWB ships with.
So if the schema changes in an incompatible way, do old NWB files with an included schema still validate against the latest pynwb/hdmf/schema version?
If the schema changes in an incompatible way then (at least currently) this means that the old file will validate against the schema version it was generated with, but not necessarily against the latest schema as the incompatible changes would be seen as errors.
Yes, we cache the schema by default, and recommend that users keep that default, but do not require it, so unless we want to make caching required, we'll need to handle the case that the schema is not cached. We also need to account for files that were written before caching the schema was an option.
@oruebel
I'm not sure what the default behavior of the validator is right now, i.e., whether it uses the schema from the file or from the version installed with PyNWB, but I agree that the default behavior should be to validate against the schema in the file.
$ python -m pynwb.validate --help
usage: validate.py [-h] [-p NSPATH] [-n NS]
                   [--cached-namespace | --no-cached-namespace]
                   paths [paths ...]

Validate an NWB file

positional arguments:
  paths                 NWB file paths

optional arguments:
  -h, --help            show this help message and exit
  -p NSPATH, --nspath NSPATH
                        the path to the namespace YAML file
  -n NS, --ns NS        the namespace to validate against
  --cached-namespace    Use the cached namespace (default).
  --no-cached-namespace
                        Don't use the cached namespace.

use --nspath to validate against an extension. If --ns is not specified,
validate against all namespaces in namespace file.
I made that change with Andrew's help.
we'll need to handle the case that the schema is not cached
Currently, I believe the approach for this would be that the user would have to check out the appropriate version of the schema from the nwb-schema repo (if the schema is not cached) to use for validation (or install the corresponding version of PyNWB). Alternatively, we'd need to ship all releases of the schema with PyNWB.
@oruebel How does the user know which schema version to fetch?
How does the user know which schema version to fetch?
That is in the nwb_version attribute of the file
We can access that using h5py or potentially h5dump, but I couldn't find a way to access nwb_version using pynwb
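A minimal sketch of the h5py route mentioned above (assuming h5py is installed): the `nwb_version` attribute lives on the root group of the HDF5 file, so it can be read without going through pynwb at all.

```python
import h5py

def get_nwb_version(path):
    """Read the nwb_version attribute from the root group of an NWB file."""
    with h5py.File(path, "r") as f:
        version = f.attrs["nwb_version"]
        # h5py may return bytes depending on how the attribute was written
        if isinstance(version, bytes):
            version = version.decode("utf-8")
        return version
```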
@oruebel Good. So in principle we could add support for the validator to fetch the appropriate schema from github and use that for validation. I'm not volunteering though ;)
So in principle we could add support for the validator to fetch the appropriate schema from github
Yes, I believe that is true. Although it may be simpler to ship the different versions of the schema as package data with PyNWB directly, rather than fetching them from a remote. @ajtritt @rly
Yes, I believe that is true. Although it may be simpler to ship the different versions of the schema as package data with PyNWB directly
FWIW, that was my non-verbalized idea in #1091 for how to deal with multiple versions of the schema. Since they are largely identical text files, adding them directly into git should compress nicely. I don't think you would want to breed submodules for each version.
I don't think you would want to breed submodules for each version
The submodule is used for active development, i.e., to develop the schema and API in conjunction. Prior versions of the schema that have been released are not changing as part of the active development. Another question is, what do we need prior versions of the schema for? Currently the focus in this issue is squarely on validation (which is probably the main use-case for this).
But I think we are jumping ahead here a bit. First we need to resolve:
I am in favor of shipping prior versions of the schema with the API, even though it is only useful for validation. Doing so prevents the need for a validation tool to connect to the internet and for us to write tools to handle download and file management. The cost is that the file sizes would be slightly larger. The current schema is 82 KB (nwb.core) + 5 KB (hdmf.common).
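If prior schema releases were shipped as package data, the validator would need a way to pick the right one for a given file version. The sketch below is purely illustrative: the `BUNDLED_SCHEMAS` mapping and directory layout are hypothetical, not part of PyNWB.

```python
# Hypothetical mapping from NWB schema version to a bundled data directory.
BUNDLED_SCHEMAS = {
    "2.0.0": "nwb-schema/2.0.0",
    "2.1.0": "nwb-schema/2.1.0",
    "2.2.0": "nwb-schema/2.2.0",
}

def pick_schema_dir(file_version: str) -> str:
    """Return the bundled schema directory for a file's NWB version,
    or raise if that version was never shipped."""
    try:
        return BUNDLED_SCHEMAS[file_version]
    except KeyError:
        raise ValueError(
            f"No bundled schema for NWB version {file_version}; "
            f"available: {sorted(BUNDLED_SCHEMAS)}"
        )
```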
I am in favor of shipping prior versions
I think that's fine. I think it would be good to have that as part of either the setup or packaging process rather than as direct copies in the git repo.
My 1c as an outsider: I would vote for the opposite (just keeping copies of the different versions in git), since git would compress them efficiently and setup/packaging wouldn't need to be unnecessarily overcomplicated and thus possibly more fragile.
FWIW, for dandi validate
I worked around it as in https://github.com/dandi/dandi-cli/commit/9f2178f7005484da32412e638b75d33ac23dc4c1#diff-7e2d602e456be540d8ee58ead9ac917dR101
EDIT: wrong issue.
Description
We recently changed the schema to accept a vector for experimenter and for related_publications. Our pynwb io maps were changed to accept either form, but our validator is based entirely on the schema and will output validation errors for older files that use strings instead of vectors of strings for these fields.
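The kind of normalization the io maps perform can be illustrated with a small sketch. This is not the actual pynwb implementation; `normalize_to_tuple` is a hypothetical helper showing how a scalar string (older files) and a vector of strings (current schema) can both be accepted for fields like experimenter.

```python
def normalize_to_tuple(value):
    """Coerce a scalar string or an iterable of strings to a tuple,
    passing None through unchanged."""
    if value is None:
        return None
    if isinstance(value, str):
        return (value,)
    return tuple(value)
```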
Steps to Reproduce
Environment
Checklist