OBOFoundry / purl.obolibrary.org

A system for managing OBO PURLs
BSD 3-Clause "New" or "Revised" License
75 stars 128 forks source link

Add schema checker for YAML files #267

Open kltm opened 7 years ago

kltm commented 7 years ago

We recently had an outage due to improper redirects. There should be schema checking for all files.

https://github.com/geneontology/minerva/issues/77

https://github.com/OBOFoundry/purl.obolibrary.org/commit/9994118a79124ffa80f128f7921f0698f284a429

jamesaoverton commented 7 years ago

Every YAML config file is checked against the config.schema.yml, and in a few places it checks the first part of a URI. Unfortunately the checks don't catch this particular problem. Suggestions are welcome.

cmungall commented 7 years ago

This was the offending PR:

https://github.com/OBOFoundry/purl.obolibrary.org/pull/264

It passed Travis-CI. Note a schema check wouldn't help here as it adhered to the schema.

It's also hard to test for redirect URLs in the general case, since a partial redirect is not guaranteed to work without a suffix.

We could do some preliminary checking on the URL to avoid errors like http:/// (three slashes) but it's hard to anticipate the dumb errors that users like @cmungall might make.

We do have a very nice test mechanism. We would do more to encourage (force?) cfgs to have tests, especially for infrastructure important ontologies like this one. That would have prevented this.

kltm commented 7 years ago

kwalify allows a regexp on fields; what are you using?

jamesaoverton commented 7 years ago

We're using kwalify with a regex, but the particular regex we're currently using did not reject that triple-slash: https://github.com/OBOFoundry/purl.obolibrary.org/blob/master/tools/config.schema.yml#L70

kltm commented 7 years ago

We now have something like this for something similar:

/^((ht|f)tp(s?)\:\/\/\w[\/\.\-\:\w]+)|(GOC\:[\w\_]+)$/

Maybe you could have something like:

/^(https?|ftp)\:\/\/\w[\/\.\-\:\w]+$/
kltm commented 7 years ago

For at least the slashes, I have put in a PR here: https://github.com/OBOFoundry/purl.obolibrary.org/pull/268

I also noticed a few places that could probably be improved, but I'm unsure if there are corner cases that would not work. For example the schema check for "products" has anything allowable, but it seems like it could at least be matched by something like:

/^[\w]+\:[\w]+\:\s*(https?|ftp)\:\/\/[a-zA-Z0-9][\/\.\-\:\?\=\&\#\%\!\$\~\w]+$/"

Are there cases, now or planned, where that wouldn't work?

Also, I've noticed that while

kwalify -f ./tools/config.schema.yml ./config/ddanat.yml

works, the use of "-E" causes failure

kwalify -E -f ./tools/config.schema.yml ./config/ddanat.yml

Looking at ddanat.yml, I'm actually a little sure how that works in the first place--it seems like somehow the non-"-E" version isn't catching what I'm seeing as bad YAML. I'd think that

products:
- ddanat.owl: http://ontologies.berkeleybop.org/ddanat.owl

should look like

products:
  - ddanat.owl: http://ontologies.berkeleybop.org/ddanat.owl

Again, with the latter being caught by the "-E" version of the command in the tests.

jamesaoverton commented 7 years ago

I agree that the keys and values for products should be checked. That's an oversight.

In this repo we're consistently using the unindented lists from your second-last example. All the tools I've tried handle it. I don't know why it would fail with the "-E"option.

kltm commented 7 years ago

Looking at products specifically in the schema, I was going to put in a regexp like:

/^[a-zA-Z0-9\.]+\:\s+(https?|ftp)\:\/\/[a-zA-Z0-9][\/\.\-\:\?\=\&\#\%\!\$\~\w]+$/"

However, that will not work as the structure that you're using is (to me) very odd. What you have here, for example looking at ro.yml as an example, is a sequence of single item objects. Is there any reason for this? Why have:

products:
  - ro.owl: https://raw.githubusercontent.com/oborel/obo-relations/master/ro.owl
  - ro.obo: https://raw.githubusercontent.com/oborel/obo-relations/master/ro.obo
  - ro.html: https://github.com/oborel/obo-relations

instead of the flatter

products:
  ro.owl: https://raw.githubusercontent.com/oborel/obo-relations/master/ro.owl
  ro.obo: https://raw.githubusercontent.com/oborel/obo-relations/master/ro.obo
  ro.html: https://github.com/oborel/obo-relations

Unless there is a reason I'm missing, it would make checking easier to just have the flatter structure. Or, if one wanted to check it like a string (e.g. my regexp)

products:
  "ro.owl: https://raw.githubusercontent.com/oborel/obo-relations/master/ro.owl"
  "ro.obo: https://raw.githubusercontent.com/oborel/obo-relations/master/ro.obo"
  "ro.html: https://github.com/oborel/obo-relations"
kltm commented 7 years ago

For kwalify and the -E issue, I looked a little at the YAML docs, and I don't see how it's legal to have the spacing as it is in the files right now. It feels a bit like resilience in the parsers, but maybe I'm missing something. For example, round-tripping ddanat.yml with standard nodejs tools (yaml2json and json2yaml) gives:

idspace: DDANAT
base_url: /obo/ddanat
products:
  - {ddanat.owl: 'http://ontologies.berkeleybop.org/ddanat.owl'}
term_browser: ontobee
example_terms:
  - DDANAT_0000002

(Note the object produced, re: https://github.com/OBOFoundry/purl.obolibrary.org/issues/267#issuecomment-264611173) The zero-spacing in the original just seems so odd. As well, I tried initially round-tripping ro.yml, but the parsing failed with the yaml2json, which is probably not a good sign.

cmungall commented 7 years ago

-1 to the 3rd.

I have no strong options between the equivalent 1 and 2, would need a good reason to switch all the configs and generator code etc. Wary of over-optimizing for kwalify which I don't love that much. If it's too hard to validate in kwalify a python script may be most declarative, portable and extendible. Could also be combined with additional tests - e.g. does the URL resolve.

cmungall commented 7 years ago

I guess 1 is kind of weird. YAML syntax masks it but it is odd.

jamesaoverton commented 7 years ago

Thanks @kltm, you make good points.

I'm not happy with the current format, but I'm not in a hurry to change it, either. For now, we can do better URL checking in the Python.

The config format started as a simple list, then grew more complex than planned https://github.com/OBOFoundry/purl.obolibrary.org/issues/100#issuecomment-159391144. I prototyped a simpler system #106. We might still want to merge the PURLs with the registry. I don't have time for big changes right now. Maybe our grant will come through. :^)