EDIorg / data-package-best-practices

Best Practices for data packages. a gh-pages website, with sections for metadata concepts and aspects of data packaging
https://ediorg.github.io/data-package-best-practices/

schemaLocation details #70

Open mbjones opened 3 years ago

mbjones commented 3 years ago

Great guide!

I note that section 4.1 recommends using schemaLocation with the XPath /eml:eml/@schemaLocation to help clients learn where to download schemas. Two issues:

  1. The XPath should be /eml:eml/@xsi:schemaLocation, as the attribute is part of the xsi namespace. The xsi namespace would also need to be declared in an xmlns:xsi attribute on the root element.
  2. Technically, this attribute is truly an optional hint and can lead to security issues for clients that follow the location URI, which can lead to an XML injection attack. I think "best practice" would be for clients to provide their own, verified copies of the schema. At a minimum, the best practice should be that the schemaLocation URI used is the official EML namespace location at https://eml.ecoinformatics.org/eml-2.2.0
    • In addition, if you trust the xsi:schemaLocation for a document from elsewhere, it could point at a modified version of the schema, which might make the document actually invalid with respect to the official schemas. We have seen this frequently in DataONE with sites publishing their own variant schemas for ISO under the official namespace, and thereby losing the benefits of standardization. These nuances may be less compelling for people who want to quickly load a schema, so I understand if you want to keep the recommendation, but in our data centers we follow the best practice of explicitly omitting xsi:schemaLocation.

twhiteaker commented 3 years ago

We've been using xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://nis.lternet.edu/schemas/EML/eml-2.2.0/xsd/eml.xsd". Is that second part, the nis.lternet.edu portion, not needed?

Here's a more complete example. Hmm, I also notice we have @xmlns:eml with that same https://eml.ecoinformatics.org/eml-2.2.0 content, which seems redundant.

<eml:eml 
  xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2" 
  xmlns:d1v1="NULL" 
  packageId="knb-lter-ble.18.2" 
  xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://nis.lternet.edu/schemas/EML/eml-2.2.0/xsd/eml.xsd"
  system="ble"/>

mbjones commented 3 years ago

@twhiteaker Thanks for following up. The xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" associates the "eml" prefix with the right namespace, and is what allows the root <eml:eml> element (among others) to be properly namespaced. So it is needed.

The xsi:schemaLocation attribute takes two values: the namespace, and the schemaLocation URI. So, your example says that, whenever I find an element in the https://eml.ecoinformatics.org/eml-2.2.0 namespace, the parser can find the xsd file associated with that namespace at the location https://nis.lternet.edu/schemas/EML/eml-2.2.0/xsd/eml.xsd. This is the intended usage. But I would argue that it would be better to use: xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0", which says to use the official location for the xsd file. Or, better yet, omit it altogether for the reasons I cited above.
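
To make the pairing concrete, here is a rough sketch of the official-location variant suggested above, with the two tokens annotated. The attribute value is a whitespace-separated list of namespace/location pairs, so a document that uses several namespaces could carry several pairs. The packageId and system values are simply carried over from the earlier example for illustration, and the element is left empty as it was there.

```xml
<!-- xsi:schemaLocation holds whitespace-separated pairs:
       1st token = namespace URI (matches the xmlns:eml declaration)
       2nd token = location hint a parser MAY use to fetch the XSD
     The location below is the one proposed in this thread. -->
<eml:eml
  xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  packageId="knb-lter-ble.18.2"
  system="ble"
  xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0"/>
```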

twhiteaker commented 3 years ago

@mbjones Thanks for the clarification. I'm a minimalist so I'm all for omitting xsi:schemaLocation. If we do that...

> I think "best practice" would be for clients to provide their own, verified copies of the schema.

A client would be a program consuming EML. So, if a data publisher omits xsi:schemaLocation, they don't have to then provide a copy of the schema. It's up to the client. Did I get that right? I'm trying to determine what additional actions a data publisher may need to take if we omit xsi:schemaLocation.

Also, if we leave out xsi:schemaLocation, then we can also leave out xmlns:xsi, at least in my example above since I don't mention that namespace anywhere else.

mbjones commented 3 years ago

Hi @twhiteaker -- yeah, if you don't reference a namespace prefix like xsi in your document, then you can omit it.
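
As a sketch of what that would look like, the root element from the earlier example is shown here with both xsi:schemaLocation and xmlns:xsi removed. All other attributes are left exactly as they were; this assumes nothing else in the document uses the xsi prefix.

```xml
<!-- No xsi:schemaLocation, and therefore no xmlns:xsi declaration.
     xmlns:eml stays: it binds the "eml" prefix used by <eml:eml>. -->
<eml:eml
  xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
  xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2"
  xmlns:d1v1="NULL"
  packageId="knb-lter-ble.18.2"
  system="ble"/>
```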

And yes, I think you got it right on client responsibilities. In general, a client that is interpreting documents that it gets from the wild needs to control the schemas that are used to validate those documents. So the data providers' main job is to properly reference the namespace in their root element and in their document, and the client's job is to find a trusted copy of the schema that defines that namespace. Arbitrary URIs on the interwebs are not trusted sources of those schemas (we find many repositories that have made breaking changes to XSD documents and then posted them under the original namespace). So, our client tooling is built so that we provide our own copies of the schemas, which we get from the authoritative source (e.g., eml.ecoinformatics.org). Most client tools (like XML parsers and editors) have features to register your local trusted copy of an xsd for the tool to use (these are typically called "XML Catalogs").
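
For readers who have not used one, a minimal OASIS XML Catalog along these lines might look like the sketch below. It maps the EML 2.2.0 namespace to a local schema file. The file name catalog.xml and the relative path schemas/eml-2.2.0/eml.xsd are hypothetical, so substitute wherever your verified copy lives. Whether and how a tool consults uri entries to find a schema by namespace is tool-specific and may need to be enabled in its settings.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- catalog.xml (hypothetical name): points the EML 2.2.0 namespace
     at a locally stored, trusted copy of the schema. -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <uri name="https://eml.ecoinformatics.org/eml-2.2.0"
       uri="schemas/eml-2.2.0/eml.xsd"/>
</catalog>
```

Because eml.xsd pulls in the other EML module schemas by relative path, the local copy presumably needs to be the complete schema directory rather than the single file.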

twhiteaker commented 3 years ago

@cgries Do you know if leaving out schemaLocation or xmlns:xsi will break EDI's congruency checker?

cgries commented 3 years ago

@twhiteaker, no, EDI's congruency checker would not break, but it would add another step in Oxygen if you use that.

twhiteaker commented 3 years ago

I'm leaning toward omitting schemaLocation. Anyone in favor of using xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0" instead, or something else?

srearl commented 3 years ago

@twhiteaker CAP uses the trusted source that Matt detailed and that you put in your last comment (i.e., xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd"). I am intrigued by the suggestion to omit it altogether but will stick with that for the time being (and continue to think about this).

scelmendorf commented 3 years ago

Per @cgries's comment, I would find it mildly annoying to have it break Oxygen, as that is my preferred "my xml is not valid but I can't figure out what I did wrong" tool (I find the R EML pkg error messages not terribly informative). If you DO omit it, what is the workaround for using Oxygen? Can we add that to the BP? There also may be some overlap with the issues here: https://github.com/ropensci/EML/issues/292

mbjones commented 3 years ago

@scelmendorf When I am working in Oxygen with EML (and other) schemas, I configure Oxygen to use my local copy of the schemas, rather than trust that the document author provided a link to an unmodified version. Configuration is described here: https://www.oxygenxml.com/doc/versions/23.1/ug-editor/topics/using-XML-Catalogs.html If you set it up once, it will work with all EML documents, regardless of how people set schemaLocation.
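
One way to cover documents that do carry a schemaLocation hint pointing elsewhere is to redirect that location to the local copy in the catalog. The sketch below is an assumption about how this could be set up, not a description of any particular configuration: the remote prefix is taken from the example earlier in this thread, and the local path schemas/eml-2.2.0/ is hypothetical. Tools differ on whether they treat a schema location as a URI reference or a system identifier, so both rewrite forms are shown.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Redirect anything under the remote prefix to the local, trusted copy. -->
  <rewriteURI uriStartString="https://nis.lternet.edu/schemas/EML/eml-2.2.0/xsd/"
              rewritePrefix="schemas/eml-2.2.0/"/>
  <!-- Same redirect for tools that resolve the hint as a system identifier. -->
  <rewriteSystem systemIdStartString="https://nis.lternet.edu/schemas/EML/eml-2.2.0/xsd/"
                 rewritePrefix="schemas/eml-2.2.0/"/>
</catalog>
```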

twhiteaker commented 3 years ago

@scelmendorf and others with Oxygen concerns, does @mbjones's strategy work for you?

scelmendorf commented 3 years ago

Trying now: Most likely user error/failure to follow the instructions. But I added the schema to Oxygen under Preferences > XML > XML Catalog, then deleted the xsi:schemaLocation from the eml:eml root element in my test document to see how this works. It doesn't now appear to be validating the xml, e.g., I can put all sorts of bogus bits in there and it still says it's perfectly valid.