etf-validator / etf-webapp

ETF is an open source testing framework for spatial data and services
https://www.etf-validator.net
European Union Public License 1.2

Test suites are not able to access/use the imported schemas #197

Closed fabiovin closed 5 years ago

fabiovin commented 5 years ago

Description

Test suites are not able to access/use the schemas imported by the main schema declared in the attribute xsi:schemaLocation. For instance, if the metadata declares xsi:schemaLocation="http://www.isotc211.org/2005/gmx http://schemas.opengis.net/iso/19139/20070417/gmx/gmx.xsd", the schema validation fails. The same metadata validates successfully with XMLSpy. The metadata test suite is not able to parse the gmd schema imported by the gmx one.

The same error also occurs with the schema http://inspire.ec.europa.eu/schemas/inspire_vs/1.0/inspire_vs.xsd: the WMS test suite is not able to retrieve the imported schema (http://schemas.opengis.net/wms/1.3.0/capabilities_1_3_0.xsd).

Related discussion: https://github.com/inspire-eu-validation/ets-repository/issues/244

Steps to Reproduce

  1. Open the INSPIRE Validator and select the "Conformance class: Metadata for interoperability" under the Metadata Test suite.
  2. Upload the attached xml file and validate it.

Expected behavior: Successful schema validation (Schema validation: md-xml.a.1: validate XML documents).

Actual behavior: Error: "The dataset has 1 file(s) with errors for this assertion. XML document 'OSOpenNames_opengis.xml': The file has 1 schema validation error(s). XML document 'OSOpenNames_opengis.xml': 1:405: cvc-elt.1.a: Cannot find the declaration of element 'gmd:MD_Metadata'."

fabiovin commented 5 years ago

OSOpenNames_opengis.zip

cportele commented 5 years ago

I have not read through all of it, but I do not think this is an ETF error (*). If the schemaLocation is used in the validation (as it currently is in the INSPIRE ETSs), XML Schema requires that the namespace of the root element is explicitly included in the schemaLocation. This does not seem to be the case in the examples. That is, a gmd:MD_Metadata root element requires the declaration of the gmd namespace, and a wms:WMS_Capabilities root element requires the explicit declaration of the wms namespace. Some parsers may be more lax, or the validation tool may have switched off some of the checks, but I think the reported errors are correct.

(*) Even if it is, it would be an issue with the version of Xerces-J that is included with Java, and a fix would require a different implementation of XML Schema validation (which is planned, but not yet scheduled).
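
The condition described in this comment can be checked before a document is submitted for validation. The following is a minimal sketch, assuming the xsi:schemaLocation attribute sits on the root element; the helper names are hypothetical and are not part of ETF or the INSPIRE ETSs. It only reports whether the root element's namespace appears among the namespace/location pairs, and performs no schema validation itself.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def schema_location_pairs(root):
    """Return {namespace: location} from the root's xsi:schemaLocation."""
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    # The attribute value is a whitespace-separated list of
    # namespace/location pairs.
    return dict(zip(tokens[0::2], tokens[1::2]))


def root_namespace(root):
    """Return the namespace URI of the root element ('' if unqualified)."""
    return root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""


def root_namespace_is_listed(xml_text):
    """True if the root element's namespace has a schemaLocation entry."""
    root = ET.fromstring(xml_text)
    return root_namespace(root) in schema_location_pairs(root)


# Reduced version of the reported case: a gmd root element whose
# schemaLocation only lists the gmx namespace.
REPORTED = """<gmd:MD_Metadata
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.isotc211.org/2005/gmx http://schemas.opengis.net/iso/19139/20070417/gmx/gmx.xsd"/>"""

print(root_namespace_is_listed(REPORTED))  # False
```

A result of False would flag documents likely to trigger the "Cannot find the declaration of element" error with parsers that only follow hints for the namespace they are currently looking up.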

PeterParslow commented 5 years ago

Hi Clemens, sorry to be awkward, but could you point me to something that backs up "XML Schema requires that the namespace of the root element is explicitly included in the schemaLocation." I can't find that in

In fact, a note within an example at https://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/#xsi_schemaLocation implies the opposite: "The namespace names used in schemaLocation can, but need not be identical to those actually qualifying the element within whose start tag it is found or its other attributes. For example, as above, all schema location information can be declared on the document element of a document, if desired, regardless of where the namespaces are actually used".

cportele commented 5 years ago

Hi Peter, I do not remember the details, but that was the conclusion of an in-depth analysis of https://www.w3.org/TR/xmlschema-1/#conformance in OGC around 2007 or so in the context of the discussions around XML Schema profiles, namespaces, versioning, etc. As mentioned, I do not remember the details, but I think it was quite complex. I will see if I can still find some notes about this.

It may be that this has changed in XML Schema 1.1 (which you are referring to), but most of the OGC and TC 211 specs are using XML Schema 1.0 (as does INSPIRE). The only exception that I am aware of is KML 2.3.

PeterParslow commented 5 years ago

It seems unchanged between XML Schema 1.0 and 1.1. Even the example that I quote was there in the earlier document. If you can find your notes, I would be interested.

I can't see that any of the schemas in question define which version of XML Schema they use, but I don't think that is relevant here.

cportele commented 5 years ago

I was not successful in finding anything from that time right now; I will see if I can find time next week to dig a bit deeper. I think it was something along the lines that, the way the schema-validity assessment is specified (chapter 5), the parser will first access and validate the schemas (the gmx schema) listed in the xsi:schemaLocation attribute by following the URI, using its cache or any other allowed mechanism (sections 5.1 and 4.3.2), then try to identify an element declaration (in this case for the gmd:MD_Metadata element) "from the element declarations of the schema" (section 5.2). Since the gmx schema does not have such a declaration and at this stage in the validation process dependent schemas have not yet been processed, no such element declaration is found and this results in the reported error.

All of this is not very clear in the spec and the statements in the XML Schema spec in general are really hard to understand, but I think that was our conclusion at that time when analysing this type of validation error. A typical example was validating a gml:FeatureCollection document and only providing the GML application schema namespace in xsi:schemaLocation. This resulted in the same error and adding the GML namespace in the xsi:schemaLocation resolved these errors.
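
The workaround mentioned above, adding the root element's namespace to xsi:schemaLocation, can be sketched as a small string helper. This is an illustration only: the function name is hypothetical, and the gmd schema location used in the example is assumed by analogy with the gmx location quoted in the issue.

```python
def with_namespace_hint(schema_location, namespace, location):
    """Return a schemaLocation value that includes the given namespace.

    The value is a whitespace-separated list of namespace/location pairs;
    the pair is appended only if the namespace is not already listed.
    """
    tokens = schema_location.split()
    if namespace in tokens[0::2]:
        return " ".join(tokens)
    return " ".join(tokens + [namespace, location])


# schemaLocation value from the reported metadata record.
reported = (
    "http://www.isotc211.org/2005/gmx "
    "http://schemas.opengis.net/iso/19139/20070417/gmx/gmx.xsd"
)

# Assumption: the gmd schema lives next to the gmx one on schemas.opengis.net.
fixed = with_namespace_hint(
    reported,
    "http://www.isotc211.org/2005/gmd",
    "http://schemas.opengis.net/iso/19139/20070417/gmd/gmd.xsd",
)
print(fixed)
```

With the extra pair in place, a parser that only consults the hint for the root element's own namespace has a schema document to load for gmd:MD_Metadata.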

Regarding XSD versions: To indicate anything else but version 1.0 you typically use vc:minVersion (introduced after XML Schema 1.0). See https://www.w3.org/TR/xmlschema11-1/#cip or the KML example http://schemas.opengis.net/kml/2.3/ogckml23.xsd. But you are right, this is probably unchanged.

cportele commented 5 years ago

Peter, here is a link to an article from Rick Jelliffe (author of Schematron) that discussed this. It was part of the O'Reilly XML blog, but is no longer available. The Wayback machine still has it though:

https://web.archive.org/web/20060821202642/http://www.oreillynet.com/xml/blog/2006/08/why_we_will_always_have_proble.html

The issue is as discussed above with the statement that "the corresponding schema may be lazily assembled". The "lazy parser" may ignore the gmx schema reference in the root document as the root element is not in that namespace and throw an error as it does not have a definition of the element. The "eager parser" would simply try to load all schemas first and then assess the validity, which would include the gmd namespace. The XML documents should respect that parsers may use a lazy strategy.

I am closing the issue here as it is not an ETF issue.
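
The difference between the "lazy" and "eager" strategies described above can be illustrated with a toy model. This is a sketch under strong simplifying assumptions, not how Xerces-J or any other parser is actually implemented: schemas are reduced to a dictionary, namespaces are abbreviated to their usual prefixes, and the function names are hypothetical.

```python
# Toy registry: namespace -> globally declared elements and imported namespaces.
# Only one element per schema is listed, for illustration.
SCHEMAS = {
    "gmx": {"elements": {"Anchor"}, "imports": ["gmd"]},
    "gmd": {"elements": {"MD_Metadata"}, "imports": []},
}


def lazy_lookup(root_ns, root_name, hints):
    """Consult only the hint given for the root element's own namespace."""
    if root_ns not in hints:
        # No hint for this namespace, so no schema is loaded for it:
        # "Cannot find the declaration of element ..."
        return False
    return root_name in SCHEMAS[root_ns]["elements"]


def eager_lookup(root_ns, root_name, hints):
    """Load every hinted schema and its imports before any lookup."""
    loaded, todo = set(), list(hints)
    while todo:
        ns = todo.pop()
        if ns not in loaded:
            loaded.add(ns)
            todo.extend(SCHEMAS[ns]["imports"])
    return root_ns in loaded and root_name in SCHEMAS[root_ns]["elements"]


# Root element gmd:MD_Metadata, schemaLocation only mentions gmx.
print(lazy_lookup("gmd", "MD_Metadata", ["gmx"]))   # False
print(eager_lookup("gmd", "MD_Metadata", ["gmx"]))  # True

# With gmd listed as well, both strategies find the declaration.
print(lazy_lookup("gmd", "MD_Metadata", ["gmx", "gmd"]))  # True
```

In this model the same document passes or fails depending only on the loading strategy, which is the behaviour the thread attributes to the difference between XMLSpy and the validator.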

PeterParslow commented 5 years ago

I can't really see that quoting a blog is the same as something being in the specification.

I do see this as an ETS issue, because it results in "false validation failures": where the instance is demonstrably valid XML, but because of the choice of validation approach, validation fails.

I can understand that this is an issue that 'we' agree not to fix, because it is in the 'off the shelf' validating parser. Although arguably, the choice of parser is up to 'us'.

Do we want to:

a) accept - and note for users - that this is an example of where instances may fail validation but still be valid? We always knew that that could be the case, but I think it would be helpful to users to know that we are aware of (some of) the reasons - otherwise, each user facing this validation failure then has to investigate why the failure has occurred in order to decide whether to 'fix something' or prepare their organisational statement as to why they consider the instance to be a valid INSPIRE metadata record.

It would help those users to know something like 'if your instance validates with another W3C conformant XML validating parser, then it may still be a valid INSPIRE metadata record'

OR

b) amend the Technical Guidance to 'insist' that metadata records contain an xsi:schemaLocation with this specific characteristic. In which case this would no longer be an ETS issue, but one of those cases where the metadata instance does not conform to the technical guidance - and of course the organisation (or country) is still allowed to state why they consider it to be valid.

Effectively, do we want to document this issue in the Technical Guidance, in the ETS (somewhere e.g. the validation report), or just put the work on all the metadata editors to discover the issue?

cportele commented 5 years ago

I was just quoting the blog as additional information on why the XML Schema specification allows parsers to report such XML instance documents as invalid, as I have explained in my comment. This was in response to "If you can find your notes, I would be interested."

It may be an INSPIRE ETS issue depending on the requirements tested by the ETS, but right now it is not an ETF issue (this GitHub issue is in an ETF repository, not an INSPIRE Validator repository). If ETSs (from INSPIRE or elsewhere) require that test assertions control the schema-validity assessment process more than the standard parser allows then it may become an ETF issue in the future. But for now the questions you raise are for the MIG, not for ETF.

That said, I would think that any requirement in open environments like INSPIRE that mandates the use of a certain schema-validity assessment regime, in particular if it is not the default in standard software, is not a good idea. This is why OGC changed its approach to XML schemas and namespaces around 2004/2005 (see Policy Directives 19 to 26) to support all assessment approaches as well as possible once we understood the implications of the schema-validity assessment rules in the XML Schema specification. But that is, of course, for the MIG to decide.

PeterParslow commented 5 years ago

Thanks for the explanation - in both the response to 'if you can find your notes', and also this one.

I hadn't spotted the difference between the ETS issues on GitHub and this one - the ETF issues.

Should this entire conversation have been in https://github.com/inspire-eu-validation/ets-repository? Like https://github.com/inspire-eu-validation/ets-repository/tree/master/metadata/xml

cportele commented 5 years ago

I think the idea for the INSPIRE validator is to have all discussions in https://github.com/inspire-eu-validation/community (see the readme), but @fabiovin should be able to help here.