NCEAS / eml

Ecological Metadata Language (EML)
https://eml.ecoinformatics.org/
GNU General Public License v2.0
41 stars 15 forks source link

EML parser passes records which are missing customUnit definitions #355

Open mobb opened 4 years ago

mobb commented 4 years ago

The EML parser: https://knb.ecoinformatics.org/emlparser/parse no longer traps datasets which are missing their customUnit definitions in metadata. I.e., this passes the parser (no additionalMetadata section)

:

<eml:eml ...
  <dataset>
   ....
              <unit>
                <customUnit>millimetersPerYear</customUnit>
              </unit>
  ...
  </dataset>
</eml:eml>

Test doc is uploaded (with extension .txt for github) edi.9.0_2.1.1_custom_unit_missing.txt

mbjones commented 4 years ago

I can verify this for both EML 2.1.1 that @mobb sent, as well as for EML 2.2.0 documents. The issue seems to arise from customUnit using a different XPath selector for its references than other key/keyref pairs in the document, and so unit validation was skipped in the new EMLValidator class.

The original EMLParser class (now Deprecated) configuration included the following key/keyref pair check for units:

  <key name="unitIdentifierKey">
    <selector xpath="//unitList/unit"/>
    <field xpath="@id"/>
  </key>
  <keyref name="unitReference" refer="unitIdentifierKey">
    <selector xpath="//unit/customUnit"/>
    <field xpath="."/>
  </keyref>

This was not implemented in EMLValidator and needs to be added. I committed two test files to the develop branch (for 2.1.1 and 2.2.0) that illustrate the problem. The travis checks for EML on the develop branch will start failing until we fix this.

mbjones commented 4 years ago

I commited a fix for this to the develop branch in SHA 4cd1f9d7b8a2a553dd0c8d2ab415e3339118b342.

In the process, I found several erroneous test documents in the test suite, which are now also fixed in the branch. I updated the validation rules documentation to reflect this additional rule which got lost in translation from the previous release. The validation algorithm for units is now:

        // If a customUnit is referenced in a document, it must have a
        // corresponding STMML unit definition in the document with a matching
        // @id
        ArrayList<String> unitIdentifiers = getXPathValues("//*[local-name() = 'unitList']/*[local-name() = 'unit']/@id");
        ArrayList<String> unitReferences = getXPathValues("//unit/customUnit");
        for (String s : unitReferences) {
            if (!unitIdentifiers.contains(s)) {
                errors.add("Invalid: Custom unit definition missing: " + s);
                isValid = false;
            }
        }

@cboettig and @amoeba and @mobb Please review and let me know if that looks correct to you. If so, we'll need to be sure to add a corresponding validation check to emld in R.

This set of changes does not change the EML schema, but it does affect any system that might have done a validation check before accepting content, such as Metacat. @taojing2002 will need to incorporate a new built version of the eml.jar into Metacat, and we'll need to redeploy the online parser on the KNB. We will also need to redeploy the EML site documentation to reflect the updated validation documentation.

Comments appreciated.

mobb commented 4 years ago

This fix looks like it will catch docs with missing unit definitions.

HOWEVER: I am pretty sure that the previous parse behavior was to match the <customUnit> to the XPath unitList/unit/@name, and not to unitList/unit/@id. I only know this as a user, having been bit by it several times. Is there a way to confirm that?

IMO, id is a better choice. But if this is a change in behavior, we should be prepared for some docs that passed the parser previously to now fail.

mbjones commented 4 years ago

Hey @mobb thanks for the feedback. I accidentally started with @name, and found that produces validation errors in EMLParser. The XPath that I showed above from the configuration file shows that we previously used @id, so I stuck with that. The one thing I did do was make the XPath agnostic with respect to namespace, as I found live examples that were both explicit about the STMML namespace and others that omitted it. So, now both are allowed in my revised XPath via use of local-name().

mbjones commented 4 years ago

@taojing2002 When this upgrade is applied to Metacat on our repositories and for DataONE, it may find that documents that previously passed the EMLValidator no longer do so.... which concerns me, and we need a strategy. It would be interesting to find out which documents fail before we deploy it widely (which we might be able to do with a modified quality check with @gothub that runs the new validator on all DataONE content).

amoeba commented 3 years ago

Sam Csik found this example of a doc missing a few custom unit defs that passed Metacat's validation so I'll add it here https://search.dataone.org/view/doi:10.18739/A29882N5H.