EMLParser misses validity problems

mbjones commented 5 years ago

The EMLParser fails to detect invalid EML documents during the test phase. Clean up the build and test system, and ensure that all files that are not following the EML rules fail the validation and that the tests detect that. This includes:

[x] Add invalid test files for id/references checks
[x] Add invalid files for annotations
[x] Add test output to understand what is failing
[x] Stop duplicating files in multiple locations unnecessarily

mbjones commented 5 years ago

I completely rewrote the validation in a new EMLValidator class that is massively more efficient. For some reason, the old EMLParser class was repeatedly parsing the XML document for every single XPath lookup (of which there were lots). There was also a lit of complexity in the validation code ad config files that was unneeded, and has now been boiled down to a simple validate function.

Need to compare the results of these tests with the independently implemented tests in the R emld package. @cboettig, could you please take a look and see if it makes sense to you how I did it in Java? And do you get the same results against the test files in the EML src/test/resources directory?

cboettig commented 5 years ago

@mbjones thanks! emld includes a copy of the src/test/resources directory already that it tests against. I'll sync over these updates and the test suite should mostly just pick these up. Will need to probably write a separate test script for the invalidEML cases. I'm a bit backlogged as usual but hope to get through that soon.

emld is currently nearly through rOpenSci review, which means we should be able to get a candidate on CRAN and update EML on CRAN relatively soon too (hoping to have this done close to when EML 2.2.0 ships!)

mbjones commented 5 years ago

Thanks @cboettig. You can see how I'm using the three directories for testing in the EMLValidator test, but it boils down to:

src/test/resources: contains valid EML documents exercising a bunch of features (I wish they were ore systematic about it, but its a hodgepodge)
src/test/resources/invalidEML: contains EML docs that have one (and only one) problem with one of the extended validation rules (these are still schema valid, just not following EMLs validation rules)
src/test/resources/moduleEML: contains docs that exercise one particular EML module and are not rooted at eml, so they wouldn't normally be considered valid EML, but they are schema valid against the XSD schemas

mbjones commented 5 years ago

After discussion, I have reintroduced the EMLParser class for backwards compatibility with older versions of EML (prior to 2.2.0). Now, calling EMLParser(file) and EMLParser(string) will delegate validation to EMLValidator if the version of the EML file is 2.2.0 or later. Otherwise, it uses the original, buggy logic and implementation within EMLParser. This will allow implementations that rely on the older, buggy logic to continue to function. However, new EML documents SHOULD be validated using the EMLValidator.validate() method to ensure they conform to all EML rules.

Because the old EMLParser still uses its original logic for versions 2.1.1 and before, it is still very slow. For version 2.2.0 and after, it should be faster now due to the new delegated implementation.

At this point, once people review the validation logic, this bug can be closed.

mbjones commented 5 years ago

Jing built and tested the validator and things seem to be working well in terms of backwards compatibility. There are still some issues around getting EML 2.2.0 into Metacat, but it seems the validator is working. See Jing's email below for details.

I built the datamanager-0.9.1.jar file on the eml 2.2 branch and used it to build a Metacat. Then I run the Metacat junit tests against the new Metacat. All tests passed except the expected two failures of the integration replication tests. This is good.

I also used Morpho 1.11.0 to generate EML objects on the Metacat and it worked. Then I used dataone command to create EML objects in the three scenarios:

Valid eml 2.1.1 object - the creation succeeded.

Invalid eml 2.1.1 object with a wrong element name - the creation failed

Invalid eml 2.1.1 object with duplicated id - the creation failed.

So the tests result shows the new datamanager-0.9.1.jar file in Metacat works good with released eml namespaces.

I also imported EML 2.2.0 schema to my local Metacat and tried the above three scenario with eml 2.2.0:

Valid eml 2.2.0 object - the creation succeeded.

Invalid eml 2.2.0 object with a wrong element name - the creation failed

Invalid eml 2.2.0 object with duplicated id - the creation succeeded.

The scenario 3 should fail but it succeeded. I took a look at Metacat code and found it is an issue on Metacat itself - it hasn't the mechanism to launch the eml parser for objects with the eml 2.2.0 namespace. I input a ticket for this issue:

https://github.com/NCEAS/metacat/issues/1316

I also set up an eml parser servlet service. But the service failed when I tried to parse an eml object. Please see this ticket:

https://github.com/NCEAS/eml/issues/331

mbjones commented 5 years ago

Closing validation as complete. If further bugs surface, please enter a new ticket to describe the issue.

NCEAS / eml

EMLParser misses validity problems #328