PASTAplus / PASTA

Repository for the Provenance Aware Synthesis Tracking Architecture (PASTA) project.

Certain unicode characters cause quality checking to fail during schemaValidDereferenced verification #36

Closed: servilla closed this issue 6 months ago

servilla commented 3 years ago

Certain Unicode characters cause quality checking to fail during the schemaValidDereferenced verification check, even when no "id/references" dereferencing occurs.

EML supports a mechanism to identify a block of XML content so that it may be reused at a different location within the same document without repeating the block. The source block is identified with an id attribute, and the reuse is expressed with a <references> element whose content is the id string literal. A common use of the "id/references" pattern is with the responsibleParty element: the first time a responsibleParty element is defined, an id attribute is declared; subsequent responsibleParty elements can then reference that id without redefining the entire responsibleParty block. For example:

<creator id="chase_gaucho">
    <individualName>
        <givenName>Chase</givenName>
        <surName>Gaucho</surName>
    </individualName>
</creator>
.
.
.
<contact>
    <references>chase_gaucho</references>
</contact>

To ensure that the reused content is EML schema valid, the quality check expands all references into a new EML document and then re-applies the schema validation. The issue occurs during the expansion phase, where certain Unicode characters (exact set unknown) in the original EML XML are converted into numeric character references corresponding to either their UTF-8 or UTF-16 encoding.
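
To make the expansion concrete, here is a minimal DOM-based sketch of the "id/references" expansion semantics. It is illustrative only: the actual pipeline performs the expansion with an XSLT transform inside DataPackage.dereferenceEML (see the final comment below), and the method name expandReferences is hypothetical.

import javax.xml.xpath.*;
import org.w3c.dom.*;

// Replace each <references> element with copies of the children of the
// element whose id attribute matches the reference text.
static void expandReferences(Document doc) throws XPathExpressionException {
    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList refs = (NodeList) xpath.evaluate("//references", doc, XPathConstants.NODESET);
    for (int i = 0; i < refs.getLength(); i++) {
        Node ref = refs.item(i);
        String id = ref.getTextContent().trim();
        Element source = (Element) xpath.evaluate("//*[@id='" + id + "']", doc, XPathConstants.NODE);
        if (source == null) {
            continue; // dangling reference; the real quality check would flag this
        }
        Node parent = ref.getParentNode();
        NodeList content = source.getChildNodes();
        for (int j = 0; j < content.getLength(); j++) {
            parent.insertBefore(content.item(j).cloneNode(true), ref);
        }
        parent.removeChild(ref);
    }
}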

For example, the Unicode character "mathematical italic small delta" (𝛿, U+1D6FF) is first converted from the literal "𝛿" to the character reference &#x1d6ff; in the original source EML XML, and the dereferencing expansion process then converts &#x1d6ff; into the UTF-16BE references &#55349;&#57087; (the decimal values of the two UTF-16 code units of the surrogate pair). It is the subsequent schema validation of this second conversion that results in an invalid EML XML document (although the first character reference alone would also cause an error due to the replacement). This exception is completely unrelated to the dereferencing validation check.
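
The surrogate values are easy to reproduce: Java's Character.toChars splits any supplementary code point into its UTF-16 code units, and for U+1D6FF those units are exactly the 55349 and 57087 seen above. A minimal demonstration:

public class SurrogatePairDemo {
    public static void main(String[] args) {
        int codePoint = 0x1D6FF; // mathematical italic small delta
        char[] pair = Character.toChars(codePoint); // splits into a surrogate pair
        System.out.printf("&#%d;&#%d;%n", (int) pair[0], (int) pair[1]);
        // prints: &#55349;&#57087;
    }
}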

Note that the exception message added to the schemaValidDereferenced quality check contains the raw reference value "&#55349;"; it is the unescaped ampersand in this message that causes the quality report to fail XSLT conversion from XML to HTML.

Class references:

  1. Initial call for the schemaValidDereferenced check: edu/lternet/pasta/dml/parser/generic/GenericDataPackageParser.java (see ~line 264: emlDataPackage.checkSchemaValidDereferenced(doc, emlNamespace);)
  2. Dereferencing pipeline: edu/lternet/pasta/dml/parser/DataPackage.java (see methods checkSchemaValidDereferenced and dereferenceEML; a hedged sketch of the re-validation step follows below)
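
For orientation, the re-validation step can be sketched with the standard JAXP validation API. This is a hedged approximation, not the code in DataPackage: the schema location shown is illustrative, since the real method resolves the schema from the emlNamespace argument.

import javax.xml.XMLConstants;
import javax.xml.transform.dom.DOMSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;

// Validate the dereferenced document against the EML schema; any
// SAXException thrown here fails the schemaValidDereferenced check.
static void validateDereferenced(Document dereferencedDoc) throws Exception {
    SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = factory.newSchema(
            new java.net.URL("https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd")); // illustrative location
    Validator validator = schema.newValidator();
    validator.validate(new DOMSource(dereferencedDoc));
}
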
servilla commented 3 years ago

Attempts to coerce a correct UTF-8 interpretation of the EML XML have failed. Currently, the only option for accepting data packages with offending Unicode characters is to ignore this specific exception during schema validation of the dereferenced EML XML. Other exceptions during this phase (namely, genuine validation failures) still result in a failed quality check.
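
A hedged sketch of that filtering, assuming the parser surfaces the problem as a SAXParseException whose message contains the raw numeric reference (as described above); the message test is a heuristic, not the project's actual code:

import java.io.StringReader;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Validator;
import org.xml.sax.SAXParseException;

// Tolerate only the surrogate character-reference failure; every other
// exception still fails the quality check.
static void validateIgnoringSurrogateError(Validator validator, String dereferencedEml) throws Exception {
    try {
        validator.validate(new StreamSource(new StringReader(dereferencedEml)));
    } catch (SAXParseException e) {
        if (!e.getMessage().contains("&#")) { // offending messages contain e.g. "&#55349;"
            throw e; // a genuine schema violation still surfaces
        }
    }
}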

servilla commented 3 years ago

A new problem is now evident: although the offending EML displays fine in my local development environment, it does not display on portal-d. Instead, a similar character encoding error appears:

[screenshot: character encoding error on portal-d]

This seems to be an issue with Java 8's handling of the Unicode supplementary characters (see here: https://www.oracle.com/technical-resources/articles/javase/supplementary.html).

Options at this point:

  1. Allow EML with such Unicode characters into PASTA with the knowledge that the full metadata may not display.
  2. Allow the upload attempt of the EML with offending Unicode characters to fail at the schemaValidDereferenced quality check; the accompanying quality report should now display correctly.

I will go with option #2 at this time.

servilla commented 3 years ago

  1. Need to determine the subset of Unicode characters that cause this issue (the supplementary character set begins after 0xFFFF).
  2. Verify that data tables containing Unicode characters are not affected by this issue.

servilla commented 3 years ago

Empirical testing indicates that PASTA will accept Unicode characters between 0x0000 and 0xFFFF (the Basic Multilingual Plane), but fails with anything greater than 0xFFFF.

Tabular data do not seem to be affected by this issue.
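
Given that empirical boundary, a simple screen for affected documents is possible; this is a sketch, not part of PASTA:

// Flag any code point outside the Basic Multilingual Plane (> 0xFFFF),
// the range that empirical testing showed PASTA rejects.
static boolean containsSupplementaryCharacters(String emlXml) {
    return emlXml.codePoints().anyMatch(Character::isSupplementaryCodePoint);
}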

servilla commented 6 months ago

This particular issue is caused by the XSLT processing of the EML XML document string that ensures id and references dereferencing continues to produce schema-valid content. Because Java's internal representation of strings is limited to 16-bit characters, Unicode code points greater than 65,535 (0xFFFF) are broken into surrogate pairs. The schema validation of the UTF-8 output (for some reason) inspects only the high surrogate of the pair, which by itself is not a valid character and results in an exception.

An opportunistic hack presented itself: change the output encoding of this particular transform from UTF-8 to UTF-16 and let the Saxon parser consume native UTF-16 characters, high and low surrogates taken together as one. Lo and behold, this works.
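
A minimal sketch of the workaround, assuming a JAXP transformer backed by Saxon; the stylesheet name dereference.xsl is a placeholder for the project's actual dereferencing stylesheet:

import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Run the dereferencing transform with UTF-16 output so that surrogate
// pairs survive intact instead of being emitted as numeric references.
static String dereferenceAsUtf16(String emlXml) throws Exception {
    Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource("dereference.xsl")); // placeholder stylesheet
    t.setOutputProperty(OutputKeys.ENCODING, "UTF-16"); // the workaround
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    t.transform(new StreamSource(new StringReader(emlXml)), new StreamResult(out));
    return out.toString("UTF-16");
}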