Closed servilla closed 6 months ago
Attempts to coerce correct UTF-8 interpretation of the EML XML have failed. Currently, the only option to ensure acceptance of data packages with offending Unicode characters is to ignore this specific exception during schema validation of the dereferenced EML XML. Other exceptions during this phase (namely, actual schema validation errors) still result in a failed quality check.
A new problem is now evident: although the display of the offending EML works fine in my local development environment, it does not display on portal-d. Instead, it displays a similar character encoding error:
This seems to be an issue with Java 8's handling of the Unicode supplementary characters (see here: https://www.oracle.com/technical-resources/articles/javase/supplementary.html).
Options at this point:
schemaValidDereferenced quality check; the accompanying quality report should now display correctly.
I will go with option #2 at this time.
Empirical testing indicates that PASTA will accept Unicode characters between 0x0000 and 0xFFFF, but fail with anything greater than 0xFFFF.
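The empirical boundary above (accepted up to 0xFFFF, rejected beyond it) matches Java's own definition of a supplementary code point. As a minimal sketch, assuming one only wants to find the offending characters in a string before submission, something like the following could be used. The class and method names are hypothetical and are not part of PASTA.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of PASTA: finds code points above U+FFFF.
final class SupplementaryCheck {

    /** Returns true if the text contains any code point above U+FFFF. */
    static boolean hasSupplementary(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    /** Lists the supplementary code points as "U+XXXX" strings, in order of appearance. */
    static List<String> supplementaryCodePoints(String text) {
        return text.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String bmpDelta = "\u03B4";        // GREEK SMALL LETTER DELTA, inside the BMP
        String mathDelta = "\uD835\uDEFF"; // U+1D6FF, written as its surrogate pair
        System.out.println(hasSupplementary(bmpDelta));
        System.out.println(supplementaryCodePoints(mathDelta));
    }
}
```

Iterating with `codePoints()` rather than `chars()` matters here: `chars()` would report the two surrogates separately instead of the single code point.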
Tabular data seem not to be affected by this same issue.
This particular issue is caused by the XSLT processing of the EML XML document string, which ensures that id and reference dereferencing continues to produce schema-valid code. Because Java represents strings internally as 16-bit characters, Unicode code points greater than 65,535 are broken into surrogate pairs. The schema validation of the UTF-8 surrogate pairs (for some reason) only inspects the high-order half of the pair, which is not a valid character on its own and results in an exception.
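To make the surrogate-pair split concrete, here is a small sketch using only standard `java.lang.Character` methods. The class name is hypothetical, and the code only illustrates how Java stores such a character, not how PASTA or the validator handles it.

```java
// Hypothetical illustration: how Java splits a code point above U+FFFF.
final class SurrogateSplit {

    /** Formats the high and low surrogates of a supplementary code point. */
    static String surrogates(int codePoint) {
        char high = Character.highSurrogate(codePoint);
        char low = Character.lowSurrogate(codePoint);
        return String.format("U+%04X U+%04X", (int) high, (int) low);
    }

    public static void main(String[] args) {
        int delta = 0x1D6FF; // the character discussed in this issue
        String asString = new String(Character.toChars(delta));
        // One code point, but two 16-bit chars in the Java string.
        System.out.println(asString.length());                            // 2
        System.out.println(asString.codePointCount(0, asString.length())); // 1
        System.out.println(surrogates(delta));                            // U+D835 U+DEFF
        // The high surrogate alone is not a character.
        System.out.println(Character.isSurrogate(asString.charAt(0)));    // true
    }
}
```

In decimal the two surrogates are 55349 and 57087.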
An opportunistic hack presented itself: change the output encoding of this particular test from UTF-8 to UTF-16 and let the Saxon parser consume native UTF-16 characters, with the high and low halves taken as one. Lo and behold, this works.
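The encoding switch can be sketched with the standard `javax.xml.transform` API. This is an assumption-laden illustration: it uses the JDK's default identity transformer, not the actual PASTA stylesheet or its Saxon configuration, and the class name and sample document are hypothetical. Whether a supplementary character is written natively or as a character reference depends on the XSLT processor in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: serialize XML with a caller-chosen output encoding.
final class Utf16OutputSketch {

    /** Runs an identity transform and returns the serialized bytes in the given encoding. */
    static byte[] serialize(String xml, String encoding) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // The one change described above: request UTF-16 instead of UTF-8.
            identity.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<title>\uD835\uDEFF</title>"; // U+1D6FF as a surrogate pair
        byte[] bytes = serialize(xml, "UTF-16");
        System.out.println(new String(bytes, StandardCharsets.UTF_16));
    }
}
```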
Certain Unicode characters cause quality checking to fail during the schemaValidDereferenced verification check, even though no "id/references" dereferencing occurs.
EML supports a mechanism to identify a block of XML so that it may be reused at a different location within the same document without repeating the block's content. The block source is identified with an id attribute. The reuse location is marked with a <references> element whose content is the id string literal. A common use of this "id/references" pattern is with the responsibleParty element: the first time a responsibleParty element is defined, an id attribute is declared. Subsequent responsibleParty elements can then reference the id without having to redefine the entire responsibleParty block.
To ensure that the reused content is EML schema valid, the quality check expands all references into a new EML document and then re-applies the schema validation. The issue occurs in the expansion phase of the original EML XML, where certain Unicode characters (set unknown) are converted into the corresponding HTML entity references for either UTF-8 or UTF-16.
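The "id/references" expansion described above can be sketched with the JDK's DOM API. This is a simplified illustration under stated assumptions, not the PASTA implementation (which uses XSLT): the sample document is a made-up, namespace-free fragment, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of id/references expansion; not the PASTA code.
final class DereferenceSketch {

    // Made-up fragment: the contact reuses the creator block by its id.
    static final String SAMPLE =
        "<dataset>"
        + "<creator id=\"person-1\"><individualName><surName>Smith</surName></individualName></creator>"
        + "<contact><references>person-1</references></contact>"
        + "</dataset>";

    /** Replaces each <references> element with a copy of the referenced block's children. */
    static Document dereference(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Index every element that declares an id attribute.
            Map<String, Element> byId = new HashMap<>();
            NodeList all = doc.getElementsByTagName("*");
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                if (e.hasAttribute("id")) byId.put(e.getAttribute("id"), e);
            }
            // Snapshot the live NodeList before modifying the tree.
            List<Element> refs = new ArrayList<>();
            NodeList refNodes = doc.getElementsByTagName("references");
            for (int i = 0; i < refNodes.getLength(); i++) refs.add((Element) refNodes.item(i));
            for (Element ref : refs) {
                Element source = byId.get(ref.getTextContent().trim());
                if (source == null) continue; // unresolved reference: left as-is
                Node parent = ref.getParentNode();
                parent.removeChild(ref);
                NodeList children = source.getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    parent.appendChild(children.item(i).cloneNode(true));
                }
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("dereferencing failed", e);
        }
    }

    /** Counts elements with the given tag name. */
    static int count(Document doc, String tag) {
        return doc.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) {
        Document expanded = dereference(SAMPLE);
        System.out.println(count(expanded, "references")); // 0
        System.out.println(count(expanded, "surName"));    // 2
    }
}
```

After expansion the contact carries its own copy of the name block, so the expanded document can be schema-validated as a whole.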
For example, the Unicode character "small italicized delta" (𝛿, U+1D6FF) is first converted from "𝛿" to the UTF-8 entity reference &#120575; in the original source EML XML, and then the dereferencing expansion process converts the &#120575; into the UTF-16BE entity reference &#55349;&#57087; (the decimal values of the UTF-16 code units). It is the subsequent schema validation of this second conversion that results in an invalid EML XML document (although the first entity reference would also cause an error due to the UTF-8 entity reference replacement). This exception is completely unrelated to the dereferencing validation check.
*Note that the exception message that is added to the schemaValidDereferenced quality check contains the raw encoding value "&#55349;"; it is the ampersand in this message that causes the quality report to fail XSLT conversion from XML to HTML.
Class references:
schemaValidDereferenced check: edu/lternet/pasta/dml/parser/generic/GenericDataPackageParser.java (see ~line 264: emlDataPackage.checkSchemaValidDereferenced(doc, emlNamespace);)
edu/lternet/pasta/dml/parser/DataPackage.java (see methods: checkSchemaValidDereferenced and dereferenceEML)
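Regarding the note above about the raw ampersand breaking the quality report's XSLT conversion: one conventional remedy, offered here only as a sketch and not as what PASTA actually does, is to escape XML-significant characters in the exception message before embedding it in the report. The class and method names are hypothetical.

```java
// Hypothetical helper: escape text before embedding it in an XML quality report.
final class XmlEscapeSketch {

    /** Escapes the characters that are significant in XML text and attribute content. */
    static String escapeXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A message containing a raw character reference, as in the note above.
        System.out.println(escapeXml("invalid character &#55349; in document"));
        // prints: invalid character &amp;#55349; in document
    }
}
```

With the ampersand escaped, the message is carried as literal text and no longer parsed as a character reference.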