resource structure - difference between eml.xml file published through the IPT and 'does not validate against schema' flag in the data validator

CecSve commented 5 months ago

I recently looked into an archive for a publisher where the data validator tagged the eml.xml file with "The EML document does not validate against the schema", along with the message "Content is not allowed in prolog." I could not see any issues with the file, so I tried publishing the archive through the test IPT, and it published without any errors. I then replaced the eml file in the original archive with the file published through the IPT, and the data validator no longer flags the eml file. I tried comparing the two files (attached as .txt since .xml is not supported), but apart from the structure and content that changed due to the new publication (language, data published etc.) I do not see any difference. eml_ipt.txt eml_wrong.txt

Can someone explain to me what:

the difference is between the two that causes the flag in the validator
the flag description Content is not allowed in prolog. means?
the IPT does that corrects the error (is it documented in code?)

Mesibov commented 5 months ago

@CecSve, you may already have noticed this, but "wrong" begins with the byte order mark typical of UTF-16 big-endian files (U+FEFF) [but the file is UTF-8] and has carriage returns at the end of every line. "right" after IPT processing has neither a BOM nor CRs. "wrong" has its XML introductory text on one CRLF-ending line, "right" has separate lines for the XML declarations.
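
For anyone who wants to check a file for the same two features without a hex editor, here is a minimal Python sketch. It assumes the file really is UTF-8 with a leading BOM (bytes EF BB BF); the function names `describe` and `strip_bom_and_crs` are made up for illustration and are not part of the IPT or the validator. A BOM in front of the XML declaration is a commonly reported trigger for "Content is not allowed in prolog." in Java XML parsers, but whether the carriage returns matter to the validator is not established in this thread.

```python
import codecs


def describe(raw: bytes) -> dict:
    """Report whether the bytes start with a UTF-8 BOM and how many CRLF pairs they contain."""
    return {
        "utf8_bom": raw.startswith(codecs.BOM_UTF8),
        "crlf_lines": raw.count(b"\r\n"),
    }


def strip_bom_and_crs(raw: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM and with CRLF converted to LF."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.replace(b"\r\n", b"\n")


# Hypothetical sample standing in for the start of a "wrong" eml.xml
sample = codecs.BOM_UTF8 + b'<?xml version="1.0"?>\r\n<a/>\r\n'
cleaned = strip_bom_and_crs(sample)
```

To use it on a real file, read it with `open(path, "rb").read()`, pass the bytes to `describe`, and write the result of `strip_bom_and_crs` back out in binary mode.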

CecSve commented 4 months ago

Thank you for explaining this in detail @Mesibov! I could only see that the formatting was off and the CRLF endings so I really appreciate your input on this. I managed to get no errors in the data validator by doing the following in Notepad++:

The file was encoded as UTF-8-BOM so I changed that to UTF-16 BE BOM (just picked one, not aware of the difference)
Converted the file to UTF-8 and saved the file
Replaced the wrong eml.xml in the archive with the Notepad++ processed one
Ran the data validator tool
Fixed the error cvc-complex-type.2.4.b: The content of element 'associatedParty' is not complete. One of '{onlineUrl, userId, role}' is expected. by adding a role for associatedParty, <role>AssociatedParty</role> (since it seems to be required based on the EML schema, and the IPT adds an empty role as well)
Fixed the cvc-enumeration-valid: Value ' daily ' is not facet-valid with respect to enumeration '[annually, asNeeded, biannually, continually, daily, irregular, monthly, notPlanned, weekly, unkown, otherMaintenancePeriod]'. It must be a value from the enumeration. and cvc-type.3.1.3: The value ' daily ' of element 'maintenanceUpdateFrequency' is not valid. by putting it in one line like so <maintenanceUpdateFrequency>daily</maintenanceUpdateFrequency>
Then there were no errors thrown for the EML file
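
The whitespace fix for maintenanceUpdateFrequency could also be scripted instead of edited by hand. The following is only a sketch using Python's standard `xml.etree.ElementTree`; the helper name `trim_element_text` is hypothetical, and it assumes the element is not namespace-qualified. Note that ElementTree may rewrite namespace prefixes when serialising a full EML document, so check the output before putting it back in an archive.

```python
import xml.etree.ElementTree as ET


def trim_element_text(xml_text: str, tag: str = "maintenanceUpdateFrequency") -> str:
    """Strip leading and trailing whitespace from the text of every element named `tag`."""
    root = ET.fromstring(xml_text)
    for element in root.iter(tag):
        if element.text is not None:
            element.text = element.text.strip()
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")


# Hypothetical fragment reproducing the ' daily ' value from the error message
fragment = (
    "<dataset><maintenanceUpdateFrequency>\n daily \n"
    "</maintenanceUpdateFrequency></dataset>"
)
fixed = trim_element_text(fragment)
```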

It is also worth noting that the messy formatting in the IPT will be addressed in the next EML release https://github.com/gbif/gbif-metadata-profile/issues/25

gbif / portal-feedback

resource structure - difference between eml.xml file published through the IPT and 'does not validate against schema' flag in the data validator #5352