CustomXMLProvider internal XML check is flawed

ivaklinov commented 4 years ago

The sanity check performed at: https://github.com/ZUGFeRD/mustangproject/blob/a549213d95ebf48ea59378000682a5cd9000e842/src/main/java/org/mustangproject/ZUGFeRD/CustomXMLProvider.java#L31 https://github.com/ZUGFeRD/mustangproject/blob/a549213d95ebf48ea59378000682a5cd9000e842/src/main/java/org/mustangproject/ZUGFeRD/CustomXMLProvider.java#L32 Has some flaws:

The execution of String zf = new String(newData); creates a new String converting bytes to chars with the default system encoding, which may be different from the XML data encoding, so the result is OK most of the time but may be quite unexpected on weirdly configured OSes.

The check for the existence of '<rsm:CrossIndustry' is too strict. I have seen XML files from Intarsys samples that have no name space prefix at all, and yet are valid, because the XML starts as:

<CrossIndustryInvoice xmlns:qdt="urn:un:unece:uncefact:data:standard:QualifiedDataType:100" xmlns:ram="urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100" xmlns:udt="urn:un:unece:uncefact:data:standard:UnqualifiedDataType:100" xsi:schemaLocation="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100 file:///C:/Users/joerg/OneDrive/_work/VDA/Testdaten/XRechnung/XRechnung.xsd" xmlns="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ExchangedDocumentContext>
    <ram:BusinessProcessSpecifiedDocumentContextParameter>
        <ram:ID>Beispielgeschäftsprozess</ram:ID>
    </ram:BusinessProcessSpecifiedDocumentContextParameter>
    <ram:GuidelineSpecifiedDocumentContextParameter>
        <ram:ID>urn:cen.eu:en16931:2017#conformant#urn:zugferd.de:2p0:extended</ram:ID>
    </ram:GuidelineSpecifiedDocumentContextParameter>
</ExchangedDocumentContext>
...

jstaerk commented 4 years ago

Hi, RE 1) I would welcome a default of UTF8, and there is some indication that this could be java default (https://docs.oracle.com/javase/tutorial/i18n/text/string.html) but feel free to contribute a test case and/or a PR, e.g. with new String(utf8Bytes, "UTF8");

RE 2) Thanks for raising this question, but please allow me to follow up, AFAIK the namespace prefixes are fixed for un/cefact, which I can show both by indication [neither (UN/CEFACT) Xrechnung (https://github.com/itplr-kosit/validator-configuration-xrechnung/blob/master/src/test/instances/cii001.xml) nor the CEN (https://github.com/ConnectingEurope/eInvoicing-EN16931/tree/master/cii/examples) has different prefixes and FacturX AFAIK had some in 1.0.2, but corrected them in 1.0.3)] and try to show by specification (http://www.unece.org/fileadmin/DAM/cefact/xml/XML-Naming-And-Design-Rules-V2_1.pdf unfortunately only says "will be" and not "must be") but I will use the occasion to ask and come back when I know more.

As a matter of fact, I had already raised this issue to the person I deem responsible for the examples and I still expect him to work on that (so far he said he didn't have time, not he wouldn't fix it).

PS: In no way I endorse fixed namespace prefixes, IMO it's a contra intuitive modification of basic XML defaults, hardly documented with questionable reasoning and even potentially contraproductive, but I would like to ask and confirm before I change this. If you are right this would also affect the validator, https://github.com/ZUGFeRD/ZUV/.

ivaklinov commented 4 years ago

Hi Jochen, RE-RE 1) The default encoding of a JVM is set via property file.encoding. This value is typically UTF-8 on Linux systems, it varies on Windows systems depending on locale etc. However there is no "java default" value. The reason why this problem has not been reported before (I suspect) is that almost all such encodings have 1-byte-per-char and the common chars of the Latin alphabet are in the same place A-Z, a-z (so they are ASCII-compatible for the common alphabet) etc. So in practice the string "<rsm:CrossIndustry" was always found in the document.

However he specification: http://www.unece.org/fileadmin/DAM/cefact/xml/XML-Naming-And-Design-Rules-V2_1.pdf says that:

[R198]All UN/CEFACT XML MUST be instantiated using UTF. UTF-8 should be used as the preferred encoding. If UTF-8 is not used,UTF-16 MUST be used.

if the XML data is UTF-16 encoded, this string conversion will also be improper on most systems because the file.encoding is most likely not going to be UTF-16 !

In addiiton in principal it is possible to have a default JVM encoding that is not ASCII-compatible. Over the years we have had in production cases where Java programs need to run on unusual IBM hardware with EBCDIC default encoding. IBM also confirms this is possible: https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_71/rzaha/fileenc.htm (encodings Cp037 or Cp500). If the code runs on such a system the string "<rsm:CrossIndustry" will not be found (I am sure). So maybe fixing this point is not urgent but is a consideration for the future.

RE-RE 2) I may have been too quick to say that the ZUGFeRD XML starting with:

<CrossIndustryInvoice xmlns:qdt="urn:un:unece:uncefact:data:standard:QualifiedDataType:100" xmlns:ram="urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100" xmlns:udt="urn:un:unece:uncefact:data:standard:UnqualifiedDataType:100" xsi:schemaLocation="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100 file:///C:/Users/joerg/OneDrive/_work/VDA/Testdaten/XRechnung/XRechnung.xsd" xmlns="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    ...

is valid. I took this XML from the samples collection: https://github.com/ZUGFeRD/corpus/tree/master/ZUGFeRDv2/correct/intarsys and it was (evidently) created by some Intarsys product so I assumed it is valid. I will defer this matter to you to find out for sure if the name space prefix rsm: is a must. The way I read the specification: http://www.unece.org/fileadmin/DAM/cefact/xml/XML-Naming-And-Design-Rules-V2_1.pdf

Acronym (mandatory): The abbreviation of the type of component.In this case the value will always be RSM.
...
XML Naming and Design Rules, Version 2.1 Page 397
All root schemas published by UN/CEFACT will be assigned a unique token by BPS to represent the namespace prefix.This token will be prefixed by ‘rsm’

I think you are right ;-) and maybe the XML is not valid.

Generally about the topic of this bug ( 1) and 2) ): I think a better way to perform such a sanity check is to just start a SAX parser and look for the root element. Sadly I have no time to submit a PR now, but may do so in the future.

jstaerk commented 4 years ago

Your comments on Unicode are very interesting, FYI the ZF2 spec says only UTF8 is permitted on pg 17 (is the zf2 spec available in english at all already?). It is still something potentially to be fixed but maybe even more for ZUV. As regards the namespace prefix: 1) nobody can tell me where it comes from and 2) the CEN apparently had the same request and switched to supporting variable namespaces https://github.com/ConnectingEurope/eInvoicing-EN16931/issues/61 so I currently consider the fixed namespace a mere recommendation and yes, in that case yes, Mustang should support different ones.

jstaerk commented 3 years ago

Update: Since the PR https://github.com/ZUGFeRD/ZUV/pull/44/files raised this topic again I have upstreamed it to the ZUGFeRD guys because it does not make sense to allow this in the java code when the schematron check still has fixed namespace prefixes

ZUGFeRD / mustangproject

CustomXMLProvider internal XML check is flawed #140