Encoding issue with the XML parser

tappoz commented 2 years ago

I need to write some fields with UTF-8 characters e.g. <Denominazione>Güügle</Denominazione>. I've taken a valid XML file with some values like the above after having retrieved it from the AdE website (Agenzia Delle Entrate). This XML file is SDI validated.

I would like to store the above encoding in the invoice XML file with some code using the a38 library.
I would like to generate a PDF from the XML file using the command a38tool pdf and the usual FoglioStileAssoSoftware.xsl from AssoSoftware.

However these two things don't seem to work well together.

Take this code snippet where it seems I could use the a38 XML builder instead of LXML:

import os

import a38.fattura

f = a38.fattura.FatturaPrivati12(
    fattura_elettronica_header=a38.fattura.FatturaElettronicaHeader(
        cessionario_committente=a38.fattura.CessionarioCommittente(
            dati_anagrafici=a38.fattura.DatiAnagraficiCessionarioCommittente(
                anagrafica=a38.fattura.Anagrafica(denominazione="Güügle"),
            )
        )
    )
)

real_path = os.path.realpath(__file__)
dir_path = os.path.dirname(real_path)

# this renders as `<Denominazione>G&#xFC;&#xFC;gle</Denominazione>`
filename_lxml = f"{dir_path}/foo-invoice-lxml.xml"
tree = f.build_etree(lxml=True)
with open(filename_lxml, "wb") as out:
    tree.write(out, encoding="utf8")
    # tree.write(out)

# THIS WORKS!!!
# this renders as `<Denominazione>Güügle</Denominazione>`
filename_other = f"{dir_path}/foo-invoice-other.xml"
tree = f.build_etree()
with open(filename_other, "wb") as out:
    tree.write(out, encoding="utf8")
    # tree.write(out)

This way I am able to store an XML file on the file system containing <Denominazione>Güügle</Denominazione>. All good.
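For what it's worth, the two renderings in the comments of the snippet above should carry the same XML content: a numeric character reference such as `&#xFC;` decodes to the same character as a literal `ü`. A minimal check, assuming only the standard library parser (not a38 itself):

```python
import xml.etree.ElementTree as ET

# The same element, once with numeric character references
# and once with literal UTF-8 encoded characters.
with_refs = b"<Denominazione>G&#xFC;&#xFC;gle</Denominazione>"
literal = "<Denominazione>Güügle</Denominazione>".encode("utf-8")

text_from_refs = ET.fromstring(with_refs).text
text_from_literal = ET.fromstring(literal).text

# Both parse to the same Python string.
print(text_from_refs == text_from_literal)
```

So the difference between the two files is one of byte-level representation, which only matters when the goal is to reproduce the original file exactly.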

When I invoke a command like:

 a38tool pdf \
         -f FoglioStileAssoSoftware.xsl \
         -o foo-invoice-other.pdf \
         foo-invoice-other.xml # <--- this is the XML file I generated with the Python code snippet above

Then I get this error:

ERROR uncaught exception
Traceback (most recent call last):
  File "<<...>>/bin/a38tool", line 360, in <module>
    main()
  File "<<...>>/bin/a38tool", line 353, in main
    res = app.run()
  File "<<...>>/bin/a38tool", line 256, in run
    f = self.load_fattura(pathname)
  File "<<...>>/bin/a38tool", line 31, in load_fattura
    tree = ET.parse(pathname)
  File "/usr/lib/python3.8/xml/etree/ElementTree.py", line 1202, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.8/xml/etree/ElementTree.py", line 595, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 42, column 36

That line 42, column 36 in the error is the umlaut in Güügle.
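The thread does not establish the exact cause of this error. One common way to get the same message is a mismatch between the bytes on disk and the encoding the parser assumes. The sketch below reproduces that situation with the standard library only; the Latin-1 input is an assumption for illustration, not something taken from the invoice:

```python
import xml.etree.ElementTree as ET

xml_text = "<Denominazione>Güügle</Denominazione>"

# UTF-8 bytes parse fine: with no declaration, the parser assumes UTF-8.
ok = ET.fromstring(xml_text.encode("utf-8"))

# Latin-1 bytes without a matching declaration are not valid UTF-8,
# so the parser rejects the byte that encodes "ü".
try:
    ET.fromstring(xml_text.encode("latin-1"))
    error = None
except ET.ParseError as exc:
    error = exc

print(ok.text)
print(error)
```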

How can I both:

generate an XML file with UTF8 characters using the a38 library?
and generate a PDF file with the a38tool command?

Some context:

$ pip show lxml | grep Version
Version: 4.7.1
$ pip show a38 | grep Version
Version: 0.1.3
$ python --version
Python 3.8.10

One final thing: the valid invoice I got from the SDI (AdE) did not have <?xml version="1.0" encoding="utf-8"?> as its first row, although it did contain some UTF-8 encoded characters like the ones above.

I found out from the AssoSoftware FAQ http://www.assosoftware.it/faq?catid=0&limit=10&start=50 that they recommend this encoding information to be the first line of the XML file. I am not sure whether this is an issue with AdE/SDI or an overly flexible interpretation. Anyway, here's the snippet:

ENCODING FORMAT 22/02/2019

The SDI technical specifications for the XML invoice do not mandate the use of UTF-8 encoding; however, it is recommended so that the data entered is interpreted correctly. Other encodings (e.g. ISO-8859, CP1252, etc.) should be avoided, as they force data conversions when the file is read. The XML file should open with a first line stating the XML version and the utf-8 encoding:

<?xml version="1.0" encoding="utf-8"?>

spanezz commented 2 years ago

This looks like a bug in a38. Could you send me a test invoice that validates with AdE and breaks a38? Then I can add it to the test suite and see about a fix.

tappoz commented 2 years ago

Yeah, the difficulty is hiding sensitive data while retaining a realistic XML structure. I can try over the weekend to provide a sanitised version of the XML by commenting here :crossed_fingers:

tappoz commented 2 years ago

Apologies for the late reply @spanezz @valholl, I am still battling with the sanitised XML example. However, I think I found some workarounds that could be useful:

Take the VAT details of a company whose name contains non-ASCII characters, e.g. Ørsted A/S with VAT number DK 36213728, from the VIES website https://ec.europa.eu/taxation_customs/vies/vatResponse.html

I can then use this function to flush UTF-8 XML to a file:

import a38.fattura


def flush_xml_to_file(fattura_a38: a38.fattura.FatturaPrivati12, filename: str):
    tree = fattura_a38.build_etree()
    # TODO default_namespace = "ns2" (instead of "ns0")
    with open(filename, "wb") as out:
        # Write the declaration by hand so that it can carry standalone="yes"
        out.write(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n')
        # Suppress the automatic declaration to avoid emitting two of them
        tree.write(out, encoding="utf8", xml_declaration=False)

Note three things:

I created the header <?xml version="1.0" encoding="UTF-8" standalone="yes"?> by hand (apologies for my false claim in the comments above :point_up: this header is indeed contained in the "fattura elettronica" XML coming from AdE).
The original examples in this a38 repo (and the code snippets in the comments above) don't allow flushing XML with the attribute standalone="yes".
One final annoying side effect: I cannot find a way to set the namespace prefix ns2. Although namespace prefixes shouldn't matter, for some reason the XML examples I find on the internet and on the AdE website seem to prefer ns2, and I would like to be able to inject it in order to get back an exact copy of the original (valid) XML content from AdE.

With the workaround above I am able to generate a PDF equivalent of the XML with a command similar to:

a38tool pdf \
        -f <path-to>/styles/FoglioStileAssoSoftware.xsl \
        -o mia_fattura.pdf \
        mia_fattura_output.xml

It would be great to have in a38 a utility function like the above - it would be very handy.

spanezz commented 2 years ago

Hi, I really do not want to have to write the xml declaration by hand, and I suspect the core of the issue is that we need to pass encoding and xml_declaration=True to all tree.write calls.
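As a rough illustration of that suspicion, here is what the standard library's `ElementTree.write` does with and without those arguments. This is a sketch using `xml.etree.ElementTree` on a made-up one-element tree, not the a38 code paths themselves:

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("Denominazione")
root.text = "Güügle"
tree = ET.ElementTree(root)

# Default call: ASCII output, so non-ASCII characters become
# character references, and no XML declaration is written.
default_out = io.BytesIO()
tree.write(default_out)

# Explicit encoding plus a forced XML declaration.
explicit_out = io.BytesIO()
tree.write(explicit_out, encoding="utf-8", xml_declaration=True)

print(default_out.getvalue())
print(explicit_out.getvalue())
```

With the explicit arguments the declaration and the payload agree on the encoding, so there is no need to write the declaration by hand.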

I would like to add to the test suite test cases corresponding to the crash situations that you found (even if with made up xml files), and at that point try to see how to address them in as clean a way as possible.

With regard to namespaces, I would consider trying to get an exact copy of the original XML file a false goal, because there are so many changing aspects in XML encoding that we would need to reimplement our own XML parser in order to preserve the quirks of other encoders. I would keep the goal of, say, making it so that if I run xmlstarlet fo on the original and on the fattura coming out of a38, the results match. That is, both XML files, when normalized, have the same content.
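A comparable "same content after normalization" check can be sketched in Python with `xml.etree.ElementTree.canonicalize` (available from Python 3.8). This swaps in C14N canonicalization for `xmlstarlet fo`, so it is a stand-in rather than the same tool, and the namespace URI and element names below are made up for the example:

```python
import xml.etree.ElementTree as ET

# Two documents that differ in prefix (ns2 vs ns0), whitespace,
# and in how the non-ASCII characters are written.
original = (
    '<ns2:FatturaElettronica xmlns:ns2="urn:example:fattura">\n'
    "  <Denominazione>Güügle</Denominazione>\n"
    "</ns2:FatturaElettronica>"
)
generated = (
    '<ns0:FatturaElettronica xmlns:ns0="urn:example:fattura">'
    "<Denominazione>G&#xFC;&#xFC;gle</Denominazione>"
    "</ns0:FatturaElettronica>"
)


def normalize(xml_data: str) -> str:
    # rewrite_prefixes replaces the prefixes with generated ones;
    # strip_text drops whitespace around text content.
    return ET.canonicalize(xml_data, rewrite_prefixes=True, strip_text=True)


same_content = normalize(original) == normalize(generated)
print(same_content)
```

Note that `strip_text` also trims meaningful leading or trailing whitespace in field values, so it is only appropriate when such whitespace is not significant.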

I can try to add some cases to the test suite (and it's really wrong that we call tree.write without an encoding and without forcing an XML declaration), and then if you still see cases that break after those got addressed, we can see how to add more.

spanezz commented 2 years ago

In tests/test_fattura.py there is a TestSamples test case that tries to load and save samples that are in tests/data/ with all codecs. I added a sample which has plenty of unicode, and so far it works.

Could you extend that test case to include the things that are failing for you?

If you need to tweak your samples to minimize or anonymize them, you can try out the new a38tool edit feature :wink:

tappoz commented 2 years ago

Thanks a lot for these features - now I just need to find the time to try them on my scenario and report back the results :crossed_fingers: :smile:

(BTW, on a separate note regarding the CI pipeline: it would be good to trigger it on each push to master, either direct or from a PR/branch merge. It would also be good to run the currently skipped encryption tests, which need a pre-processing step for the certificates.)

tappoz commented 2 years ago

OK, now the UTF-8 encoding works :tada:

I've upgraded the package with pip install a38==0.1.5 (the release from April 1st 2022)

I've generated an XML file and flushed that to the file system with:

# it would be good to wrap this in a function that accepts an instance
# of `a38.fattura.FatturaPrivati12` and a file path, but fine doing it by hand
tree = fattura_a38.build_etree()
with open(filename, "wb") as out:
    tree.write(out, encoding="utf-8", xml_declaration=True)
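The wrapper mentioned in the comment above could look roughly like this. The name `write_fattura_xml` is hypothetical and not part of a38; the only thing assumed about the invoice object is the `build_etree()` method used in the snippet above. `FakeFattura` is a stand-in so the sketch runs without a38 installed:

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_fattura_xml(fattura, filename):
    """Write ``fattura`` to ``filename`` as UTF-8 with an XML declaration.

    ``fattura`` is any object with a ``build_etree()`` method that
    returns an ElementTree (hypothetical helper, not an a38 API).
    """
    tree = fattura.build_etree()
    with open(filename, "wb") as out:
        tree.write(out, encoding="utf-8", xml_declaration=True)


class FakeFattura:
    """Stand-in used only to exercise the helper without a38."""

    def build_etree(self):
        root = ET.Element("Denominazione")
        root.text = "Güügle"
        return ET.ElementTree(root)


path = os.path.join(tempfile.mkdtemp(), "fattura.xml")
write_fattura_xml(FakeFattura(), path)

# Round-trip: the file parses back to the original text.
reparsed_text = ET.parse(path).getroot().text
print(reparsed_text)
```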

I've formatted the XML file contents with xmllint and checked the differences with the original XML file from AdE with diff --color original.xml generated.xml

I just see the namespace differences now, so the UTF-8 encoding issue is fixed. Also there's no standalone="yes" in that XML header, which I added in the original issue description just because I was messing with those attributes. However, that is not even in the original document from AdE.

I had a look at the UTF-8 tests - they make sense, so given that my scenario now works I am glad I don't have to find a meaningful way to sanitise my XML to provide another example :blush:

Thanks!

Truelite / python-a38

Encoding issue with the XML parser #19