TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

Broken RNG files after July 8th 2024 release #2586

Closed eamonnkenny closed 1 month ago

eamonnkenny commented 1 month ago

All TEI .rng files (for example, tei_all.rng) located at:

tei_all.rng

fail to load into the lxml.etree parser with Python on Linux because they contain non-UTF-8 characters. For instance, after the word "Version" on line 9 there is a hidden character that looks a bit like an A. This occurs in most of the files.

Given that line 1 states that the .rng files are using encoding="UTF-8", this is a definite bug.

It is worth validating your validator files before releasing by running them through:

for character in file_stored_as_a_single_string:
     try:
          character.encode("ascii")
     except UnicodeEncodeError:
          print( character )  # report each character that is not ASCII

for all characters in your files. This will pinpoint all the issues for you.

eamonnkenny commented 1 month ago

This actually works...

    import requests
    from io import StringIO
    from lxml import etree

    schema = requests.get('https://tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng').text
    schema = schema.replace('<?xml version="1.0" encoding="UTF-8"?>', '<?xml version="1.0"?>', 1)

    # load the schema into LXML
    relaxng_doc = etree.parse(StringIO(schema))

Removing the encoding="UTF-8" allows the schema to load. Previously the encoding was written as "utf-8" in lower case, which is why my code broke after the 8th July 2024 release.
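The literal replace in the workaround matches only one exact spelling of the XML declaration, so any change in letter case or quoting defeats it. A minimal sketch of a case-insensitive variant follows; the helper name strip_encoding_declaration and its regular expression are illustrative assumptions, not part of lxml or of the TEI tooling.

```python
import re

# Hypothetical pattern: finds the encoding pseudo-attribute inside a leading
# XML declaration, whatever its letter case or quote style.
_ENCODING_DECL = re.compile(
    r"""^(<\?xml\b[^>]*?)\s+encoding\s*=\s*(["'])[A-Za-z][A-Za-z0-9._-]*\2"""
)

def strip_encoding_declaration(xml_text: str) -> str:
    """Return xml_text without the encoding declaration, if it has one."""
    return _ENCODING_DECL.sub(r"\1", xml_text, count=1)

# Both spellings seen in the TEI releases reduce to the same declaration.
upper = strip_encoding_declaration('<?xml version="1.0" encoding="UTF-8"?><grammar/>')
lower = strip_encoding_declaration('<?xml version="1.0" encoding="utf-8"?><grammar/>')
print(upper)
print(lower)
```

Text that has no leading XML declaration is returned unchanged, so the helper is safe to call on any schema string before handing it to a parser.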

ebeshero commented 1 month ago

@eamonnkenny Our .rng files are produced from an XSLT build process that outputs Unicode. Your suggested check, running character.encode("ascii") on every character, would not find non-Unicode characters: it simply raises an error for any Unicode character outside the ASCII set, and such characters are perfectly legitimate in a UTF-8 file. I don't think that's the issue you're seeing. That is, the TEI schema files did not suddenly become nonconformant with Unicode in the July 8 release.
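The difference between "not ASCII" and "not valid UTF-8" can be sketched in a few lines of Python. The sample string below is made up for illustration and is not taken from tei_all.rng; the function names are likewise hypothetical.

```python
def non_ascii_characters(text: str) -> list[str]:
    """Characters outside ASCII; these are still valid Unicode."""
    return [ch for ch in text if ord(ch) > 0x7F]

def is_valid_utf8(data: bytes) -> bool:
    """True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

sample = "Version\u00a04.8"        # made-up text containing U+00A0 NO-BREAK SPACE
encoded = sample.encode("utf-8")   # U+00A0 becomes the two bytes C2 A0

print(non_ascii_characters(sample))        # the ASCII test flags the no-break space
print(is_valid_utf8(encoded))              # yet the bytes are well-formed UTF-8
print(is_valid_utf8(b"Version\xa04.8"))    # a lone A0 byte would be malformed
print(repr(encoded.decode("latin-1")))     # same bytes misread as Latin-1
```

The last line may also account for the reported "hidden character that looks a bit like an A": when the UTF-8 bytes C2 A0 are displayed as Latin-1, the first byte shows up as "Â". That is only one plausible explanation, not something the thread confirms.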

You did find the meaningful change that broke your lxml etree processing in Python. Looking at our release and commit history confirms that we did introduce a change such that the xml encoding line on our released custom RNG files now reads

<?xml version="1.0" encoding="UTF-8"?>

From visiting the TEI vault we can find the RNG files from the November 2023 release show this line instead:

<?xml version="1.0" encoding="utf-8"?>

I found the change introduced in the TEI Stylesheets odd2relax.xsl line 61 in this commit from last fall which was included in the July 2024 release: https://github.com/TEIC/Stylesheets/commit/a5d1c3ca64450ca71aa61aafd8e1cddea52aa69d#diff-7d51841f44345b1b4dbbf5ab864e16edbb333bf3bdf437b72841578c8515db42

The change was simply to our XSL output line in odd2relax.xsl:

Before: <xsl:output encoding="utf-8" indent="yes" method="xml"/>
Now: <xsl:output encoding="UTF-8" indent="yes" method="xml"/>

We updated our XSLT stylesheets from 2.0 to 3.0 (including this one) close to the time this change was introduced, but I notice that not all of our XSLT files share this output encoding line--so I don't think we'd intended systematically to change all of the XSLT output encoding attributes, though perhaps we should for consistency.

The unintended consequence is the problem that you experienced with lxml parsing. You found a way to bypass this (yay), but there seems to be a separate issue here with how the XML is passed to lxml. This documentation from lxml might be helpful:

https://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings

There's a post on "Coding Out Loud" about case sensitivity in encoding lines, pointing out why the upper-case UTF-8 is preferred: https://blog.codingoutloud.com/2009/04/08/is-utf-8-case-sensitive-in-xml-declaration/

Hope this helps.

sydb commented 1 month ago

This is pretty wild!

  1. Yes, we changed our character encoding names from "utf-8" to "UTF-8" because although the former is accepted by almost all XML software (although apparently not Python/lxml.etree), it is the latter that SHOULD be used per the XML Spec.
  2. The character after “Version” on line 9 is a U+00A0 (a NO-BREAK SPACE character). It is very commonly used in both XML and HTML applications. It is outside the ASCII range, but XML parsers are not reading ASCII, they are reading Unicode (in this case, in UTF-8). So I am not sure, @eamonnkenny, why removing the encoding declaration would change the behavior of the parser. Sounds like it might be a bug in Python/lxml.etree.
  3. About the lxml FAQ entry that @ebeshero referred to… With the caveats that I am not a Python programmer and I have not read the entire section carefully, I would be very cautious about taking advice from documentation whose first two sentences are so horribly incorrect. (XML is not a sequence of bytes, it is a sequence of characters; to summarize, the XML Spec says “XML processors MUST accept … any Unicode character, excluding the surrogate blocks, U+FFFE, and U+FFFF.”.) This FAQ entry also gives advice which outside the Python/lxml.etree world would generally be considered suspect if not outright nuts, namely to open an XML document as binary, not text. It further asserts that “if the unicode string declares an XML encoding internally (<?xml encoding="..."?>), parsing is bound to fail, as this encoding is almost certainly not the real encoding used in Python unicode”, a sentence which strongly implies that “Python unicode” is somehow not ISO 10646/Unicode compliant. That would be weird, but is certainly feasible. But that, combined with “Note that Python uses different encodings for unicode on different platforms”, makes me think that Python is re-coding input from (in our case) UTF-8 to some internal encoding (perhaps UTF-16 or UCS-4, which is called UTF-32 these days), but does not change the encoding declaration to match. (And there is no reason Python should or would change the content of a string because it is changing its encoding; but perhaps the lxml.etree code should change it, idunno.)
  4. As @ebeshero pointed out, since our RELAX NG files are generated from XSLT, it would be quite difficult to get any malformed Unicode characters in there. Something could go awry in a file transfer along the way, but that would likely affect only 1 file, not a whole bunch. As for the copy of tei_all.rng you (@eamonnkenny) were using, it has 624,989 occurrences of 108 distinct Unicode characters, of which only 35 are occurrences of 15 distinct Unicode characters that are not also ASCII characters. See the character count for that file. Worth noting that the character counting routine I used would fail miserably if there were any malformed Unicode characters in the input.
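A character census like the one described in point 4 can be approximated in Python as follows. This is only a sketch: it is not the counting routine used for the figures above, the function name character_census is hypothetical, and the sample string is made up rather than read from tei_all.rng.

```python
from collections import Counter
import unicodedata

def character_census(text: str) -> dict:
    """Count all characters and separately summarise the non-ASCII ones."""
    counts = Counter(text)
    non_ascii = {ch: n for ch, n in counts.items() if ord(ch) > 0x7F}
    return {
        "total": sum(counts.values()),
        "distinct": len(counts),
        "non_ascii_total": sum(non_ascii.values()),
        "non_ascii_distinct": len(non_ascii),
        "non_ascii_names": {
            f"U+{ord(ch):04X}": unicodedata.name(ch, "<unnamed>")
            for ch in non_ascii
        },
    }

sample = "Version\u00a04 \u201cTEI\u201d"   # made-up sample text
report = character_census(sample)
print(report["total"], report["distinct"])
print(report["non_ascii_names"])
```

To run it on a real schema, the file would first have to be decoded, for example with open(path, encoding="utf-8").read(), and that step itself raises UnicodeDecodeError if the file contains malformed UTF-8.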

I hope this helps, but even if it doesn’t, it has been an interesting ride.

P.S. Note that the reason, I suspect, that so many of our files used to use “utf-8” instead of “UTF-8” is that Emacs/nxml inserts the lowercase version. That was Sebastian’s favorite editor.

ebeshero commented 1 month ago

About this troubling matter of how Python processes XML, I agree it sounds nuts to those of us familiar with the XML stack. But the lxml documentation does reflect how lxml has to process XML, so you do want to follow its requirements. You may have a different experience using the SaxonC library for processing XML with XPath and XQuery in Python.
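The distinction that the lxml FAQ appears to rely on, between undecoded bytes and an already-decoded Python string, can be illustrated with the standard library alone. The document below is made up for illustration, and the sketch does not call lxml itself.

```python
declaration = '<?xml version="1.0" encoding="UTF-8"?>'
document = declaration + "<p>Version\u00a04</p>"   # made-up document

as_bytes = document.encode("utf-8")   # what is stored on disk or sent over HTTP
as_text = as_bytes.decode("utf-8")    # what Python holds after decoding

# U+00A0 occupies two bytes in UTF-8 but is a single code point in a str,
# so the declared encoding describes as_bytes, not as_text.
print(type(as_bytes).__name__, len(as_bytes))
print(type(as_text).__name__, len(as_text))
```

With the requests call used earlier in the thread, the response's .content attribute holds the undecoded bytes and .text holds a decoded str. As I read the FAQ entry, handing the parser the bytes lets it apply the declared encoding itself, which would avoid having to edit the declaration at all.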