Closed eamonnkenny closed 1 month ago
This actually works...
from io import StringIO
import requests
from lxml import etree

schema = requests.get('https://tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng').text
schema = schema.replace('<?xml version="1.0" encoding="UTF-8"?>', '<?xml version="1.0"?>', 1)
# load the schema into LXML
relaxng_doc = etree.parse(StringIO(schema))
Removing encoding="UTF-8" from the XML declaration allows the schema to load. Previously the encoding was written in lower case as "utf-8", which is why my code broke after 8th July 2024.
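The exact-string replace above matches only one spelling of the declaration. A slightly more tolerant sketch, assuming the goal is simply to drop the encoding pseudo-attribute from an already decoded string before handing it to lxml, could use a regular expression. The helper name strip_encoding_declaration is hypothetical; it is not part of lxml or requests.

```python
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration,
# whatever the case of its value ("UTF-8", "utf-8", ...) or the quote style.
_ENCODING_ATTR = re.compile(r'''\A(<\?xml[^>]*?)\s+encoding\s*=\s*(["'])[^"']*\2''')


def strip_encoding_declaration(xml_text: str) -> str:
    """Return xml_text with the encoding pseudo-attribute removed, if present."""
    return _ENCODING_ATTR.sub(r"\1", xml_text, count=1)


upper = strip_encoding_declaration('<?xml version="1.0" encoding="UTF-8"?><grammar/>')
lower = strip_encoding_declaration("<?xml version='1.0' encoding='utf-8'?><grammar/>")
untouched = strip_encoding_declaration("<grammar/>")
```

Text without a declaration passes through unchanged, so the helper can be applied unconditionally.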
@eamonnkenny Our .rng files are produced from an XSLT build process that outputs Unicode. Your suggestion that we check for non-Unicode characters by running character.encode("ascii") would simply convert Unicode characters to ASCII and would raise errors whenever a character isn't in the ASCII set. I don't think that's the issue you're seeing. That is, the TEI schema files did not suddenly become nonconformant with Unicode in the July 8 release.
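To illustrate the distinction, here is a small sketch contrasting an ASCII check with a UTF-8 validity check. The sample string is hypothetical and is not taken from tei_all.rng; it only assumes that the file contains some legitimate non-ASCII character.

```python
def is_ascii(text: str) -> bool:
    """True only if every character fits in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "TEI \u00a9 Consortium"          # hypothetical text with a copyright sign
sample_bytes = sample.encode("utf-8")

ascii_ok = is_ascii(sample)               # False: the sign is not ASCII
utf8_ok = is_valid_utf8(sample_bytes)     # True: the bytes are still valid UTF-8
broken_ok = is_valid_utf8(b"\xa9")        # False: a lone continuation byte

# Reading UTF-8 bytes as Latin-1 makes an extra character appear before the sign.
mojibake = sample_bytes.decode("latin-1")
```

A failing ASCII check therefore says nothing about whether a file is valid UTF-8. The last line shows one way a stray accented "A" can appear when UTF-8 bytes are displayed with the wrong encoding, which may or may not be what was seen here.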
You did find the meaningful change that broke your lxml etree processing in Python. Our release and commit history confirms that we introduced a change such that the XML declaration in our released custom RNG files now reads:
<?xml version="1.0" encoding="UTF-8"?>
In the TEI Vault, the RNG files from the November 2023 release show this line instead:
<?xml version="1.0" encoding="utf-8"?>
I traced the change to line 61 of odd2relax.xsl in the TEI Stylesheets, in this commit from last fall, which was included in the July 2024 release: https://github.com/TEIC/Stylesheets/commit/a5d1c3ca64450ca71aa61aafd8e1cddea52aa69d#diff-7d51841f44345b1b4dbbf5ab864e16edbb333bf3bdf437b72841578c8515db42
The change was simply to our XSL output line in odd2relax.xsl:
Before: <xsl:output encoding="utf-8" indent="yes" method="xml"/>
Now: <xsl:output encoding="UTF-8" indent="yes" method="xml"/>
We updated our XSLT stylesheets from 2.0 to 3.0 (including this one) close to the time this change was introduced, but I notice that not all of our XSLT files share this output encoding line--so I don't think we'd intended systematically to change all of the XSLT output encoding attributes, though perhaps we should for consistency.
The unintended consequence is the problem that you experienced with lxml parsing. You found a way to bypass this (yay), but there seems to be a separate underlying issue here with how XML is passed to lxml. This documentation from lxml might be helpful:
https://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
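As a rough sketch of what that FAQ entry suggests, the schema can be handed to lxml as bytes, so the parser reads the encoding declaration itself and nothing has to be stripped. The tiny inline schema below is a stand-in, not the TEI schema, and load_relaxng is a hypothetical helper name. With requests, the bytes would come from response.content rather than response.text.

```python
from lxml import etree

# Stand-in for a downloaded .rng file, kept as bytes with its declaration intact.
RNG_BYTES = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<element name="note" xmlns="http://relaxng.org/ns/structure/1.0">\n'
    b'  <text/>\n'
    b'</element>\n'
)


def load_relaxng(schema_bytes: bytes) -> etree.RelaxNG:
    """Parse the schema from bytes and compile it into a RelaxNG validator."""
    return etree.RelaxNG(etree.fromstring(schema_bytes))


validator = load_relaxng(RNG_BYTES)
ok = validator.validate(etree.fromstring(b"<note>hello</note>"))
bad = validator.validate(etree.fromstring(b"<memo/>"))

# The same content as a decoded str, declaration included, is rejected outright.
try:
    etree.fromstring(RNG_BYTES.decode("utf-8"))
    str_rejected = False
except ValueError:
    str_rejected = True
```

This sketch uses etree.fromstring rather than etree.parse(StringIO(...)), so it does not reproduce the exact failure reported above. It only shows the bytes-based route the FAQ points to.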
There's a post on "Coding Out Loud" about case sensitivity in encoding lines, pointing out why the upper-case UTF-8 is preferred: https://blog.codingoutloud.com/2009/04/08/is-utf-8-case-sensitive-in-xml-declaration/
Hope this helps.
This is pretty wild! The lxml FAQ entry linked above says that if a unicode string declares an XML encoding internally (<?xml encoding="..."?>), “parsing is bound to fail, as this encoding is almost certainly not the real encoding used in Python unicode”. That sentence strongly implies that “Python unicode” is somehow not ISO 10646/Unicode compliant, which would be weird, but is certainly feasible. But combined with “Note that Python uses different encodings for unicode on different platforms”, it makes me think that Python is re-coding input from (in our case) UTF-8 to some internal encoding (perhaps UTF-16 or UCS-4, which is called UTF-32 these days), but does not change the encoding declaration to match. (And there is no reason Python should or would change the content of a string because it is changing its encoding; but perhaps the lxml.etree code should change it, idunno.) I hope this helps, but even if it doesn’t, it has been an interesting ride.
P.S. Note that the reason, I suspect, that so many of our files used to use “utf-8” instead of “UTF-8” is that Emacs/nxml inserts the lowercase version. That was Sebastian’s favorite editor.
About this troubling matter of how Python processes XML: I agree it sounds nuts to those of us familiar with the XML stack. But lxml's documentation reflects real constraints on how it has to process XML, so you do want to follow its requirements. You may have a different experience using the SaxonC library for processing XML with XPath and XQuery in Python.
All TEI RNG files, e.g. tei_all.rng, located at:
tei_all.rng
are broken for loading into the lxml.etree parser in Python on Linux because they contain non-UTF-8 characters. For instance, after the word "Version" on line 9 there is a hidden character that looks a bit like an A. This is in most files.
Given that line 1 states that the .rng files are using encoding="UTF-8", this is a definite bug.
It is worth validating your validator files before releasing by running:
character.encode("ascii")
for all characters in your files. This will pinpoint all the issues for you.
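A minimal sketch of such a per-character scan, assuming the file has already been read and decoded into a string. The function name find_non_ascii and the sample text are hypothetical. Note that a hit only means the character is outside ASCII, not that the file is invalid UTF-8.

```python
import unicodedata


def find_non_ascii(text: str):
    """Yield (line_no, column, char, unicode_name) for each non-ASCII character."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for column, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield line_no, column, ch, unicodedata.name(ch, "UNKNOWN")


# Hypothetical sample: a no-break space hiding after the word "Version".
sample = "line one\nVersion \u00a0x\n"
hits = list(find_non_ascii(sample))
```

Printing the Unicode name alongside the position makes it easier to tell an invisible but legitimate character, such as a no-break space, from genuine encoding damage.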