ecolabdata / ecospheres-isomorphe

An application for applying XML transformations to the MTECT's Geonetwork catalogs.

Accented character changed by transform #73

Closed by streino 3 weeks ago

streino commented 1 month ago

ecospheres-xslt/xslts/default-record-type.xsl

https://inspire.ternum-bfc.fr/geonetwork/srv/fre/catalog.search#/metadata/f97dec5c-aec2-4c75-9ed6-611ef49fd227

189c189
<    <gmx:FileName xmlns:gmx="http://www.isotc211.org/2005/gmx" src="https://www.ideobfc.fr/geonetwork/images/harvesting/Bloc marque DREAL_Bourgogne-Franche-Comté_RVB_HD.jpg">Logo</gmx:FileName>
---
>    <gmx:FileName xmlns:gmx="http://www.isotc211.org/2005/gmx" src="https://www.ideobfc.fr/geonetwork/images/harvesting/Bloc marque DREAL_Bourgogne-Franche-Comt&#xE9;_RVB_HD.jpg">Logo</gmx:FileName>

This happened on only a few records out of the 3.5K transformed. I didn't look into the cause, but judging from the example above, it might be related to XML attributes?

streino commented 1 month ago

I can't reproduce today... :/

streino commented 1 month ago

Not sure what changed, but I can now reproduce it systematically, and fix it by adding <xsl:output encoding="UTF-8"/> to the XSLT.

However, I'm not exactly sure what's going on, because the input file is UTF-8, the lxml parser isn't given any specific encoding instructions, and etree.tostring is also set to UTF-8.

We can add <xsl:output> to all XSLTs but I'd feel better knowing why this happens on a few files and apparently only on attributes (text in that same XML contains accents and nothing's wrong with them...).

streino commented 3 weeks ago

Turns out lxml relies on xsl:output for any output encoding other than ASCII:

https://lxml.de/xpathxslt.html#xslt-result-objects

"The result is always a plain string, encoded as requested by the xsl:output element in the stylesheet. If you want a Python Unicode/Text string instead, you should set this encoding to UTF-8 (unless the ASCII default is sufficient)."
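
For anyone who wants to check this locally, here is a minimal sketch of the comparison discussed in this thread. It is not the project's actual pipeline: the input document, the identity stylesheet, and the names SOURCE, STYLESHEET and run_transform are all hypothetical, made up for illustration. It runs the same transform with and without <xsl:output encoding="UTF-8"/> and serializes both results with etree.tostring(..., encoding="UTF-8"). The exact bytes produced without <xsl:output> may depend on the lxml/libxml2 versions installed, so no expected output is claimed for that case.

```python
from lxml import etree

# Hypothetical minimal input: an attribute and a text node that both contain "é".
SOURCE = '<root><item src="Comté.jpg">Comté</item></root>'.encode("utf-8")

# Identity stylesheet (illustration only); {output_decl} is replaced either by
# an <xsl:output> element or by an empty string.
STYLESHEET = """\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  {output_decl}
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""


def run_transform(output_decl: str) -> bytes:
    """Apply the identity stylesheet and serialize the result as UTF-8 bytes."""
    xslt = etree.XSLT(etree.XML(STYLESHEET.format(output_decl=output_decl)))
    result = xslt(etree.fromstring(SOURCE))
    return etree.tostring(result, encoding="UTF-8")


without_output = run_transform("")
with_output = run_transform('<xsl:output encoding="UTF-8"/>')

# Printed as bytes reprs so the difference in the attribute value is visible:
# a literal UTF-8 "é" shows up as \xc3\xa9, a character reference as &#xE9;.
print(without_output)
print(with_output)

# Whatever the serialization, both outputs parse back to the same attribute value.
print(etree.fromstring(without_output)[0].get("src"))
print(etree.fromstring(with_output)[0].get("src"))
```

Since &#xE9; is just the character reference for "é", both serializations describe the same XML content. The difference only matters when outputs are compared byte for byte, as in the diff at the top of this issue.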