UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

ocr-transform hocr text: &#x1b is an invalid XML character #185

Closed by jbarth-ubhd 2 months ago

jbarth-ubhd commented 2 months ago
ocr-transform hocr text 010/01.tif/00100156_digit.hocr 
Error on line 19 column 27 of hocr__text.xsl:
  SXXP0003   Error reported by XML parser: Character reference "&#x1b" is an invalid XML
  character.: Character reference "&#x1b" is an invalid XML character.
org.xml.sax.SAXParseException; systemId: file:/usr/local/share/ocr-fileformat/xslt/hocr__text.xsl; lineNumber: 19; columnNumber: 27; Character reference "&#x1b" is an invalid XML character.
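
For context, the parser rejects the reference because XML 1.0 only permits the code points #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, and U+001B (ESC) is outside that set. The following is a minimal sketch, not part of ocr-fileformat, for locating such numeric character references in a stylesheet or input file. The function names and the sample string are made up for illustration.

```python
import re

# Matches numeric character references such as &#x1b; or &#27;
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml10_char(cp):
    """Return True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def invalid_char_refs(xml_source):
    """Return every numeric character reference that XML 1.0 forbids."""
    bad = []
    for match in _CHAR_REF.finditer(xml_source):
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        if not is_xml10_char(cp):
            bad.append(match.group(0))
    return bad


if __name__ == "__main__":
    # Hypothetical snippet, not the actual content of hocr__text.xsl
    sample = "<x>&#x1b;&#x41;&#10;</x>"
    print(invalid_char_refs(sample))
```

Running this over xslt/hocr__text.xsl should point at the reference the parser complains about, assuming it is written as a numeric reference as the error message suggests.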

input file (gzip, base64):

H4sIAAAAAAACA71Y226jVhR9z1fsUqlMpTGcC5dDghm1yYxSadKOKkfp9MXC5iRBdcADJHb6Lf2U
vs2PdR8YGzh2piGumijyAZu172vtOHyzvlvAgyzKNM/GBrWIATKb50ma3YyNy8m7kTDeREfhN2e/
nE4+fngLtxV+/sPlj+9/OgVjZNtX/NS2zyZn8Nv55OI9IABMijgr0woB44Vtv/3ZOAL8MW6ranls
26vVylpxKy9u7Mmv9lrhUQXw5TiqOk9bSZUYaL02io5m5XgPDA2CoHnaUB86XsTKd5kZsD1FRxDe
yjjBVwirtFrIKLSbV3XnTlYxKOCR/HSfPoyN0zyrZFaNJo9LacC8uRoblVxXtjJ0Mr+Ni1JW4/vq
GjNktyhZfCfHZj4vRuVjWck7c/u0WcmylEU8r8C1HIuasPexebyMZ+kCMyDLzsP41nQZ30hQh3lc
yBiaW0X9ukiz+q31dJUXiTotpyt8+NpURkL7S/DhLE8ea6tJ+gDzRVyWLbQJaTI21WmKztXZGZvp
nTJqWDahBP+sKr22CZ4JdT11YZzAbJavgeAvpcQDz6cnsFQoWQ7kBMp5nE0LWQLjRP2ZyvyO/Tqi
xoHZIp//MaWtD0ZtgAVAHaDEZUA5M2oUCJf9IIpNDEUDUDeAmcj7Z2BBWC7jrIunktoAqtOuS0KQ
BodzcDDUWVzKuhBEBb6elumfErhjueoikeVcZglOGojmTry94VvuxokdL5qKNm6oU9cNs3UDI8Ka
YGQKua48MNeM7kJb4Q0AZ31wSqjoBblBp4EZXfbQ+xf/mky2UxPmNzUJeDeXFoY1YrxNKNfTqSVz
SCp5P1p0wnExXAK+aEPFnpVF8fh6cC4dLZeeAEcAcxwQtIUPEP8iLh4Hw7taH3BfpVC4AQivhXeF
GYVlVeTZTfT5LzTSHAdb8zRrAQUeQBAQEJ22cD0z8rnDBsP7et8R7DenbYitAezqY+v3QzqP62OM
HU58CLxA0cH/N8dCn2PNj+2ocTOilGpJ7V6E9rKhVRt5dS/BlhJJMa7yok+y2hBSoMxTPI5DwLkR
PY23h7D1tAL1g0YTmPsswmZPEvZerGdU2nkahnp9kiEOjGiHZZhWbO8Algm0QjduYKYZcdsyC2pG
5zkuCHLw8FCiTQ/SDEVZ4MoG5T0qu4rLXH73LQ9OyuF2NOnhSkEFBV9JD2U9O6d5gbK/zDFf2Vyu
0s9/DzeniZFPlTmO5IxiQZ1O6l5KoVQTAEG92oLwcVkhok/ScZYMN+DotKlC8GrerHuwa+CHm0y+
oCjujmSrINim0emOkh1Cnq7GGExZAIY5w8nUdNvFkWoZ1KHaSPn9kRKDdiBNjnQ/tp2B6veKWcsi
zSqLBk7w/X9PohrLcLfOOniu6lIxkEW1/GIwuJV4TS0d4j6HR/nTi+9+tGeU3dsP5CHBcLV09qiU
w0g8va/5BzAp1dYE5QdH6mbY7px1Oh0ZMeLHw2dJ02ROSR2n47lNnJ1JuojTZS6Hb4VUkwMn6Oay
07u+sxGE14dMrK8v2ipVoukBrv5b29ZupIrHcGhpWz3xdR2kfEj5GNH37caXOr1+n6henb+rEzy8
iIzuJphzB1wkde6RnpWPcTpcbpmmS26A7nMXPGXI64QhGBJuYuXXw01wXfp4bcIPtBg47sNX1gv1
nDm78qesNL3hdeTJZ7qiH9STQqNMFZ2HY4Ybv6N2os4KPvLaZgy+3oyDFIS5+qhrPnT3MkotyqwB
2rE5hHbzlUtYf28UHf0D+dc3bXETAAA=

stweil commented 2 months ago

Reverting https://github.com/filak/hOCR-to-ALTO/commit/ec3c27f5989920a8ec3b5313cd76cf8671691435 helps.

stweil commented 2 months ago

The conversion from alto to text has the same problem.