UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

ocr-transform alto hocr: HTML, but xmlns=xhtml #184

Open jbarth-ubhd opened 4 months ago

jbarth-ubhd commented 4 months ago

Example input, bzip2-compressed (the payload starts with the `BZh` magic, so it is not gzip) and base64-encoded:

QlpoOTFBWSZTWQ0I/UwAAvTfgERUUGf/97/n3sC/7//6UAVedhYMQaNAaaXQklNMRNU80yaan6ph
Q9J6mnlB6mjRtRoPKepoyAaSNT0j9UAMjQZAA00GgGgAAAkSVNlPSeoaA0xDQyMQDQyA0AaGg4yZ
MmIxMAJkwTIAaMIwBDAFSSaE0yTFMT0Jqj2pih6nlD0jygPUyGCMThMpEeCSRUBGUiRxfz+10Z7I
+0Y9KwRhSQ+tKkfWIpKhhZ887jv+f/E1/zrNFZihDCc0yJfNatKwMMcxblFLX9EiDV903RNeU07g
5HfRKyslJN+VTQ+U9cYNFYJojmaDOePfOYkCyupbBFQwAy9FAIS0Fw4PaXDjatC91faiJxJH7P4T
Dqy9xKB+jITgtHis3WoyDxTmshBE1KqxJ8ufqDfpcukQKDk1hfaExvTwx3rrSQ896qGuUNuFDE1D
SkL4KnXXecQmyAtmtCdCOcoiJzjJl8WK89FWzPZsfBi2zs47nh1O9rdyfKM3kyxrRn1MMGWCr3Yx
jd5WnJkz546K0a8WmE/aXtyfeCeHvr+W/T4o6+/FK+Ky9PkEsi16Sr/CR6aSqT2+W2CT2VPXTCoM
uXLAj3CkZlQUVI+n3+Pu9mqYWvfPJnUxRjkZlMEqllLFIQzMyZk93ZF0IbOK6unoY4iuJkORHct3
r3TpGHh2qSdT9LpATyBUcqxIB3DucJRAdhzMdeFygHAX41Cb2GGlZGx0/Pd5Mevw6N/fszNq+4/F
EYqO6o2hQJN2kFnUylknybsvbhVpYybg9Xb7ZHWkS41TXnGibTWZRhwcUmLB+TSjScK3ZPq6r568
Z6yDveTP6tcWHVAnYru78qRtMp+GJvY6M+T57rHCMkTNazOY7t7Okj9ULTeyDLMmTDRpj2srRl9N
ettH3KtVqOlzu4VGMg86p1gxoUDRDabbaBwEJKAY1notPaqu1C3pc0xZccWSfh/4JeYyhlgLUSyc
6TfHoO7jeqKeapZNitG05YQWKTDaTnvWzl5jJ5ke9JvS+c9RlOAdHVi/Tzt4Wy7pyTiVIZkzMyVM
QInIzp5asS2kyqZGu/ROVtJE54WYK+v2aTGYmjV4xm7WG3dhXAqk06dvLZVYlSVTm3TN4+nHB6Dw
3KwnNbS5TsCZbC3FUToHDrU9erN6GK695Zln4uWK+rdvziLyMvbZIma6Hq2TbEdia2S5+U3ZdSrs
IUpGjrCngl4w4aY3OdBg2W9LRyYZyMIq1ef9eqMqOCR3uc+gV0vfzYOXAXeM8F2PGQ5rRxHl1NK+
gpdvfBtuj7ju885uzzpm3l5sdaw3bJjqMbbQJMQxttJibb6KaUkKrwIr2HLLgy4F5xLJgluZVqrx
lpSVe0wXSKHQ6OLYymMWnq6LxvguijSKkwqPjPTZIC2TnaZkzs4tJMnDDwTE92NSzQSH0rIQMtGe
S25ZGcZMM5jUG05OGnNtqvo5YYJZ2YMtyTXTVJjuK1ZK8iXj3JtBw3J+LXibLdmzw5gmQrYgTF4d
lepMTGUNKSFXzKwSF/dlVvP40vSMEQJEYyaASoKSTrS6TYCkBZAihirCYiA1nY5kQSLCYcr0kKtC
pX+X/F3JFOFCQDQj9TA=
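
To reproduce the report, the attachment first has to be decoded. The payload begins with `QlpoOTFB`, which is base64 for `BZh91A`, the bzip2 stream header. A minimal sketch, assuming the text has been copied out intact; the function name `decode_attachment` is hypothetical and not part of ocr-fileformat:

```python
import base64
import bz2


def decode_attachment(b64_text: str) -> bytes:
    """Decode a base64 string that wraps a bzip2 stream (hypothetical helper)."""
    # Drop line breaks and other whitespace that the issue tracker inserted.
    raw = base64.b64decode("".join(b64_text.split()))
    # bzip2 streams start with the ASCII magic "BZh"; fail early otherwise.
    if not raw.startswith(b"BZh"):
        raise ValueError("payload does not start with the bzip2 magic 'BZh'")
    return bz2.decompress(raw)


if __name__ == "__main__":
    # Round-trip a small stand-in document instead of the real attachment.
    sample = base64.encodebytes(bz2.compress(b"<alto/>")).decode("ascii")
    print(decode_attachment(sample))
```

Pasting the issue's base64 block into `decode_attachment` should yield the original input file, provided no characters were lost in copying.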

Output (PoCoTo complains about missing closing tags in `<meta>` etc.):

<!DOCTYPE HTML><html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <link rel="stylesheet" href="">
      <title>OCR Output</title>
      <meta name="description" content="OCR Output via XSLT of pageXML.">
   </head>
   <body>
      <div class="ocr_page" title="bbox 0 0 5553 7287; image 'image/OCR-D-IMG/00001.tif';"></div>
   </body>
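
The complaint can likely be reproduced with any XML parser: the root element declares the XHTML namespace, but `<meta>` and `<link>` are written as HTML void elements without a closing slash. A minimal sketch using Python's standard library; `xml_error` is a hypothetical helper, the markup is shortened from the output above, and whether PoCoTo parses its input this way is an assumption:

```python
import xml.etree.ElementTree as ET


def xml_error(markup: str):
    """Return the parser's error message, or None if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Shortened form of the reported output: <meta> is never closed.
reported = (
    '<!DOCTYPE HTML><html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><meta name="description" content="OCR Output via XSLT of pageXML.">'
    "</head><body></body></html>"
)

# Same markup with the void element self-closed, as XHTML requires.
self_closed = reported.replace('pageXML.">', 'pageXML."/>')

if __name__ == "__main__":
    print("reported:   ", xml_error(reported))
    print("self-closed:", xml_error(self_closed))
```

The first call reports a parse error at `</head>`, while the self-closed variant parses cleanly. This suggests the output would need XML serialization, not HTML serialization, to match its `xmlns` declaration.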
jbarth-ubhd commented 4 months ago

(Not your problem: html2xhtml writes `xmlns=` twice; even after correcting this, PoCoTo complains about missing page segmentation.)

jbarth-ubhd commented 4 months ago

page→alto & alto→hocr works with PoCoTo