Open jbarth-ubhd opened 4 months ago
Example input, bzip2-compressed (the blob starts with the bzip2 magic "BZh", not gzip), base64:
QlpoOTFBWSZTWQ0I/UwAAvTfgERUUGf/97/n3sC/7//6UAVedhYMQaNAaaXQklNMRNU80yaan6ph Q9J6mnlB6mjRtRoPKepoyAaSNT0j9UAMjQZAA00GgGgAAAkSVNlPSeoaA0xDQyMQDQyA0AaGg4yZ MmIxMAJkwTIAaMIwBDAFSSaE0yTFMT0Jqj2pih6nlD0jygPUyGCMThMpEeCSRUBGUiRxfz+10Z7I +0Y9KwRhSQ+tKkfWIpKhhZ887jv+f/E1/zrNFZihDCc0yJfNatKwMMcxblFLX9EiDV903RNeU07g 5HfRKyslJN+VTQ+U9cYNFYJojmaDOePfOYkCyupbBFQwAy9FAIS0Fw4PaXDjatC91faiJxJH7P4T Dqy9xKB+jITgtHis3WoyDxTmshBE1KqxJ8ufqDfpcukQKDk1hfaExvTwx3rrSQ896qGuUNuFDE1D SkL4KnXXecQmyAtmtCdCOcoiJzjJl8WK89FWzPZsfBi2zs47nh1O9rdyfKM3kyxrRn1MMGWCr3Yx jd5WnJkz546K0a8WmE/aXtyfeCeHvr+W/T4o6+/FK+Ky9PkEsi16Sr/CR6aSqT2+W2CT2VPXTCoM uXLAj3CkZlQUVI+n3+Pu9mqYWvfPJnUxRjkZlMEqllLFIQzMyZk93ZF0IbOK6unoY4iuJkORHct3 r3TpGHh2qSdT9LpATyBUcqxIB3DucJRAdhzMdeFygHAX41Cb2GGlZGx0/Pd5Mevw6N/fszNq+4/F EYqO6o2hQJN2kFnUylknybsvbhVpYybg9Xb7ZHWkS41TXnGibTWZRhwcUmLB+TSjScK3ZPq6r568 Z6yDveTP6tcWHVAnYru78qRtMp+GJvY6M+T57rHCMkTNazOY7t7Okj9ULTeyDLMmTDRpj2srRl9N ettH3KtVqOlzu4VGMg86p1gxoUDRDabbaBwEJKAY1notPaqu1C3pc0xZccWSfh/4JeYyhlgLUSyc 6TfHoO7jeqKeapZNitG05YQWKTDaTnvWzl5jJ5ke9JvS+c9RlOAdHVi/Tzt4Wy7pyTiVIZkzMyVM QInIzp5asS2kyqZGu/ROVtJE54WYK+v2aTGYmjV4xm7WG3dhXAqk06dvLZVYlSVTm3TN4+nHB6Dw 3KwnNbS5TsCZbC3FUToHDrU9erN6GK695Zln4uWK+rdvziLyMvbZIma6Hq2TbEdia2S5+U3ZdSrs IUpGjrCngl4w4aY3OdBg2W9LRyYZyMIq1ef9eqMqOCR3uc+gV0vfzYOXAXeM8F2PGQ5rRxHl1NK+ gpdvfBtuj7ju885uzzpm3l5sdaw3bJjqMbbQJMQxttJibb6KaUkKrwIr2HLLgy4F5xLJgluZVqrx lpSVe0wXSKHQ6OLYymMWnq6LxvguijSKkwqPjPTZIC2TnaZkzs4tJMnDDwTE92NSzQSH0rIQMtGe S25ZGcZMM5jUG05OGnNtqvo5YYJZ2YMtyTXTVJjuK1ZK8iXj3JtBw3J+LXibLdmzw5gmQrYgTF4d lepMTGUNKSFXzKwSF/dlVvP40vSMEQJEYyaASoKSTrS6TYCkBZAihirCYiA1nY5kQSLCYcr0kKtC pX+X/F3JFOFCQDQj9TA=
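To unpack the attachment locally, a minimal sketch along these lines may help. It checks the magic bytes instead of assuming a compressor, since the blob begins with `QlpoOTFB`, the base64 form of the bzip2 header `BZh91A`. The function name `decode_attachment` is hypothetical and not part of any tool mentioned here.

```python
import base64
import bz2
import gzip


def decode_attachment(text: str) -> bytes:
    """Base64-decode `text`, then decompress according to its magic bytes."""
    # Remove spaces and newlines introduced when the blob was pasted.
    raw = base64.b64decode("".join(text.split()))
    if raw.startswith(b"BZh"):       # bzip2 magic bytes
        return bz2.decompress(raw)
    if raw.startswith(b"\x1f\x8b"):  # gzip magic bytes
        return gzip.decompress(raw)
    return raw  # no known compression header: return the bytes unchanged
```

For example, with the blob saved to a file (the name `input.b64` is only an assumption), `decode_attachment(open("input.b64").read())` should return the original PAGE XML bytes.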
Output (PoCoTo complains about missing closing tags in \<meta> etc.):
<!DOCTYPE HTML><html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <link rel="stylesheet" href=""> <title>OCR Output</title> <meta name="description" content="OCR Output via XSLT of pageXML."> </head> <body> <div class="ocr_page" title="bbox 0 0 5553 7287; image 'image/OCR-D-IMG/00001.tif';"></div> </body>
(Not your problem: html2xhtml writes xmlns= twice. Even after correcting this, PoCoTo complains about missing page segmentation.)
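The closing-tag complaint can be reproduced outside PoCoTo with any strict XML parser. This is only a sketch: it assumes PoCoTo reads hOCR as XML (not confirmed here), and `is_well_formed` is a hypothetical helper built on Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if `markup` parses as XML, False on a parse error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML-style void element, as in the output above: not well-formed XML.
html_style = '<head><meta name="description" content="OCR Output"></head>'
# XHTML-style self-closing element: well-formed XML.
xhtml_style = '<head><meta name="description" content="OCR Output"/></head>'

print(is_well_formed(html_style), is_well_formed(xhtml_style))
```

An unclosed \<meta> or \<link> is legal in HTML but fails an XML parse, which would match the reported complaint if the consumer expects XHTML.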
Converting page→alto and then alto→hocr works with PoCoTo.