Closed jtlz2 closed 4 years ago
The usage information printed after the error message lists the available transformations. Currently there are only alto2.0/alto2.1 to hocr transformations. Try changing the namespace of your file from v3 to v2:
-<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
+<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
and then use
ocr-transform alto2.0 hocr < alto.xml
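If you have many files, the namespace change can be scripted. The following is only a minimal sketch, assuming the root element declares the default namespace exactly as `xmlns="http://www.loc.gov/standards/alto/ns-v3#"`; the function names and the output file name `alto-v2.xml` are made up for illustration and are not part of ocr-transform. It swaps only the `xmlns` declaration and leaves `xsi:schemaLocation` untouched, like the one-line change above.

```python
from pathlib import Path

ALTO_V3_NS = "http://www.loc.gov/standards/alto/ns-v3#"
ALTO_V2_NS = "http://www.loc.gov/standards/alto/ns-v2#"


def downgrade_alto_namespace(xml_text: str) -> str:
    """Replace the default ALTO v3 namespace declaration with the v2 one.

    Plain text substitution on the first xmlns="..." match only;
    xsi:schemaLocation is deliberately left as it is.
    """
    return xml_text.replace(f'xmlns="{ALTO_V3_NS}"', f'xmlns="{ALTO_V2_NS}"', 1)


def rewrite_file(src: str, dst: str) -> None:
    """Hypothetical helper: read src, write the namespace-patched copy to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(downgrade_alto_namespace(text), encoding="utf-8")
```

After calling `rewrite_file("alto.xml", "alto-v2.xml")`, the patched copy can be passed to the same command as above: `ocr-transform alto2.0 hocr < alto-v2.xml`. This only relabels the file as v2; it does not convert any v3-only elements, so it is a workaround rather than a real conversion.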
I will open an issue upstream to also support other versions of ALTO in the transformations.
ALTO v3 & v4 should be supported now.
Have you tried https://github.com/filak/hOCR-to-ALTO/blob/master/alto__hocr.xsl ?
I saw the changes because they broke our test case in one PR. We therefore pinned a fixed commit instead of always using the newest version. But this is still on my todo list. Maybe I can do this now...
The different files alto*hocr look almost identical. I would therefore rather try to write a more general transformation altohocr that is applicable to ALTO files of different versions. Should I try to do that as a PR?
We would then need to change some things here in order to integrate the new file names, but that can be done afterwards.
I can confirm that the transformation of @jtlz2's file now works with the latest version https://github.com/filak/hOCR-to-ALTO/blob/master/alto__hocr.xsl
cf #89
The ALTO file below was generated by Tesseract 4.1.0 and is described in its header as ALTO v3.0.
ocr-validate alto-3-0 alto.xml
is successful:
I would like to convert this file to hOCR for use with hocr-tools, but ocr-transform throws an error when I try*:
ocr-transform alto hocr < alto.xml
(I have also tried various combinations of alto3.0/alto-3.0/alto-3-0...)
What am I doing wrong?
Here is the alto file:
(Image from here)