Closed asor12 closed 4 years ago
I had a similar problem recently, which was caused by elements missing from the hOCR that the transformations expect. Unfortunately, an hOCR file that passes validation is not guaranteed to convert successfully to ALTO (assuming yours validated). You should check https://github.com/filak/hOCR-to-ALTO/blob/master/hocr2alto2.0.xsl.
The missing page dimensions are related to the format of ocr_page's title attribute: the transformation expects something like title='image "wetzel_reisebegleiter_1901_0021_800px.jpg"; bbox 0 0 800 1095; ppageno 0' (see line 150).
Unfortunately I can't tell why you got the second file. Did you try converting to several ALTO versions, so that the second file is from another attempt? It looks to me as if the first file is OK in principle, but the transformation stopped at some point because it makes a few assumptions about the hOCR file format. The second file looks somewhat broken (note the None as the first string, before the actual OCR words follow). Maybe someone else can comment on it, as I don't have experience with the GUI and it could be an artifact of using it.
Ok, thanks much, that's very helpful - I'll go check out the title attribute then. To make sure I understand, does that mean the transformation expects every "title", as in <div class='ocr_page' lang='unknown' title='bbox 0 0 1420 2068'>, to be in that longer format you described? I'm pretty new to XSL stylesheets, how would you recommend I fix this?
Yes, for some reason the GUI gives me multiple files each time I click download - from two to six files, some empty.
Thanks again!
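One way to approach the title question is to rewrite the ocr_page title before conversion. The following is only a minimal sketch, assuming the longer form quoted above (image, then bbox, then ppageno) is what the stylesheet needs; the function name expand_page_title and the file name page_0001.jpg are hypothetical and not part of any tool mentioned here.

```python
def expand_page_title(title: str, image_name: str, page_no: int = 0) -> str:
    """Return an ocr_page title in the longer 'image; bbox; ppageno' form.

    Assumption: properties are separated by ';' and start with their key.
    """
    # Split the title into its properties, e.g. ['bbox 0 0 1420 2068'].
    props = [p.strip() for p in title.split(";") if p.strip()]
    keys = {p.split()[0] for p in props}

    # Put the image property first so that bbox ends up in second position.
    if "image" not in keys:
        props.insert(0, f'image "{image_name}"')
    # Append a page number if the title does not carry one yet.
    if "ppageno" not in keys:
        props.append(f"ppageno {page_no}")
    return "; ".join(props)


print(expand_page_title("bbox 0 0 1420 2068", "page_0001.jpg"))
```

Applied to the short title from the question, this yields a title whose second property is bbox. A title that already has the long form is returned unchanged.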
Most of the content is missing in your ALTO files. It looks as if you are using a version of the transformation that still expects an ocr_par after an ocr_carea and before an ocr_line. Your hOCR example has no ocr_par, so processing stopped there. However, I extended this in the upstream repo: https://github.com/filak/hOCR-to-ALTO/commit/30d286d0c7980af70081263c291c5f1a733aeb6c#diff-ef44efbc0c65a8ccb92c4768b944d6e5 . Possibly it is not yet updated here.
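To see whether a given hOCR file has this shape, a small check can list the ocr_line elements that do not sit directly inside an ocr_par. This is a rough sketch, assuming the hOCR is well-formed XML that xml.etree.ElementTree can parse; the function name lines_without_par and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def lines_without_par(hocr_text):
    """Return the ids of ocr_line elements whose parent is not an ocr_par."""
    root = ET.fromstring(hocr_text)
    orphans = []
    for parent in root.iter():
        parent_classes = parent.get("class", "").split()
        for child in parent:
            child_classes = child.get("class", "").split()
            if "ocr_line" in child_classes and "ocr_par" not in parent_classes:
                orphans.append(child.get("id"))
    return orphans


# Hypothetical sample: an ocr_line placed directly under an ocr_carea.
sample = """<html><body>
<div class='ocr_page' title='bbox 0 0 1420 2068'>
 <div class='ocr_carea' title='bbox 176 121 1420 2068'>
  <span class='ocr_line' id='line_0' title='bbox 678 121 747 168'>
   <span class='ocrx_word' id='word_0_0'>2T</span>
  </span>
 </div>
</div>
</body></html>"""

print(lines_without_par(sample))
```

A non-empty result suggests the file would hit the older behaviour described above.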
Ah! I was wondering about that and completely agree with @zuphilip. That is most likely the reason why the transformation stopped.
With respect to the format of the title attribute: it only affects ocr_page. For the extraction of the coordinates, multiple subattributes are expected to be encoded in the title, with bbox being the second one. This is what the function mf:getBoxPage does.
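To illustrate why position matters, here is a small sketch of a positional lookup. It is not the implementation of mf:getBoxPage, only an assumption-based Python analogue of "bbox is the second subattribute"; the name bbox_by_position is hypothetical.

```python
def bbox_by_position(title):
    """Read the bbox only if it is the second ';'-separated property."""
    parts = [p.strip() for p in title.split(";")]
    if len(parts) < 2 or not parts[1].startswith("bbox"):
        return None  # nothing usable in second position
    x0, y0, x1, y1 = (int(v) for v in parts[1].split()[1:5])
    return x0, y0, x1, y1


long_title = 'image "wetzel_reisebegleiter_1901_0021_800px.jpg"; bbox 0 0 800 1095; ppageno 0'
short_title = "bbox 0 0 1420 2068"

print(bbox_by_position(long_title))   # bbox found in second position
print(bbox_by_position(short_title))  # None: bbox is first, not second
```

Under this reading, a title consisting of the bbox alone yields no page coordinates, which would be consistent with the missing page dimensions.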
Ok, thanks much. The only xsl stylesheet in the ocr_fileformat tool I find is xslt/alto2.0_alto3.0xsl, which looks very different from yours. I suppose I'll replace this stylesheet with yours and run again, if that sounds like the way to go!
This particular transformation is for alto2.0 -> alto3.0. There should be corresponding files like hocr__altoX.xsl with X being 2.0, 2.1, ... What kind of setup are you using exactly (which OS, docker or manual installation)?
Ah, I see in the Makefile that it was supposed to download all the xsl stylesheets. I am running on Mac OS, and I ran it as a Docker container using docker run --rm -it -p 8080:8080 ubma/ocr-fileformat. I suppose the Docker image does not run the Makefile and install the extra stylesheets? Let me know if I installed it wrong.
Ok, I just updated the Docker image manually, as changes in the external dependencies do not trigger a rebuild. Please try again with the new one (you might have to explicitly delete the old image using docker image rm [IMAGE ID]).
Ok, thanks. I deleted the old image/container and ran docker run --rm -it -p 8080:8080 ubma/ocr-fileformat again. I accessed the GUI via localhost:8080, and when I process the hOCR file above, it gives me an empty XML file. Did I miss any steps?
This is the resulting ALTO-XML from the hOCR quoted in your first post, after the fix:
<?xml version="1.0" encoding="utf-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName/>
</sourceImageInformation>
<OCRProcessing ID="IdOcr">
<ocrProcessingStep>
<processingSoftware>
<softwareName>gcv2hocr.py</softwareName>
</processingSoftware>
</ocrProcessingStep>
</OCRProcessing>
</Description>
<Layout>
<Page ID="" PHYSICAL_IMG_NR="1" HEIGHT="" WIDTH="">
<PrintSpace HEIGHT="" WIDTH="" VPOS="0" HPOS="0">
<ComposedBlock ID="" HEIGHT="1947" WIDTH="1244" VPOS="121" HPOS="176">
<TextLine ID="line_0" HEIGHT="47" WIDTH="69" VPOS="121" HPOS="678">
<String ID="word_0_0" CONTENT="2T" HEIGHT="47" WIDTH="69" VPOS="121" HPOS="678" WC="0"/>
</TextLine>
<TextLine ID="line_1" HEIGHT="34" WIDTH="189" VPOS="184" HPOS="383">
<String ID="word_1_0" CONTENT="Especially" HEIGHT="34" WIDTH="189" VPOS="184" HPOS="383" WC="0"/>
</TextLine>
<TextLine ID="line_2" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="583">
<String ID="word_2_0" CONTENT="during" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="583" WC="0"/>
</TextLine>
<TextLine ID="line_3" HEIGHT="27" WIDTH="53" VPOS="188" HPOS="722">
<String ID="word_3_0" CONTENT="the" HEIGHT="27" WIDTH="53" VPOS="188" HPOS="722" WC="0"/>
</TextLine>
<TextLine ID="line_4" HEIGHT="32" WIDTH="92" VPOS="186" HPOS="796">
<String ID="word_4_0" CONTENT="years" HEIGHT="32" WIDTH="92" VPOS="186" HPOS="796" WC="0"/>
</TextLine>
<TextLine ID="line_5" HEIGHT="34" WIDTH="73" VPOS="184" HPOS="904">
<String ID="word_5_0" CONTENT="1933" HEIGHT="34" WIDTH="73" VPOS="184" HPOS="904" WC="0"/>
</TextLine>
<TextLine ID="line_6" HEIGHT="31" WIDTH="70" VPOS="187" HPOS="1040">
<String ID="word_6_0" CONTENT="1938" HEIGHT="31" WIDTH="70" VPOS="187" HPOS="1040" WC="0"/>
</TextLine>
</ComposedBlock>
</PrintSpace>
</Page>
</Layout>
</alto>
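A quick way to inspect such an output is to list which Page attributes are empty and which words were carried over. The sketch below assumes the ALTO 2 namespace shown in the output above; the function name summarize_alto and the shortened sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns declaration of the ALTO output.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v2#"}


def summarize_alto(alto_text):
    """Return (empty Page attributes, list of String CONTENT values)."""
    root = ET.fromstring(alto_text)
    page = root.find(".//a:Page", ALTO_NS)
    empty = []
    if page is not None:
        empty = [name for name in ("ID", "HEIGHT", "WIDTH") if not page.get(name)]
    words = [s.get("CONTENT") for s in root.iterfind(".//a:String", ALTO_NS)]
    return empty, words


# Shortened, hypothetical sample modelled on the output above.
sample_alto = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
<Layout><Page ID="" PHYSICAL_IMG_NR="1" HEIGHT="" WIDTH="">
<PrintSpace HEIGHT="" WIDTH="" VPOS="0" HPOS="0">
<TextLine ID="line_0" HEIGHT="47" WIDTH="69" VPOS="121" HPOS="678">
<String ID="word_0_0" CONTENT="2T" HEIGHT="47" WIDTH="69" VPOS="121" HPOS="678"/>
</TextLine>
</PrintSpace></Page></Layout></alto>"""

print(summarize_alto(sample_alto))
```

Empty HEIGHT and WIDTH on Page point back to the ocr_page title format discussed earlier, while a populated word list shows that the line content itself was converted.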
PR for fixing the bug :bug: upstream is on the way: https://github.com/filak/hOCR-to-ALTO/pull/14
@asor12 If you are only interested in a single file transformation you can also use an online XSLT tool like http://xslttest.appspot.com/ and copy your hocr file in the first box and the temporary url https://raw.githubusercontent.com/zuphilip/hOCR-to-ALTO/patch-1/hocr2alto2.0.xsl in the second box, then press run transformation.
ok thanks much to both you and Jorg! I'll use http://xslttest.appspot.com/ as a quick fix for now. Quick note, I'm transforming from hocr to alto (I think that xsl is for alto to hocr?). I'll wait for the docker image update to process more of our files.
(Yes, I updated the link above.)
@asor12, it seems like the bug is fixed and the Docker image updated in case you want to try again.
Hi @jmechnich and @zuphilip thanks for updating! I was traveling last week and just got to test now. Here's a sample output ALTO code - does this look right to you? Does it need to be formatted some way, e.g. with newlines instead of one long string, or is this correct? (I haven't worked with ALTO files before, thanks for any guidance)
<?xml version="1.0" encoding="utf-8"?><alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/v2/alto-2-0.xsd"><Description><MeasurementUnit>pixel</MeasurementUnit><sourceImageInformation><fileName>0</fileName></sourceImageInformation><OCRProcessing ID="IdOcr"><ocrProcessingStep><processingSoftware><softwareName>gcv2hocr.py</softwareName></processingSoftware></ocrProcessingStep></OCRProcessing></Description><Layout><Page ID="" PHYSICAL_IMG_NR="1" HEIGHT="" WIDTH=""><PrintSpace HEIGHT="" WIDTH="" VPOS="0" HPOS="0"><ComposedBlock ID="" HEIGHT="1947" WIDTH="1244" VPOS="121" HPOS="176"><TextLine ID="line_0" HEIGHT="47" WIDTH="69" VPOS="121" HPOS="678"><String ID="word_0_0" CONTENT="2T" HEIGHT="47" WIDTH="69" VPOS="121" HPOS="678"/></TextLine><TextLine ID="line_1" HEIGHT="34" WIDTH="189" VPOS="184" HPOS="383"><String ID="word_1_0" CONTENT="Especially" HEIGHT="34" WIDTH="189" VPOS="184" HPOS="383"/></TextLine><TextLine ID="line_2" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="583"><String ID="word_2_0" CONTENT="during" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="583"/></TextLine><TextLine ID="line_3" HEIGHT="27" WIDTH="53" VPOS="188" HPOS="722"><String ID="word_3_0" CONTENT="the" HEIGHT="27" WIDTH="53" VPOS="188" HPOS="722"/></TextLine><TextLine ID="line_4" HEIGHT="32" WIDTH="92" VPOS="186" HPOS="796"><String ID="word_4_0" CONTENT="years" HEIGHT="32" WIDTH="92" VPOS="186" HPOS="796"/></TextLine><TextLine ID="line_5" HEIGHT="34" WIDTH="73" VPOS="184" HPOS="904"><String ID="word_5_0" CONTENT="1933" HEIGHT="34" WIDTH="73" VPOS="184" HPOS="904"/></TextLine><TextLine ID="line_6" HEIGHT="31" WIDTH="70" VPOS="187" HPOS="1040"><String ID="word_6_0" CONTENT="1938" HEIGHT="31" WIDTH="70" VPOS="187" HPOS="1040"/></TextLine><TextLine ID="line_7" HEIGHT="34" WIDTH="49" VPOS="184" HPOS="1132"><String 
ID="word_7_0" CONTENT="the" HEIGHT="34" WIDTH="49" VPOS="184" HPOS="1132"/></TextLine><TextLine ID="line_8" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="1202"><String ID="word_8_0" CONTENT="German" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="1202"/></TextLine><TextLine ID="line_9" HEIGHT="34" WIDTH="61" VPOS="184" HPOS="1330"><String ID="word_9_0" CONTENT="un-" HEIGHT="34" WIDTH="61" VPOS="184" HPOS="1330"/></TextLine><TextLine ID="line_10" HEIGHT="35" WIDTH="187" VPOS="215" HPOS="196"><String ID="word_10_0" CONTENT="employment" HEIGHT="35" WIDTH="187" VPOS="215" HPOS="196"/></TextLine><TextLine ID="line_11" HEIGHT="33" WIDTH="69" VPOS="215" HPOS="394"><String ID="word_11_0" CONTENT="was" HEIGHT="33" WIDTH="69" VPOS="215" HPOS="394"/></TextLine><TextLine ID="line_12" HEIGHT="35" WIDTH="97" VPOS="215" HPOS="475"><String ID="word_12_0" CONTENT="fully" HEIGHT="35" WIDTH="97" VPOS="215" HPOS="475"/></TextLine><TextLine ID="line_13" HEIGHT="35" WIDTH="144" VPOS="215" HPOS="591"><String ID="word_13_0" CONTENT="removed." HEIGHT="35" WIDTH="144" VPOS="215" HPOS="591"/></TextLine><TextLine ID="line_14" HEIGHT="33" WIDTH="84" VPOS="215" HPOS="746"><String ID="word_14_0" CONTENT="Like" HEIGHT="33" WIDTH="84" VPOS="215" HPOS="746"/></TextLine><TextLine ID="line_15" HEIGHT="35" WIDTH="81" VPOS="215" HPOS="845"><String ID="word_15_0" CONTENT="many" HEIGHT="35" WIDTH="81" VPOS="215" HPOS="845"/></TextLine><TextLine ID="line_16" HEIGHT="35" WIDTH="111" VPOS="215" HPOS="947"><String ID="word_16_0" CONTENT="others" HEIGHT="35" WIDTH="111" VPOS="215"
Did you delete the last part or is this really all?
First of all, it should be a valid XML document, but what you posted is incomplete (not all tags are closed, etc.). The indentation does not matter, although an XML document may be easier to read if it is "pretty printed".
Oh I truncated it as the file is really long. Yes, all of the tags should be closed. How would one "pretty print" the XML?
Thanks!
Okay. You can, for example, google "pretty print xml" and choose an online tool for it. In general, the contents of your file look fine. You can also validate the ALTO XML with our tool here.
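As an offline alternative to an online tool, Python's standard library can pretty print XML. This is a minimal sketch using xml.dom.minidom; the function name pretty_print is just for illustration. Because the text is parsed first, a truncated document (unclosed tags) raises an error instead of being printed.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def pretty_print(xml_text):
    """Parse the XML and return it re-indented with two spaces per level."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")


one_line = '<Layout><Page ID=""><TextLine ID="line_0"/></Page></Layout>'
print(pretty_print(one_line))

# A truncated document is not well-formed and cannot be parsed.
try:
    pretty_print('<Layout><Page ID="">')
except ExpatError as err:
    print("not well-formed:", err)
```

For a whole file, read its contents, pass them to pretty_print, and write the result back out.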
Closing this issue because of inactivity. If the problem remains, then feel free to reopen it.
Hi, first thanks for making this tool.
I have questions using the GUI to convert hOCR to Alto XML.
My hOCR file looks as follows:
But the ALTO output from the GUI gives me two xml files, which look like this:
and
I've not worked with ALTO formats before, but I'm thinking it shouldn't look like this? Please let me know what you think, any help would be greatly appreciated!