UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Converting hOCR to Alto #96

Closed asor12 closed 4 years ago

asor12 commented 5 years ago

Hi, first thanks for making this tool.

I have questions using the GUI to convert hOCR to Alto XML.

My hOCR file looks as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="unknown" lang="unknown">
  <head>
    <title>None</title>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    <meta name='ocr-system' content='gcv2hocr.py' />
    <meta name='ocr-langs' content='unknown' />
    <meta name='ocr-number-of-pages' content='1' />
    <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_line ocrx_word ocrp_lang'/>
  </head>
  <body>
    <div class='ocr_page' lang='unknown' title='bbox 0 0 1420 2068'>
        <div class='ocr_carea' lang='unknown' title='bbox 176 121 1420 2068'>
            <span class='ocr_line' id='line_0' title='bbox 678 121 747 168; baseline 0 -5'>
                <span class='ocrx_word' id='word_0_0' title='bbox 678 121 747 168'>2T</span>
            </span>
            <span class='ocr_line' id='line_1' title='bbox 383 184 572 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_1_0' title='bbox 383 184 572 218'>Especially</span>
            </span>
            <span class='ocr_line' id='line_2' title='bbox 583 184 697 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_2_0' title='bbox 583 184 697 218'>during</span>
            </span>
            <span class='ocr_line' id='line_3' title='bbox 722 188 775 215; baseline 0 -5'>
                <span class='ocrx_word' id='word_3_0' title='bbox 722 188 775 215'>the</span>
            </span>
            <span class='ocr_line' id='line_4' title='bbox 796 186 888 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_4_0' title='bbox 796 186 888 218'>years</span>
            </span>
            <span class='ocr_line' id='line_5' title='bbox 904 184 977 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_5_0' title='bbox 904 184 977 218'>1933</span>
            </span>
            <span class='ocr_line' id='line_6' title='bbox 1040 187 1110 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_6_0' title='bbox 1040 187 1110 218'>1938</span>
            </span>

But the ALTO output from the GUI gives me two xml files, which look like this:

<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/v2/alto-2-0.xsd">
   <Description>
      <MeasurementUnit>pixel</MeasurementUnit>
      <sourceImageInformation>
         <fileName/>
      </sourceImageInformation>
      <OCRProcessing ID="IdOcr">
         <ocrProcessingStep>
            <processingSoftware>
               <softwareName>gcv2hocr.py</softwareName>
               <softwareVersion>gcv2hocr.py</softwareVersion>
            </processingSoftware>
         </ocrProcessingStep>
      </OCRProcessing>
   </Description>
   <Layout>
      <Page ID="" PHYSICAL_IMG_NR="1" HEIGHT="" WIDTH="">
         <PrintSpace HEIGHT="" WIDTH="" VPOS="0" HPOS="0">
            <ComposedBlock ID="" HEIGHT="1947" WIDTH="1244" VPOS="121" HPOS="176"/>
         </PrintSpace>
      </Page>
   </Layout>
</alto>

and

<?xml version="1.0" encoding="utf-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto.xsd">None2TEspeciallyduringtheyears19331938theGermanun-employmentwasfullyremoved.LikemanyothershealsothoughtthatNationlasocialismvouldcauseaneconomicrisejoiningtheSAinApril1937Inforeigncountriestoo,Nationalsocialismwasnotrecognizedinitslasterfectsinthosedays.Imayremindyouofthefactthate.g.LordRothermeredevotedaspecialcopyofthe"DailyMailtotheNSDAPandaman1iaeMrWinstonChurchillwritesinhisreminiscences:"AtthattimeIhadnonationalprejudicesagainstHitler.Iknewbut1ittleofhisopinionoflifeandpastandhisoharacter.TomymindHitlerwasrighttobeaGerman1ovinghiscountry"Nodoubt,thatevenmoresuchorsimilarutterancesofstatesmenareknown.Atthattimemyhusbandcouldnotforeseethatbyhisjoininghewouldpromoteorsupportacriminalaffair.In1937hewasbusyasanassistantfortheknow-ledgeofkinsattheAnthropologicInstituteoftheUnivezaityofVienna.InSept.1937hepassedtothegeneralSS,becausehecouldbebusyasanivestigatorofkins.WaenAustriauasannexed,hecouldjointheGermanPolice.Afberyearsoftroublesanddistressnowhegotasafepoşitionasanofficial.Whenhewascalledouttothefrontier-guard(controlofpassports)onApril1st,1938hismembershiptothegeneralSSwasextinguished.HislatertransfertotheSDandtotheWafen-SS"wasnotvoluntary.DhusmyhusbanddoesnotbelongtotheciroleofthosemembersoftheSSwhomustbecosideredasCriminalsaccordingtothejudgementsofuremberg,becauseonlythosecounttothemwhoweremembersofthe3SstillfterSept.1st,1939.ThelatercompulsoryassimilationofranksintheSDandthe"Waffen-s"isotconsideredasamembershipothe3Saspertherulingpracticeofall"SpruchkammerInthecourseofageneraltraining-planinin1944myhusbandcametotheKRIPOforthreemonthstobeemployedthereforinformetionpurposes.ThenBourmonthsfollowedat.theSIAPOtobetrained1ateroninother1inesotheGeImanPolice.AstherewasalackofmenattheSTAPO,theycausedthepro-longationofhiscommendandinFebr.1945histransfertotheSTAPO.MyhusbandhasseveraltimestriedtoleavetheSTAFOandf1nallyappliedforbeingemployedasavoluateeratthef
ront.A1lhisapplicationswererefused.FurthertrialsWouldbeperhapspunishedasadenialofobedienceoradecompo-sitionof,themilitgry.ref.3)InFebr.andMarcha945asamemberoftheArmedForoesofthethenGermanymyhusbandshotdownanalliedterror-flyereachi.e.anenemyeirforce-manwhohadfiredabwomenandchildrenatBensheim/Germanyinalowflight,andthisonaccouatofadirectmilitaryandthereforebindingorderofhisdirectsuperior.Hewasorderedtodosobytheleaderofhisunit,SS-SourmbannführerandcouscillortothegovernmentGIRKEorbyhesdeputySS-sturmbannführerandcouncillortotheKRIPOHELLENBROICHresp.InFébr.1945Girkeaskedbyphonethecom-petentCommanderoftheSIPOSS-OberführerTRUMMLER,whethertheorderissuedfromBerlinbesti1lvalidbywhichterror-flyersweretobelki1led.TrummleransweredintheaffirmativeandP.t.o.</alto>

I've not worked with ALTO formats before, but I'm thinking it shouldn't look like this? Please let me know what you think, any help would be greatly appreciated!

jmechnich commented 5 years ago

I recently had a similar problem, which was caused by elements missing from the hOCR that the transformations expect. Unfortunately, an hOCR file that passes validation is not guaranteed to convert to ALTO successfully (assuming yours validated). You should check https://github.com/filak/hOCR-to-ALTO/blob/master/hocr2alto2.0.xsl. The missing page dimensions come from the format of ocr_page's title attribute: the transformation expects something like title="image &quot;wetzel_reisebegleiter_1901_0021_800px.jpg&quot;; bbox 0 0 800 1095; ppageno 0" (see line 150). I can't tell why you got the second file. Did you try converting to several ALTO versions, so that the second file is from another attempt? The first file looks OK in principle, but the transformation stopped at some point because it makes a few assumptions about the hOCR file format. The second file looks somewhat broken (note the None as the first string, before the actual OCR words). Maybe someone else can comment on it, as I have no experience with the GUI and it could be an artifact of using it.
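
To make the difference between the two title formats concrete, here is a minimal Python sketch that splits an hOCR title attribute into named properties and reads the bbox by name. It is not the code used in hocr2alto2.0.xsl; the function names parse_hocr_title and page_bbox are made up for illustration, and splitting on ";" assumes that no property value (such as the image file name) contains a semicolon.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into a dict of property name -> value string."""
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        # The first token is the property name, the rest is its value.
        name, _, value = part.partition(" ")
        props[name] = value.strip()
    return props


def page_bbox(title):
    """Return the bbox as a tuple of four ints, or None if the title has no bbox."""
    props = parse_hocr_title(title)
    if "bbox" not in props:
        return None
    return tuple(int(v) for v in props["bbox"].split())


# The short title from the hOCR in the first post.
short_title = "bbox 0 0 1420 2068"
# The longer form that the transformation is said to expect.
long_title = 'image "wetzel_reisebegleiter_1901_0021_800px.jpg"; bbox 0 0 800 1095; ppageno 0'

print(page_bbox(short_title))  # (0, 0, 1420, 2068)
print(page_bbox(long_title))   # (0, 0, 800, 1095)
```

Looking the property up by name works for both titles, whereas code that relies on bbox being at a fixed position in the title only works for the longer one.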

asor12 commented 4 years ago

Ok, thanks much, that's very helpful - I'll go check out the title attribute then. To make sure I understand, does that mean the transformation expects every "title" as in <div class='ocr_page' lang='unknown' title='bbox 0 0 1420 2068'> to be in that longer format you described? I'm pretty new to XSL stylesheets, how would you recommend I fix this?

Yes, for some reason the GUI gives me multiple files each time I click download - from two to six files, some empty.

Thanks again!

zuphilip commented 4 years ago

Most of the content is missing in your ALTO files. It looks as if you are using a version of the transformation that still expects an ocr_par after an ocr_carea and before an ocr_line. Your hOCR example has no ocr_par, so processing stopped there. However, I extended this in the upstream repo https://github.com/filak/hOCR-to-ALTO/commit/30d286d0c7980af70081263c291c5f1a733aeb6c#diff-ef44efbc0c65a8ccb92c4768b944d6e5 . Possibly it has not been updated here yet.
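
A quick way to see whether an hOCR file has this structure is to list the ocr_line elements that are not nested inside an ocr_par. The following Python sketch uses only the standard library; the function name lines_without_par is hypothetical, and the sample is a shortened copy of the hOCR from the first post.

```python
import xml.etree.ElementTree as ET


def lines_without_par(hocr_text):
    """Return the ids of ocr_line elements that have no ocr_par ancestor."""
    root = ET.fromstring(hocr_text)
    missing = []

    def walk(elem, inside_par):
        classes = elem.get("class", "").split()
        if "ocr_par" in classes:
            inside_par = True
        if "ocr_line" in classes and not inside_par:
            missing.append(elem.get("id"))
        for child in elem:
            walk(child, inside_par)

    walk(root, False)
    return missing


sample = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class='ocr_page' title='bbox 0 0 1420 2068'>
  <div class='ocr_carea' title='bbox 176 121 1420 2068'>
    <span class='ocr_line' id='line_0' title='bbox 678 121 747 168'>
      <span class='ocrx_word' id='word_0_0' title='bbox 678 121 747 168'>2T</span>
    </span>
  </div>
</div>
</body></html>"""

print(lines_without_par(sample))  # ['line_0']
```

An empty list means every line sits inside a paragraph. A non-empty list, as here, means a stylesheet version that requires ocr_par would likely skip those lines.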

jmechnich commented 4 years ago

Ah! I was wondering about that and completely agree with @zuphilip. That is most likely the reason why the transformation stopped.

With respect to the format of the title attribute: this only affects ocr_page. To extract the coordinates, the transformation expects several properties to be encoded in the title, with bbox being the second one. This is what the function mf:getBoxPage does.
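
If you want to adapt the hOCR rather than the stylesheet, one option is to rewrite the ocr_page title so that bbox comes second. This is only a sketch under that assumption: the helper extend_page_title and the file name page_0001.png are invented for illustration, and I have not checked it against mf:getBoxPage itself.

```python
def extend_page_title(title, image_name, page_number=0):
    """Return an ocr_page title with image first, bbox second and ppageno last."""
    props = [p.strip() for p in title.split(";") if p.strip()]
    # Reuse the bbox property that is already in the title.
    bbox = next((p for p in props if p.startswith("bbox ")), None)
    if bbox is None:
        raise ValueError("ocr_page title has no bbox property")
    return f'image "{image_name}"; {bbox}; ppageno {page_number}'


# page_0001.png is a placeholder, not a file from this issue.
print(extend_page_title("bbox 0 0 1420 2068", "page_0001.png"))
# image "page_0001.png"; bbox 0 0 1420 2068; ppageno 0
```

When the result is written back into an attribute delimited by double quotes, the inner quotes have to be escaped as &quot;, as in the example title above.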

asor12 commented 4 years ago

Ok, thanks much. The only xsl stylesheet in the ocr_fileformat tool I find is xslt/alto2.0_alto3.0xsl, which looks very different from yours. I suppose I'll replace this stylesheet with yours and run again, if that sounds like the way to go!

jmechnich commented 4 years ago

This particular transformation is for alto2.0 -> alto3.0. There should be corresponding files like hocr__altoX.xsl with X being 2.0, 2.1, ... What kind of setup are you using exactly (which OS, docker or manual installation)?

asor12 commented 4 years ago

Ah, I see in the Makefile that it was supposed to download all the xsl stylesheets. I am running on Mac OS, and I ran it as a docker using docker run --rm -it -p 8080:8080 ubma/ocr-fileformat. I suppose the docker does not run the Makefile and install the extra stylesheets? Let me know if I installed it wrong.

jmechnich commented 4 years ago

Ok, I just updated the Docker image manually as changes in the external dependencies do not trigger a rebuild. Please try again with the new one (you might have to explicitly delete the old image using docker image rm [IMAGE ID]).

asor12 commented 4 years ago

Ok, thanks. I deleted the old image/container and ran docker run --rm -it -p 8080:8080 ubma/ocr-fileformat again. I accessed the GUI via localhost:8080 and when I process the hocr file above, it gives me an empty xml file. Did I miss any steps?

jmechnich commented 4 years ago

Well, you found another bug. :) The patch mentioned above by @zuphilip lacks opening <xsl:choose> tags here and here. If you are in a hurry you can modify the file inside the docker container by hand (in /usr/local/share/ocr-fileformat/xslt/).

jmechnich commented 4 years ago

This is the resulting ALTO XML from the hOCR quoted in your first post, after the fix:

<?xml version="1.0" encoding="utf-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto.xsd">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
    <sourceImageInformation>
      <fileName/>
    </sourceImageInformation>
    <OCRProcessing ID="IdOcr">
      <ocrProcessingStep>
        <processingSoftware>
          <softwareName>gcv2hocr.py</softwareName>
        </processingSoftware>
      </ocrProcessingStep>
    </OCRProcessing>
  </Description>
  <Layout>
    <Page ID="" PHYSICAL_IMG_NR="1" HEIGHT="" WIDTH="">
      <PrintSpace HEIGHT="" WIDTH="" VPOS="0" HPOS="0">
        <ComposedBlock ID="" HEIGHT="1947" WIDTH="1244" VPOS="121" HPOS="176">
          <TextLine ID="line_0" HEIGHT="47" WIDTH="69" VPOS="121" HPOS="678">
            <String ID="word_0_0" CONTENT="2T" HEIGHT="47" WIDTH="69" VPOS="121" HPOS="678" WC="0"/>
          </TextLine>
          <TextLine ID="line_1" HEIGHT="34" WIDTH="189" VPOS="184" HPOS="383">
            <String ID="word_1_0" CONTENT="Especially" HEIGHT="34" WIDTH="189" VPOS="184" HPOS="383" WC="0"/>
          </TextLine>
          <TextLine ID="line_2" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="583">
            <String ID="word_2_0" CONTENT="during" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="583" WC="0"/>
          </TextLine>
          <TextLine ID="line_3" HEIGHT="27" WIDTH="53" VPOS="188" HPOS="722">
            <String ID="word_3_0" CONTENT="the" HEIGHT="27" WIDTH="53" VPOS="188" HPOS="722" WC="0"/>
          </TextLine>
          <TextLine ID="line_4" HEIGHT="32" WIDTH="92" VPOS="186" HPOS="796">
            <String ID="word_4_0" CONTENT="years" HEIGHT="32" WIDTH="92" VPOS="186" HPOS="796" WC="0"/>
          </TextLine>
          <TextLine ID="line_5" HEIGHT="34" WIDTH="73" VPOS="184" HPOS="904">
            <String ID="word_5_0" CONTENT="1933" HEIGHT="34" WIDTH="73" VPOS="184" HPOS="904" WC="0"/>
          </TextLine>
          <TextLine ID="line_6" HEIGHT="31" WIDTH="70" VPOS="187" HPOS="1040">
            <String ID="word_6_0" CONTENT="1938" HEIGHT="31" WIDTH="70" VPOS="187" HPOS="1040" WC="0"/>
          </TextLine>
        </ComposedBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
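
The position attributes in this output follow directly from the hOCR bbox values: bbox holds the left, top, right and bottom pixel coordinates, while ALTO stores the left and top position plus width and height. A small Python sketch of that arithmetic (the function name bbox_to_alto is made up here; the stylesheet does the same calculation in XSLT):

```python
def bbox_to_alto(title):
    """Convert 'bbox x0 y0 x1 y1' from an hOCR title into ALTO position attributes."""
    for part in title.split(";"):
        fields = part.split()
        if fields and fields[0] == "bbox":
            x0, y0, x1, y1 = (int(v) for v in fields[1:5])
            return {"HPOS": x0, "VPOS": y0, "WIDTH": x1 - x0, "HEIGHT": y1 - y0}
    raise ValueError("no bbox property in title")


# Title of line_0 in the hOCR from the first post.
print(bbox_to_alto("bbox 678 121 747 168; baseline 0 -5"))
# {'HPOS': 678, 'VPOS': 121, 'WIDTH': 69, 'HEIGHT': 47}
```

These are the values of the TextLine with ID="line_0" above, and the ocr_carea bbox 176 121 1420 2068 gives WIDTH="1244" and HEIGHT="1947" in the same way.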

zuphilip commented 4 years ago

PR for fixing the bug :bug: upstream is on the way: https://github.com/filak/hOCR-to-ALTO/pull/14

zuphilip commented 4 years ago

@asor12 If you are only interested in a single file transformation you can also use an online XSLT tool like http://xslttest.appspot.com/ and copy your hocr file in the first box and the temporary url https://raw.githubusercontent.com/zuphilip/hOCR-to-ALTO/patch-1/hocr2alto2.0.xsl in the second box, then press run transformation.

asor12 commented 4 years ago

ok thanks much to both you and Jorg! I'll use http://xslttest.appspot.com/ as a quick fix for now. Quick note, I'm transforming from hocr to alto (I think that xsl is for alto to hocr?). I'll wait for the docker image update to process more of our files.

zuphilip commented 4 years ago

(Yes, I updated the link above.)

jmechnich commented 4 years ago

@asor12, it seems like the bug is fixed and the Docker image updated in case you want to try again.

asor12 commented 4 years ago

Hi @jmechnich and @zuphilip thanks for updating! I was traveling last week and just got to test now. Here's a sample output ALTO code - does this look right to you? Does it need to be formatted some way, e.g. with newlines instead of one long string, or is this correct? (I haven't worked with ALTO files before, thanks for any guidance)

<?xml version="1.0" encoding="utf-8"?><alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/v2/alto-2-0.xsd"><Description><MeasurementUnit>pixel</MeasurementUnit><sourceImageInformation><fileName>0</fileName></sourceImageInformation><OCRProcessing ID="IdOcr"><ocrProcessingStep><processingSoftware><softwareName>gcv2hocr.py</softwareName></processingSoftware></ocrProcessingStep></OCRProcessing></Description><Layout><Page ID="" PHYSICAL_IMG_NR="1" HEIGHT="" WIDTH=""><PrintSpace HEIGHT="" WIDTH="" VPOS="0" HPOS="0"><ComposedBlock ID="" HEIGHT="1947" WIDTH="1244" VPOS="121" HPOS="176"><TextLine ID="line_0" HEIGHT="47" WIDTH="69" VPOS="121" HPOS="678"><String ID="word_0_0" CONTENT="2T" HEIGHT="47" WIDTH="69" VPOS="121" HPOS="678"/></TextLine><TextLine ID="line_1" HEIGHT="34" WIDTH="189" VPOS="184" HPOS="383"><String ID="word_1_0" CONTENT="Especially" HEIGHT="34" WIDTH="189" VPOS="184" HPOS="383"/></TextLine><TextLine ID="line_2" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="583"><String ID="word_2_0" CONTENT="during" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="583"/></TextLine><TextLine ID="line_3" HEIGHT="27" WIDTH="53" VPOS="188" HPOS="722"><String ID="word_3_0" CONTENT="the" HEIGHT="27" WIDTH="53" VPOS="188" HPOS="722"/></TextLine><TextLine ID="line_4" HEIGHT="32" WIDTH="92" VPOS="186" HPOS="796"><String ID="word_4_0" CONTENT="years" HEIGHT="32" WIDTH="92" VPOS="186" HPOS="796"/></TextLine><TextLine ID="line_5" HEIGHT="34" WIDTH="73" VPOS="184" HPOS="904"><String ID="word_5_0" CONTENT="1933" HEIGHT="34" WIDTH="73" VPOS="184" HPOS="904"/></TextLine><TextLine ID="line_6" HEIGHT="31" WIDTH="70" VPOS="187" HPOS="1040"><String ID="word_6_0" CONTENT="1938" HEIGHT="31" WIDTH="70" VPOS="187" HPOS="1040"/></TextLine><TextLine ID="line_7" HEIGHT="34" WIDTH="49" VPOS="184" HPOS="1132"><String 
ID="word_7_0" CONTENT="the" HEIGHT="34" WIDTH="49" VPOS="184" HPOS="1132"/></TextLine><TextLine ID="line_8" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="1202"><String ID="word_8_0" CONTENT="German" HEIGHT="34" WIDTH="114" VPOS="184" HPOS="1202"/></TextLine><TextLine ID="line_9" HEIGHT="34" WIDTH="61" VPOS="184" HPOS="1330"><String ID="word_9_0" CONTENT="un-" HEIGHT="34" WIDTH="61" VPOS="184" HPOS="1330"/></TextLine><TextLine ID="line_10" HEIGHT="35" WIDTH="187" VPOS="215" HPOS="196"><String ID="word_10_0" CONTENT="employment" HEIGHT="35" WIDTH="187" VPOS="215" HPOS="196"/></TextLine><TextLine ID="line_11" HEIGHT="33" WIDTH="69" VPOS="215" HPOS="394"><String ID="word_11_0" CONTENT="was" HEIGHT="33" WIDTH="69" VPOS="215" HPOS="394"/></TextLine><TextLine ID="line_12" HEIGHT="35" WIDTH="97" VPOS="215" HPOS="475"><String ID="word_12_0" CONTENT="fully" HEIGHT="35" WIDTH="97" VPOS="215" HPOS="475"/></TextLine><TextLine ID="line_13" HEIGHT="35" WIDTH="144" VPOS="215" HPOS="591"><String ID="word_13_0" CONTENT="removed." HEIGHT="35" WIDTH="144" VPOS="215" HPOS="591"/></TextLine><TextLine ID="line_14" HEIGHT="33" WIDTH="84" VPOS="215" HPOS="746"><String ID="word_14_0" CONTENT="Like" HEIGHT="33" WIDTH="84" VPOS="215" HPOS="746"/></TextLine><TextLine ID="line_15" HEIGHT="35" WIDTH="81" VPOS="215" HPOS="845"><String ID="word_15_0" CONTENT="many" HEIGHT="35" WIDTH="81" VPOS="215" HPOS="845"/></TextLine><TextLine ID="line_16" HEIGHT="35" WIDTH="111" VPOS="215" HPOS="947"><String ID="word_16_0" CONTENT="others" HEIGHT="35" WIDTH="111" VPOS="215" 

zuphilip commented 4 years ago

Did you delete the last part or is this really all?

First of all, it should be a valid XML document, but what you posted is incomplete (not all tags are closed, etc.). The indentation does not matter, although an XML document may be easier to read when it is "pretty printed".
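
Whether all tags are closed can be checked locally with any XML parser. A minimal sketch with Python's standard library follows; the helper name is_well_formed is hypothetical and the two strings are shortened stand-ins for a complete and a truncated ALTO file. Note that this only tests well-formedness, not validity against the ALTO schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, '') if xml_text parses, otherwise (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


complete = '<alto><Layout><Page ID=""/></Layout></alto>'
truncated = '<alto><Layout><Page ID=""><TextLine ID="line_16" HEIGHT="35" '

print(is_well_formed(complete)[0])   # True
print(is_well_formed(truncated)[0])  # False
```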

asor12 commented 4 years ago

Oh I truncated it as the file is really long. Yes, all of the tags should be closed. How would one "pretty print" the XML?

Thanks!

zuphilip commented 4 years ago

Okay. You can, for example, google "pretty print xml" and choose an online tool for it. In general, the contents of your file look fine. You can also validate the ALTO XML with our tool here.
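
If you would rather not paste files into an online tool, pretty printing also works offline. One possible sketch uses minidom from the Python standard library; the function name pretty_print and the shortened ALTO snippet are only for illustration.

```python
from xml.dom import minidom


def pretty_print(xml_text):
    """Return xml_text re-serialized with one element per line and 2-space indents."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")


compact = (
    '<alto><Layout><Page ID="" PHYSICAL_IMG_NR="1">'
    '<TextLine ID="line_0"><String ID="word_0_0" CONTENT="2T"/></TextLine>'
    '</Page></Layout></alto>'
)

print(pretty_print(compact))
```

This only changes whitespace between elements, so the pretty printed file carries the same ALTO content as the single-line one.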

zuphilip commented 4 years ago

Closing this issue because of inactivity. If the problem remains, then feel free to reopen it.