PRImA-Research-Lab / prima-page-converter

Command-line tool to convert page layout files to the latest PAGE XML format. It supports all previous versions of the PAGE format as well as ALTO XML, FineReader XML, and hOCR.
Apache License 2.0

hOCR doc not properly converted when lacking certain typesettings #18

Open sven-nm opened 3 years ago

sven-nm commented 3 years ago

Hi all,

The HTML below is the beginning of an hOCR file. It has been validated with hocr-spec.

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:html="http://www.w3.org/1999/xhtml">
<head>
    <meta name="DCTERMS.contributor" content="Bruce Robertson (OCR processing)"/>
    <meta name="DCTERMS.description"
          content="OCR output of page images processed through the ciaconna OCR system, which in turn is based on OCRopus."/>
    <meta name='ocr-capabilities' content='ocr_page ocr_line ocrx_word'/>
    <meta name='ocr-system' content='custom'/>

</head>
<body>
<div class="ocr_page"
     title="image sophoclesplaysa05campgoog_0012.png; image sophoclesplaysa05campgoog_0012.png; bbox 0 0 3030 5024"
     style="writing-mode: horizontal-tb;">
            <span class="ocr_line" id="line_0"
                  title="bbox 312 1109 923 1169; x_bboxes 313 1109 313 1169 313 1169 313 1109 361 1109 361 1169 361 1169 361 1109 409 1109 409 1169 409 1169 409 1109 457 1109 457 1169 457 1169 457 1109 505 1109 505 1169 505 1169 505 1109 553 1109 553 1169 553 1169 553 1109 585 1109 585 1169 585 1169 585 1109 634 1109 634 1169 634 1169 634 1109 682 1109 682 1169 682 1169 682 1109 698 1109 698 1169 698 1169 698 1109 746 1109 746 1169 746 1169 746 1109 794 1109 794 1169 794 1169 794 1109 858 1109 858 1169 906 1169 906 1109 922 1109 922 1169 922 1169 922 1109">
                    <span class="ocrx_word" id="segment_0" title="bbox 313 1109 799 1169" data-min-confidence="0.76"
                          data-average-confidence="0.96" data-manually-confirmed="false" data-spellcheck-mode="None"
                          data-selected-form="INTRODCCTIO&#x39D;">INTRODCCTIOΝ</span>
                    <span class="ocrx_word" id="segment_2" title="bbox 915 1109 922 1169" data-min-confidence="1.0"
                          data-average-confidence="1.0" data-manually-confirmed="false" data-spellcheck-mode="Numerical"
                          data-selected-form=".">.</span>
        </span>

In this state, however, prima-page-converter fails to output any text line: the result contains nothing below the metadata.

The problem can be worked around by wrapping the lines in a single page-wide ocr_par, itself nested in a page-wide ocr_carea inside the ocr_page element, like so:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:html="http://www.w3.org/1999/xhtml">
<head>
    <meta name="DCTERMS.contributor" content="Bruce Robertson (OCR processing)"/>
    <meta name="DCTERMS.description"
          content="OCR output of page images processed through the ciaconna OCR system, which in turn is based on OCRopus."/>
    <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
    <meta name='ocr-system' content='custom'/>

</head>
<body>
<div class="ocr_page"
     title="image sophoclesplaysa05campgoog_0012.png; image sophoclesplaysa05campgoog_0012.png; bbox 0 0 3030 5024"
     style="writing-mode: horizontal-tb;">
    <div class='ocr_carea' id='block_1_1' title="bbox 0 0 3030 5024">
        <p class='ocr_par' id='par_1_1' title="bbox 0 0 3030 5024">
            <span class="ocr_line" id="line_0"
                  title="bbox 312 1109 923 1169; x_bboxes 313 1109 313 1169 313 1169 313 1109 361 1109 361 1169 361 1169 361 1109 409 1109 409 1169 409 1169 409 1109 457 1109 457 1169 457 1169 457 1109 505 1109 505 1169 505 1169 505 1109 553 1109 553 1169 553 1169 553 1109 585 1109 585 1169 585 1169 585 1109 634 1109 634 1169 634 1169 634 1109 682 1109 682 1169 682 1169 682 1109 698 1109 698 1169 698 1169 698 1109 746 1109 746 1169 746 1169 746 1109 794 1109 794 1169 794 1169 794 1109 858 1109 858 1169 906 1169 906 1109 922 1109 922 1169 922 1169 922 1109">
                    <span class="ocrx_word" id="segment_0" title="bbox 313 1109 799 1169" data-min-confidence="0.76"
                          data-average-confidence="0.96" data-manually-confirmed="false" data-spellcheck-mode="None"
                          data-selected-form="INTRODCCTIO&#x39D;">INTRODCCTIOΝ</span>
                    <span class="ocrx_word" id="segment_2" title="bbox 915 1109 922 1169" data-min-confidence="1.0"
                          data-average-confidence="1.0" data-manually-confirmed="false" data-spellcheck-mode="Numerical"
                          data-selected-form=".">.</span>
        </span>

In that case, conversion to PAGE XML works fine. Is this expected behavior?
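For anyone hitting the same problem, the manual workaround above could be scripted. The following is only a minimal sketch using Python's standard library, not part of prima-page-converter. The function name `wrap_bare_lines` and the helpers are hypothetical, the generated ids simply mimic the `block_1_1` / `par_1_1` pattern used above, and it assumes the hOCR file is well-formed XML with no named HTML entities.

```python
import xml.etree.ElementTree as ET


def _classes(elem):
    """Return the list of CSS classes on an element."""
    return elem.get("class", "").split()


def _bbox(elem):
    """Return the 'bbox x0 y0 x1 y1' property of an hOCR title, or None."""
    for prop in elem.get("title", "").split(";"):
        prop = prop.strip()
        if prop.startswith("bbox "):
            return prop
    return None


def wrap_bare_lines(root):
    """Wrap ocr_line elements that sit directly under an ocr_page in one
    page-wide ocr_carea / ocr_par pair. Returns the number of pages changed.

    Hypothetical helper: it reproduces the manual workaround from this issue.
    """
    # Collect pages first so the tree is not modified while iterating over it.
    pages = [e for e in root.iter() if "ocr_page" in _classes(e)]
    changed = 0
    for number, page in enumerate(pages, start=1):
        lines = [c for c in list(page) if "ocr_line" in _classes(c)]
        if not lines:
            continue  # nothing to wrap (or the page is already structured)

        # Reuse the page's XML namespace, if it has one.
        ns = page.tag[: page.tag.index("}") + 1] if page.tag.startswith("{") else ""
        carea = ET.Element(ns + "div", {"class": "ocr_carea", "id": f"block_{number}_1"})
        par = ET.SubElement(carea, ns + "p", {"class": "ocr_par", "id": f"par_{number}_1"})

        # Give both wrappers the page's bounding box, as in the example above.
        bbox = _bbox(page)
        if bbox:
            carea.set("title", bbox)
            par.set("title", bbox)

        position = list(page).index(lines[0])
        for line in lines:
            page.remove(line)
            par.append(line)
        page.insert(position, carea)
        changed += 1
    return changed
```

Typical use would be `tree = ET.parse("page.hocr")`, then `wrap_bare_lines(tree.getroot())`, then `tree.write("page_fixed.hocr", encoding="utf-8", xml_declaration=True)`, where the file names are placeholders. Note that ElementTree does not write the DOCTYPE back out, and this sketch leaves the `ocr-capabilities` meta tag unchanged, so both may need to be restored or updated by hand.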