dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.

Make height and width parameter optional #2

Closed zuphilip closed 7 years ago

zuphilip commented 7 years ago

The height and width parameters are currently required to call the transformation. However, they are not part of the output of the Google Cloud Vision OCR. Therefore I suggest making these two parameters optional. I think the only change is that you cannot write the bbox in the ocr_page element, but AFAIK this is not required by the hocr specs. What do you think? / CC @kba

dinosauria123 commented 7 years ago

gcv2hocr takes the image size as arguments.

hocr-pdf in hocr-tools does not need the image size to be specified; it reads the size from the image itself.

So the image size is not needed to make a searchable pdf with hocr-pdf. I therefore think the image size arguments could be removed from gcv2hocr.

zuphilip commented 7 years ago

You don't have to delete the parameters completely, but maybe just make them optional, i.e.

(1) It is possible to give the width and height, and they will then be written in the bbox of ocr_page:

gcv2hocr test.jpg.json output.hocr 1280 960

(2) It is also possible to skip these two parameters, in which case the corresponding bbox is simply left empty:

gcv2hocr test.jpg.json output.hocr
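
The two call forms above could be handled roughly as follows. This is a minimal Python sketch, not gcv2hocr's actual code: the helper names `build_parser` and `page_title` are hypothetical, and the sketch assumes that a missing size means the bbox is left out of the ocr_page title altogether.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the two call forms discussed above.
    parser = argparse.ArgumentParser(prog="gcv2hocr")
    parser.add_argument("json_file")
    parser.add_argument("hocr_file")
    # nargs="?" makes a positional argument optional; default is None.
    parser.add_argument("width", nargs="?", type=int, default=None)
    parser.add_argument("height", nargs="?", type=int, default=None)
    return parser


def page_title(image_name, width, height):
    # Build the title attribute for ocr_page; the bbox is only
    # written when both dimensions were supplied (an assumption).
    parts = ['image "%s"' % image_name]
    if width is not None and height is not None:
        parts.append("bbox 0 0 %d %d" % (width, height))
    parts.append("ppageno 0")
    return "; ".join(parts)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(page_title("test.jpg", args.width, args.height))
```

Giving only a width without a height is treated here the same as giving neither.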

kba commented 7 years ago

bbox is not required, but it should be given for ocr_page.

I think we can use the maximum x and y of the boundingPoly values as the bbox of the page. This is not completely correct, since the right/bottom margins are missing, but for rendering the text it's still better than guessing.
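
A rough Python sketch of that idea, assuming the annotations are a list of dicts shaped like GCV's `textAnnotations` entries (each with `boundingPoly.vertices` holding `x`/`y` values). The function name `page_bbox_from_annotations` is hypothetical, and missing coordinates are assumed to mean 0.

```python
def page_bbox_from_annotations(annotations):
    # Approximate the page bbox as (0, 0, max x, max y) over all vertices.
    # The right/bottom margins are not recoverable this way.
    max_x = 0
    max_y = 0
    for annotation in annotations:
        vertices = annotation.get("boundingPoly", {}).get("vertices", [])
        for vertex in vertices:
            # Treat a missing coordinate as 0 (an assumption).
            max_x = max(max_x, vertex.get("x", 0))
            max_y = max(max_y, vertex.get("y", 0))
    return (0, 0, max_x, max_y)


# Made-up sample data for illustration only.
sample = [
    {"boundingPoly": {"vertices": [
        {"x": 80, "y": 109}, {"x": 184, "y": 109},
        {"x": 184, "y": 131}, {"x": 80, "y": 131}]}},
    {"boundingPoly": {"vertices": [
        {"x": 700, "y": 990}, {"x": 748, "y": 990},
        {"x": 748, "y": 1016}, {"x": 700, "y": 1016}]}},
]
print("bbox %d %d %d %d" % page_bbox_from_annotations(sample))
```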

dinosauria123 commented 7 years ago

In gcv2hocr, the frame of the text is obtained from the json file (I misspelled it as "flame" in the variable name).

It may be possible to add the text frame value from boundingPoly.

dinosauria123 commented 7 years ago

In gcv2hocr, each bbox is generated from the coordinates of a recognized word.

It would be better to generate a bbox for each line of text instead of for each word.

That is difficult for me, and hocr-pdf can already generate a searchable pdf, so I will not work on this issue.

zuphilip commented 7 years ago

> bbox is not required, but it should be given for ocr_page.

@kba What is the reason for that? I would suggest using ocr_carea for the content area, which is also given by GCV OCR. Something like

-  <div class='ocr_page' id='page_1' title='image "test.jpg"; bbox 0 0 826 1169; ppageno 0'>
-    <div class='ocr_carea' id='block_1_1' title="bbox 80 109 184 131 ">
+  <div class='ocr_page' id='page_1' title='image "test.jpg"; ppageno 0'>
+    <div class='ocr_carea' id='block_1_1' title="bbox 0 0 826 1169">
   ...
      <span class='ocrx_word' id='word_1_1' title='bbox 80 109 184 131 ; x_wconf 85' lang='eng' dir='ltr'>  Optical  </span> 
   ...
   </div>
</div>

For hocr-pdf it is only important to have the ocr_line and ocrx_words with their attributes.

BTW currently you have an ocr_carea around every word, which might not be optimal. Therefore I suggest using only one ocr_carea around the whole content.

kba commented 7 years ago

> > bbox is not required, but it should be given for ocr_page.
>
> @kba What is the reason for that?

It's in the spec. carea fits nicely here but has the disadvantage that there can be many of them, so it's not exactly equivalent to "print space". In the end, it's lots of guesswork, since GCV does not give enough information and hocr is ambiguous.

<div class='ocr_page' lang='unknown' title='bbox 0 0 748 1016'>
        <div class='ocr_carea' lang='unknown' title='bbox 78 103 748 1016'>
            <span class='ocr_line' id='line_0' title='bbox 80 109 184 131; baseline 0 -5'>
                <span class='ocrx_word' id='word_0_0' title='bbox 80 109 184 131'>Optical</span>
            </span>

This is what it looks like now. There's some heuristic involved in guessing the lines (maximum of word x1 / y1, minimum of word x0 / y0) plus some tolerance for slightly offset baselines (--baseline-tolerance).
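
The line-guessing heuristic described above might look roughly like this in Python. It is only a sketch under assumptions: words are `(text, x0, y0, x1, y1)` tuples, the bottom edge `y1` stands in for the baseline, and `group_words_into_lines` and its `baseline_tolerance` parameter are hypothetical names, not the tool's actual implementation.

```python
def group_words_into_lines(words, baseline_tolerance=2):
    # Sort by bottom edge so words with similar baselines are adjacent.
    lines = []
    for word in sorted(words, key=lambda w: (w[4], w[1])):
        y1 = word[4]
        # Join the current line if the bottom edge is within tolerance
        # of the line's first word; otherwise start a new line.
        if lines and abs(lines[-1]["baseline"] - y1) <= baseline_tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"baseline": y1, "words": [word]})

    result = []
    for line in lines:
        # Restore reading order within the line (left to right).
        ordered = sorted(line["words"], key=lambda w: w[1])
        # Line bbox: minimum of word x0 / y0, maximum of word x1 / y1.
        bbox = (
            min(w[1] for w in ordered),
            min(w[2] for w in ordered),
            max(w[3] for w in ordered),
            max(w[4] for w in ordered),
        )
        result.append({"bbox": bbox, "words": [w[0] for w in ordered]})
    return result


# Made-up sample words for illustration only.
sample_words = [
    ("Character", 190, 110, 300, 132),
    ("Optical", 80, 109, 184, 131),
    ("Recognition", 80, 150, 220, 172),
]
for line in group_words_into_lines(sample_words):
    print(line["bbox"], " ".join(line["words"]))
```

Comparing against the first word of each line, rather than the previous word, keeps a slowly drifting run of words from being merged into one line indefinitely.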

zuphilip commented 7 years ago

I see. The specs are hard to understand, and in most cases it is not clear to me whether certain things are possible or required. AFAIK Ocropus does not provide the bbox for ocr_page, see https://github.com/tmbdev/ocropy/blob/master/ocropus-hocr#L66 .

kba commented 7 years ago

I think the hocr-pdf equivalent in OCRmyPDF does expect bbox for ocr_page.

dinosauria123 commented 7 years ago

I have committed the code to make the height and width parameters optional. https://github.com/dinosauria123/gcv2hocr/commit/d2ec0c2b7e97671feb1244a441c5b954c09054d7

zuphilip commented 7 years ago

Thank you!