Closed zuphilip closed 8 years ago
gcv2hocr requires the image size to be specified as arguments.
hocr-pdf in hocr-tools does not need the image size to be specified; it reads the size from the image itself.
Since the image size is not needed to make a searchable PDF with hocr-pdf, I think the image size arguments can be removed from gcv2hocr.
You don't have to delete the parameters completely, but maybe just make them optional, i.e.
(1) It is possible to indicate the width and height, and then they will be written in the bbox of ocr_page:
gcv2hocr test.jpg.json output.hocr 1280 960
(2) It is also possible to skip these two parameters, in which case the corresponding bbox will just be left empty:
gcv2hocr test.jpg.json output.hocr
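The two call forms above could be handled roughly as follows. This is only a minimal Python sketch, not the actual gcv2hocr code; the function names `build_parser` and `page_title` are hypothetical, and it assumes that the `bbox` is simply left out of the `ocr_page` title when no size is given.

```python
import argparse


def build_parser():
    """Command line with optional trailing width and height (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="gcv2hocr")
    parser.add_argument("json_file", help="Google Cloud Vision JSON response")
    parser.add_argument("hocr_file", help="hOCR file to write")
    # nargs="?" makes a positional argument optional; it stays None when omitted.
    parser.add_argument("width", nargs="?", type=int, default=None)
    parser.add_argument("height", nargs="?", type=int, default=None)
    return parser


def page_title(image_name, width=None, height=None):
    """Build the title attribute of ocr_page, adding bbox only if the size is known."""
    parts = ['image "%s"' % image_name]
    if width is not None and height is not None:
        parts.append("bbox 0 0 %d %d" % (width, height))
    parts.append("ppageno 0")
    return "; ".join(parts)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(page_title("test.jpg", args.width, args.height))
```

Both `gcv2hocr test.jpg.json output.hocr 1280 960` and `gcv2hocr test.jpg.json output.hocr` parse with this layout; only the first one produces a `bbox` in the page title.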
bbox is not required, but it should be given for ocr_page.
I think we can use the maximum x and y of the boundingPoly vertices as the bbox of the page. This is not completely correct, since the right/bottom margins are missing, but for rendering the text it's still better than guessing.
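As a rough illustration of this idea, here is a minimal Python sketch. It assumes the annotations are given as a list of dicts that each carry `boundingPoly` with a `vertices` list of `x`/`y` values, and that a missing coordinate means 0; the function name `page_bbox_from_annotations` is made up for this example.

```python
def page_bbox_from_annotations(annotations):
    """Estimate the page bbox as (0, 0, max x, max y) over all boundingPoly vertices.

    The result underestimates the real page size, because the right and
    bottom margins contain no text and therefore no vertices.
    """
    max_x = 0
    max_y = 0
    for annotation in annotations:
        vertices = annotation.get("boundingPoly", {}).get("vertices", [])
        for vertex in vertices:
            # Assumption: a coordinate that is absent from the JSON counts as 0.
            max_x = max(max_x, vertex.get("x", 0))
            max_y = max(max_y, vertex.get("y", 0))
    return (0, 0, max_x, max_y)


if __name__ == "__main__":
    sample = [
        {"description": "Optical",
         "boundingPoly": {"vertices": [{"x": 80, "y": 109}, {"x": 184, "y": 109},
                                       {"x": 184, "y": 131}, {"x": 80, "y": 131}]}},
    ]
    print("bbox %d %d %d %d" % page_bbox_from_annotations(sample))
```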
In gcv2hocr, the frame of the text is obtained from the JSON file (I misspelled it as "flame" in the variable name).
It may be possible to add the text frame value as the boundingPoly.
In gcv2hocr, bbox is generated from the coordinates of the recognized words.
It would be better to generate the bbox from each line of text instead of from each word.
That is tough for me, and hocr-pdf can already generate a searchable PDF, so I will not touch this issue.
bbox is not required, but it should be given for ocr_page.
@kba What is the reason for that? I would suggest using ocr_carea for the content area, which is also given by GCV OCR. Something like
- <div class='ocr_page' id='page_1' title='image "test.jpg"; bbox 0 0 826 1169; ppageno 0'>
- <div class='ocr_carea' id='block_1_1' title="bbox 80 109 184 131 ">
+ <div class='ocr_page' id='page_1' title='image "test.jpg"; ppageno 0'>
+ <div class='ocr_carea' id='block_1_1' title="bbox 0 0 826 1169">
...
<span class='ocrx_word' id='word_1_1' title='bbox 80 109 184 131 ; x_wconf 85' lang='eng' dir='ltr'> Optical </span>
...
</div>
</div>
For hocr-pdf it is only important to have the ocr_line and ocrx_word elements with their attributes.
BTW, currently you have an ocr_carea around every word, which might not be optimal. Therefore I suggest using only one ocr_carea around the whole content.
bbox is not required, but it should be given for ocr_page. @kba What is the reason for that?
It's in the spec. carea fits nicely here but has the disadvantage that there can be many of them, so it's not exactly equivalent to "print space". In the end, it's lots of guesswork, since GCV does not give enough information and hocr is ambiguous.
<div class='ocr_page' lang='unknown' title='bbox 0 0 748 1016'>
<div class='ocr_carea' lang='unknown' title='bbox 78 103 748 1016'>
<span class='ocr_line' id='line_0' title='bbox 80 109 184 131; baseline 0 -5'>
<span class='ocrx_word' id='word_0_0' title='bbox 80 109 184 131'>Optical</span>
</span>
This is how it looks now. There's some heuristic involved in guessing the lines (maximum of word x1 / y1, minimum of word x0 / y0) plus some tolerance for slightly offset baselines (--baseline-tolerance).
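To make the line-guessing heuristic more concrete, here is a minimal Python sketch. It is not the actual implementation: the grouping rule (a word joins the current line when its bottom edge is within the tolerance of the bottom edge of the line's first word) is an assumption, as are the function name `group_words_into_lines` and the word dict layout. Only the bbox rule (minimum of x0 / y0, maximum of x1 / y1) comes from the description above.

```python
def group_words_into_lines(words, baseline_tolerance=2):
    """Group words (in reading order) into lines and compute each line's bbox.

    Each word is a dict with "text" and "bbox" = (x0, y0, x1, y1).
    """
    lines = []
    for word in words:
        x0, y0, x1, y1 = word["bbox"]
        # Assumption: the bottom edge of the first word stands in for the baseline.
        if lines and abs(lines[-1]["baseline"] - y1) <= baseline_tolerance:
            line = lines[-1]
            lx0, ly0, lx1, ly1 = line["bbox"]
            # Line bbox: minimum of word x0 / y0, maximum of word x1 / y1.
            line["bbox"] = (min(lx0, x0), min(ly0, y0), max(lx1, x1), max(ly1, y1))
            line["words"].append(word)
        else:
            lines.append({"bbox": (x0, y0, x1, y1), "baseline": y1, "words": [word]})
    return lines


if __name__ == "__main__":
    words = [
        {"text": "Optical", "bbox": (80, 109, 184, 131)},
        {"text": "Character", "bbox": (190, 110, 330, 132)},
        {"text": "Recognition", "bbox": (80, 150, 260, 172)},
    ]
    for line in group_words_into_lines(words):
        print(line["bbox"], [w["text"] for w in line["words"]])
```

With a tolerance of 0, the slightly offset second word in the sample would start a line of its own, which is the case the tolerance is meant to smooth over.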
I see. The specs are hard to understand, and in most cases it is not clear to me whether some things are possible or necessary. AFAIK Ocropus does not provide the bbox for ocr_page, see https://github.com/tmbdev/ocropy/blob/master/ocropus-hocr#L66 .
I think the hocr-pdf equivalent in OCRmyPDF does expect bbox for ocr_page.
I have committed the code to make the height and width parameters optional. https://github.com/dinosauria123/gcv2hocr/commit/d2ec0c2b7e97671feb1244a441c5b954c09054d7
Thank you!
The height and width parameters are at the moment needed to call the transformation. However, they are not part of the output of the Google Cloud Vision OCR. Therefore I suggest making these two parameters optional. I think the only change is that you cannot write the bbox in the ocr_page element, but this is AFAIK not required by the hocr specs. What do you think? / CC @kba