empty words in hOCR output

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

Run tesseract 000000.tif 000000 -l pol+deu-frak hocr

on http://fleksem.klf.uw.edu.pl/~jsbien/tesseract_empty-words/000000.tif

What is the expected output? What do you see instead?

The output contains illegal word elements containing a space, e.g.

<span class='ocrx_word' id='word_818' title='bbox 104 5449 4664 5480; x_wconf 95' lang='pol' dir='ltr'><strong> </strong></span>

Sometimes the word elements are just empty (but not in this sample).
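
To check a whole hOCR file for such elements, a small script along these lines may help. It is only a sketch and not part of tesseract: the names EmptyWordFinder and find_empty_words are made up for illustration, and it assumes every word is a span whose class attribute contains ocrx_word, as in the sample above. It reports both the whitespace-only and the completely empty case.

```python
from html.parser import HTMLParser


class EmptyWordFinder(HTMLParser):
    """Collect ids of ocrx_word spans whose text is empty or whitespace-only.

    Hypothetical helper for illustration; not part of tesseract.
    """

    def __init__(self):
        super().__init__()
        self.empty_ids = []
        self._word_id = None  # id of the word span currently being read
        self._depth = 0       # span nesting depth inside that word
        self._text = []       # text fragments seen inside that word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._word_id is not None:
            # Already inside a word: only track nested spans.
            if tag == "span":
                self._depth += 1
        elif tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._word_id = attrs.get("id") or "(no id)"
            self._depth = 1
            self._text = []

    def handle_endtag(self, tag):
        if self._word_id is None or tag != "span":
            return
        self._depth -= 1
        if self._depth == 0:
            # The word span is closed: flag it if it holds no visible text.
            if not "".join(self._text).strip():
                self.empty_ids.append(self._word_id)
            self._word_id = None

    def handle_data(self, data):
        if self._word_id is not None:
            self._text.append(data)


def find_empty_words(hocr):
    """Return the ids of all empty or whitespace-only ocrx_word spans."""
    finder = EmptyWordFinder()
    finder.feed(hocr)
    finder.close()
    return finder.empty_ids
```

Calling find_empty_words() on the text of the generated hOCR file (presumably 000000.html for the command above) should list word_818 among the results.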

What version of the product are you using? On what operating system?

tesseract 3.02.02, Debian SID

Please provide any additional information below.

Original issue reported on code.google.com by jsb...@mimuw.edu.pl on 6 May 2013 at 6:30

GoogleCodeExporter commented 9 years ago

Thanks for the report. Fixed in r854.

Original comment by zde...@gmail.com on 23 Jun 2013 at 3:11

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I am re-opening this issue because the solution provided in r854 causes other
problems (see issue 946).

Original comment by zde...@gmail.com on 25 Jul 2013 at 3:54

Changed state: Accepted

dlareklami / tesseract-ocr

empty words in hOCR output #903