Open GoogleCodeExporter opened 9 years ago
I OCRed images containing both horizontal and vertical text (in the same
image). Tesseract recognized the horizontal and vertical text, but did not
write the corresponding "textangle" attributes to the generated hOCR file.
This attribute is defined in the hOCR specification and should be supported by
Tesseract.
Original comment by julien.p...@googlemail.com
on 25 Apr 2013 at 7:35
Can you please post an example image and the (desired) hOCR output?
Original comment by zde...@gmail.com
on 16 May 2013 at 2:04
Hi,
Here are the requested data:
- example image (I created a dummy image, as I do not want to disclose my
documents)
- current hOCR output (produced with Tesseract 3.02 running on FreeBSD)
- A PDF file showing the detected text as well as the respective bbox borders
for words and paragraphs
- the desired hocr output
Remark 1: The hOCR specification does not specify the allowed range of the text
angle.
I assumed that values between -180 and +180 degrees are allowed.
(Ideally, tools that take hOCR as input should be able to cope with angles in
any range...)
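
As a sketch of the assumption in Remark 1, a tool reading hOCR could normalize whatever angle it finds before using it. The function name `normalize_textangle` is hypothetical, and the target range (-180, 180] is my assumption, not something the hOCR specification prescribes.

```python
def normalize_textangle(angle: float) -> float:
    """Map an angle in degrees to the half-open range (-180, 180].

    The range is an assumption made for this sketch; the hOCR
    specification does not state which range is allowed.
    """
    # For a positive modulus, Python's % always yields a value in [0, 360).
    wrapped = angle % 360.0
    if wrapped > 180.0:
        wrapped -= 360.0  # shift (180, 360) down to (-180, 0)
    return wrapped
```

With this, 270 and -90 describe the same orientation and both map to -90, so a consumer only has to handle one representation.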
Remark 2: It was not fully clear to me which HTML elements should get a
textangle attribute. ocrx_word and ocr_line should certainly get one. ocr_par
and ocr_carea may have a textangle attribute if all the words they contain
have the same orientation.
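
The rule suggested in Remark 2 for ocr_par and ocr_carea could be sketched as follows. This is only an illustration of the idea, not how Tesseract decides anything; `common_textangle` is a hypothetical helper, and the exact equality test assumes the angles were already rounded or normalized.

```python
from typing import Optional, Sequence


def common_textangle(word_angles: Sequence[float]) -> Optional[float]:
    """Return the shared angle if every word has the same one, else None.

    None means the enclosing element (e.g. ocr_par or ocr_carea)
    should not carry a textangle of its own.
    """
    if not word_angles:
        return None  # no words, nothing to propagate
    first = word_angles[0]
    # Exact comparison: assumes angles were normalized beforehand.
    if all(angle == first for angle in word_angles):
        return first
    return None
```

A paragraph whose words are all at 90 degrees would get textangle 90, while a paragraph mixing 0 and 90 would get no textangle at all.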
Remark 3: It seems that Tesseract is currently not good at OCRing text that
runs from top to bottom (I guess that it currently assumes that the text is
running from bottom to top, and therefore produces garbage).
Thanks
For information, please also find below an extract of the hocr format
specification (part relating to the textangle attribute)
(https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview)
"Properties are defined by putting information into the “title=” attribute
of an HTML tag. Properties in title attributes are of the form “name
values…”, and multiple properties are separated by semicolons.
...
textangle alpha - the angle in degrees by which textual content has been rotated relative to the rest of the page (if not present, the angle is assumed to be zero); rotations are counter-clockwise, so an angle of 90 degrees is vertical text running from bottom to top in Latin script; note that this is different from reading order, which should be indicated using standard HTML properties"
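
To illustrate the quoted format, here is a minimal sketch of how a tool could read the properties from a title attribute and fall back to an angle of zero when textangle is absent. The function names are hypothetical, and the simple split on semicolons is an assumption that ignores quoted values which might themselves contain semicolons.

```python
from typing import Dict, List


def parse_hocr_title(title: str) -> Dict[str, List[str]]:
    """Split a title attribute into {property name: list of values}."""
    props: Dict[str, List[str]] = {}
    for chunk in title.split(";"):       # properties are separated by ';'
        parts = chunk.split()            # "name value value ..."
        if not parts:
            continue                     # skip empty chunks, e.g. a trailing ';'
        props[parts[0]] = parts[1:]
    return props


def get_textangle(title: str) -> float:
    """Return the textangle in degrees, or 0.0 if it is not present."""
    values = parse_hocr_title(title).get("textangle")
    return float(values[0]) if values else 0.0
```

For example, `get_textangle("bbox 10 20 30 200; textangle 90")` gives 90.0, which according to the extract above means text running from bottom to top.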
Original comment by julien.p...@googlemail.com
on 16 May 2013 at 7:32
Original issue reported on code.google.com by
matth...@gmail.com
on 3 Apr 2013 at 4:55