ecit241 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

hocr output should contain rotation #885

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use an input image at a rotation (e.g., 90 degrees)
2. Output the text to hocr

What is the expected output? What do you see instead?
The hocr output should ideally provide some indication of the rotation angle 
applied before OCRing the text.  Without this angle, it is impossible to 
reconcile the output hocr text with the original image.

What version of the product are you using? On what operating system?
Windows/Linux, 3.02

Please provide any additional information below.

Original issue reported on code.google.com by matth...@gmail.com on 3 Apr 2013 at 4:55

GoogleCodeExporter commented 9 years ago
I OCRed images containing both horizontal and vertical text (in the same 
image). Tesseract recognized the horizontal and vertical text, but did not 
set the corresponding "textangle" attributes in the generated hocr file.
This attribute is provided for in the hocr specification and should be supported by 
tesseract.

Original comment by julien.p...@googlemail.com on 25 Apr 2013 at 7:35

GoogleCodeExporter commented 9 years ago
Can you please post an example image and the (desired) hocr output?

Original comment by zde...@gmail.com on 16 May 2013 at 2:04

GoogleCodeExporter commented 9 years ago
Hi,

Here are the requested data:
- example image (I created a dummy image, as I do not want to disclose my 
documents)
- current hocr output (produced with 3.02 running on FreeBSD)
- A PDF file showing the detected text as well as the respective bbox borders 
for words and paragraphs
- the desired hocr output

Remark 1: The hocr specification does not specify the allowed range of the text 
angle.
I made the assumption that values between -180 and +180 degrees are allowed.
(Ideally, tools using hocr as input should be able to cope with angles in any 
range...)
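
A minimal sketch of how a consuming tool might cope with angles in any range, assuming it wants to fold them into the -180 to +180 range suggested above. The function name normalize_textangle is hypothetical, and mapping -180 to +180 is an arbitrary choice, since the specification does not define a range.

```python
def normalize_textangle(angle_degrees: float) -> float:
    """Fold an angle in degrees into the half-open range (-180, 180].

    Assumption: this range is the one proposed in the comment above,
    not something the hocr specification prescribes.
    """
    # Python's % returns a value in [0, 360) for a positive modulus,
    # even when the input is negative.
    wrapped = angle_degrees % 360.0
    if wrapped > 180.0:
        wrapped -= 360.0
    return wrapped
```

With this convention, 270 degrees and -90 degrees both describe the same rotation and come out as -90.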

Remark 2: It was not fully clear to me which html elements should get a 
textangle attribute. ocrx_word and ocr_line should certainly get one. ocr_par 
and ocr_carea may have a textangle attribute if all the words they contain have the 
same orientation.
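
The rule in Remark 2 for ocr_par and ocr_carea could be sketched as follows. This is only an illustration of the proposal, not anything tesseract does; the function name shared_textangle and the exact-equality comparison are assumptions.

```python
from typing import Optional, Sequence


def shared_textangle(word_angles: Sequence[float]) -> Optional[float]:
    """Return the common angle if every word has the same one, else None.

    A caller would emit a textangle on the enclosing ocr_par or ocr_carea
    only when this returns a value.
    """
    if not word_angles:
        return None  # nothing to agree on
    first = word_angles[0]
    if all(angle == first for angle in word_angles):
        return first
    return None  # mixed orientations: leave the attribute off the container
```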

Remark 3: It seems that tesseract is currently not good at OCRing text that 
runs from top to bottom (I guess that it currently assumes that the text is 
running from bottom to top, and therefore produces garbage).

Thanks

For information, please also find below an extract of the hocr format 
specification (part relating to the textangle attribute)
(https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview)

"Properties are defined by putting information into the “title=” attribute 
of an HTML tag. Properties in title attributes are of the form “name 
values…”, and multiple properties are separated by semicolons.

...

    textangle alpha - the angle in degrees by which textual content has been rotated relative to the rest of the page (if not present, the angle is assumed to be zero); rotations are counter-clockwise, so an angle of 90 degrees is vertical text running from bottom to top in Latin script; note that this is different from reading order, which should be indicated using standard HTML properties"
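
To illustrate the extract above, here is a rough sketch of how a tool reading hocr might pull textangle out of a title attribute. The function names parse_hocr_title and get_textangle are hypothetical, and the splitting is deliberately naive: it assumes no property value contains a semicolon or quoted whitespace.

```python
from typing import Dict, List


def parse_hocr_title(title: str) -> Dict[str, List[str]]:
    """Split a title attribute into {property name: list of value strings}."""
    properties: Dict[str, List[str]] = {}
    for chunk in title.split(";"):  # properties are separated by semicolons
        parts = chunk.split()       # each property has the form "name values..."
        if not parts:
            continue                # tolerate empty chunks, e.g. a trailing ";"
        properties[parts[0]] = parts[1:]
    return properties


def get_textangle(title: str) -> float:
    """Return the textangle in degrees, or 0.0 when the property is absent."""
    values = parse_hocr_title(title).get("textangle")
    return float(values[0]) if values else 0.0
```

For example, get_textangle("bbox 10 20 110 60; textangle 90") gives 90.0, while a title holding only a bbox gives 0.0, matching the default stated in the specification.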

Original comment by julien.p...@googlemail.com on 16 May 2013 at 7:32
