This useful idea is buried in issue 918, entitled "Teach tesseract to recognize
columns"; I think it deserves a separate issue.
There is some code available at https://code.google.com/r/email-hocr-tsv/ and
an online demo. A sample output is temporarily available at
http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv.
I made some comments on issue 918, but I have since come to the conclusion that
the TSV format should provide all the information available in hOCR, namely:
level page_num block_num par_num line_num word_num left top width
height baseline conf dir lang font fsize strong/em text.
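
To illustrate how a consumer might read such a file, here is a minimal Python
sketch. It assumes the file starts with a header row using exactly the column
names listed above and that fields are separated by single tabs; neither is
settled by this proposal. The function name read_hocr_tsv and the sample row
values are made up for illustration.

```python
import csv
import io

# Column names copied from the proposed list; the exact header spelling
# and the presence of a header row are assumptions of this sketch.
COLUMNS = [
    "level", "page_num", "block_num", "par_num", "line_num", "word_num",
    "left", "top", "width", "height", "baseline", "conf",
    "dir", "lang", "font", "fsize", "strong/em", "text",
]

# The structural indices and the bounding box are treated as integers.
INT_COLUMNS = COLUMNS[:10]  # level .. height


def read_hocr_tsv(stream):
    """Yield one dict per data row, converting the integer columns."""
    # QUOTE_NONE keeps quote characters in recognized text untouched.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        yield row


# Invented sample data: one header line and one row.
SAMPLE = (
    "\t".join(COLUMNS) + "\n"
    + "\t".join(["5", "1", "1", "1", "1", "1", "36", "92", "60", "24",
                 "0", "95", "ltr", "eng", "serif", "12", "", "Alice"]) + "\n"
)

rows = list(read_hocr_tsv(io.StringIO(SAMPLE)))
print(rows[0]["text"], rows[0]["left"], rows[0]["top"])
```

Keeping the header row in the file lets a reader pick columns by name, so
adding or reordering columns later would not break existing consumers.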
Original issue reported on code.google.com by jsb...@mimuw.edu.pl on 9 Nov 2014 at 6:06