baopham1340 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

a wishlist: alternative TSV output #1378

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
This useful idea is hidden in issue 918, entitled "Teach tesseract to recognize 
columns". I think it deserves a separate issue. 

There is some code available at https://code.google.com/r/email-hocr-tsv/ and 
an online demo. A sample output is temporarily available at 
http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv.

I made some comments on issue 918, but I have now come to the conclusion that 
the TSV format should provide all the information available in hOCR, namely:

  level page_num block_num par_num line_num word_num left top width
  height baseline conf dir lang font fsize strong/em text.
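
To illustrate how a consumer might read such a file, here is a minimal Python sketch. It assumes one tab-separated row per element with exactly the 18 columns listed above, in that order, and an optional header row; none of this is confirmed by the linked code. The sample row and the `read_rows` helper are hypothetical and exist only for illustration.

```python
import csv
import io

# Column names copied from the proposal above; "strong/em" is treated
# as a single column (an assumption, the proposal does not say).
PROPOSED_COLUMNS = [
    "level", "page_num", "block_num", "par_num", "line_num", "word_num",
    "left", "top", "width", "height", "baseline", "conf", "dir", "lang",
    "font", "fsize", "strong/em", "text",
]


def read_rows(stream):
    """Yield one dict per TSV row, keyed by the proposed column names."""
    # QUOTE_NONE keeps quote characters in recognized text untouched.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if fields == PROPOSED_COLUMNS:
            continue  # skip an optional header row
        if len(fields) != len(PROPOSED_COLUMNS):
            raise ValueError(f"expected {len(PROPOSED_COLUMNS)} fields, got {len(fields)}")
        yield dict(zip(PROPOSED_COLUMNS, fields))


# Hypothetical sample: a header plus one word-level row with made-up values.
sample_values = ["5", "1", "1", "1", "1", "1", "36", "92", "60", "24",
                 "0", "95", "ltr", "eng", "", "12", "", "Alice"]
sample = "\t".join(PROPOSED_COLUMNS) + "\n" + "\t".join(sample_values) + "\n"

rows = list(read_rows(io.StringIO(sample)))
for row in rows:
    print(row["page_num"], row["line_num"], row["word_num"], row["text"])
```

Keeping every row at a fixed width, with empty strings for attributes that do not apply, means each line can be split on tabs without any further parsing.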

Original issue reported on code.google.com by jsb...@mimuw.edu.pl on 9 Nov 2014 at 6:06

GoogleCodeExporter commented 9 years ago
I've merged that branch and moved it to a pull request on GitHub: 
https://github.com/tesseract-ocr/tesseract/pull/18
Closing this issue.

Original comment by joregan on 13 May 2015 at 9:32