tesseract hOCR output with page, line and word classes, so it can be converted to djvu-hidden-text structure

AmitGorvadiya / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

tesseract hOCR output with page, line and word classes, so it can be converted to djvu-hidden-text structure #221

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

Hello everybody,

I posted a request for a feature that would make automated OCR possible
for DjVu hidden-text layers, so that these files can be searched and
words can be selected, as in a PDF document.

Please see this thread for all the information:
http://groups.google.com/group/tesseract-ocr/msg/c902b36a01ba8f11?hl=en

Thanks in advance,

Jelle de Jong

Original issue reported on code.google.com by jong...@gmail.com on 15 Jul 2009 at 7:32
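For readers unfamiliar with the two formats, here is a minimal sketch of the mapping the request is about. It assumes hOCR word boxes of the form `bbox x0 y0 x1 y1` measured from the top-left corner of the page, and djvused hidden-text entries of the form `(word xmin ymin xmax ymax "text")` measured from the bottom-left corner. The function names and the `page_height` parameter are hypothetical and are not taken from tesseract or any existing converter.

```python
import re


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title such as 'bbox 10 20 110 60'."""
    match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if match is None:
        raise ValueError(f"no bbox found in {title!r}")
    return tuple(int(value) for value in match.groups())


def to_djvu_box(bbox, page_height):
    """Flip the y axis: hOCR counts from the top edge, DjVu from the bottom."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


def quote(text):
    """Quote a string for an s-expression, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def word_sexpr(title, text, page_height):
    """Build one djvused-style word entry from an hOCR word title and its text."""
    xmin, ymin, xmax, ymax = to_djvu_box(parse_bbox(title), page_height)
    return f"(word {xmin} {ymin} {xmax} {ymax} {quote(text)})"


if __name__ == "__main__":
    # Hypothetical word box on a page that is 1000 pixels tall.
    print(word_sexpr("bbox 10 20 110 60", "Hello", 1000))
```

Line and page entries nest in the same way, which is why the hOCR output needs distinct page, line and word classes for the conversion to work.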

GoogleCodeExporter commented 9 years ago

I have posted a patch which implements hOCR output support. See

http://code.google.com/p/tesseract-ocr/issues/detail?id=263

Original comment by amkryu...@gmail.com on 22 Nov 2009 at 4:34

GoogleCodeExporter commented 9 years ago

Fixed by patch in issue 263 and in 3.00.

Original comment by theraysm...@gmail.com on 19 May 2010 at 11:10

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I've made a quick and dirty patch that adds djvused output to tesseract-ocr 
v3.02.02.  It still needs testing with CJK and right-to-left scripts and with 
multipage OCR, but it works for single pages in Russian.  The patch adds a new 
configuration option, "tessedit_create_djvused".  The tesseract(1) man page is 
left untouched, since I'm (sadly) not familiar with its syntax.

Original comment by ksa...@gmail.com on 6 Aug 2013 at 10:06

GoogleCodeExporter commented 9 years ago

Updated the patch to emit plain UTF-8 encoded output, which djvused happily accepts.

Original comment by ksa...@gmail.com on 7 Aug 2013 at 3:59

Attachments:

tesseract-ocr-3.02.02-djvused-output.patch

GoogleCodeExporter commented 9 years ago

Is such an option really needed? Why not use Jakub Wilk's hocr2djvused, 
distributed with ocrodjvu, or just ocrodjvu itself, which since version 0.7.15 
also supports tesseract 3.02: http://jwilk.net/software/ocrodjvu.

Original comment by jsb...@mimuw.edu.pl on 8 Aug 2013 at 4:19

GoogleCodeExporter commented 9 years ago

hOCR and djvused are the most used OCR output formats nowadays besides plain 
text, so why not support both, especially if it's quite trivial?

I use Gentoo and Debian distributions, and ocrdjvu is not in the official 
repositories, while tesseract is.  Moreover, hocr2djvused encodes UTF-8 
characters as escaped octals, which makes the non-English djvused output it 
produces pretty much uneditable, even though djvused accepts UTF-8 as it is.

Original comment by ksa...@gmail.com on 8 Aug 2013 at 7:39
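To illustrate the editability point, here is a minimal sketch contrasting the two quoting styles for a djvused text string. It assumes that non-ASCII text is escaped byte by byte as backslash-octal sequences in one case and passed through as raw UTF-8 in the other. Both function names are hypothetical; this is not the actual code of hocr2djvused or of the patch.

```python
def escape_octal(text):
    """Quote text, writing every non-printable or non-ASCII byte as \\ooo."""
    out = []
    for byte in text.encode("utf-8"):
        if byte in (0x22, 0x5C):  # double quote and backslash
            out.append("\\" + chr(byte))
        elif 0x20 <= byte < 0x7F:  # printable ASCII stays readable
            out.append(chr(byte))
        else:
            out.append(f"\\{byte:03o}")
    return '"' + "".join(out) + '"'


def keep_utf8(text):
    """Quote text, escaping only backslashes and quotes and leaving UTF-8 as is."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


if __name__ == "__main__":
    word = "да"  # a short Russian word
    print(escape_octal(word))  # four octal escapes, unreadable in an editor
    print(keep_utf8(word))     # the word itself, still editable by hand
```

Both strings describe the same bytes; the difference only matters to a person who opens the djvused file to correct OCR errors by hand.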

GoogleCodeExporter commented 9 years ago

A correction: ocrodjvu is in the official repositories of Debian, Ubuntu and 
openSUSE (but unfortunately not always the latest version).

Original comment by jsb...@mimuw.edu.pl on 8 Aug 2013 at 6:54

GoogleCodeExporter commented 9 years ago

My bad, I made a silly typo and didn't double-check.  It is missing from the 
Gentoo repos though (my primary distro), and the "uneditableness" issue still stands.

Original comment by ksa...@gmail.com on 8 Aug 2013 at 8:21