documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

Add option to generate hOCR output instead of raw text when performing OCR via tesseract #81

Closed by jhosteny 11 years ago

jhosteny commented 11 years ago

This patch forces tesseract to generate hOCR output when the --hocr option is given. It also suppresses text cleaning. This addresses issue #80.

knowtheory commented 11 years ago

Hey @jhosteny, have you tested this patch? As far as I'm aware, you have to pass in a config file, which this pull request doesn't supply.

jhosteny commented 11 years ago

@knowtheory, sorry for the late reply. Yes, I am using my fork with this change in a project, and no additional configuration is necessary. I'm running the latest tesseract on Ubuntu Raring. Here are the details:

tesseract 3.02.01
 leptonica-1.69
  libgif 4.1.6 : libjpeg 8b : libpng 1.2.49 : libtiff 4.0.2 : zlib 1.2.7

I may have missed something, but it didn't look like there was a test that runs tesseract. If you'd rather wait until one is there, I can work on that as part of a new patch.

jsfenfen commented 11 years ago

@knowtheory: This works for me running "Tesseract Open Source OCR Engine v3.02.02" on Ubuntu 12.04 with leptonica 1.69. I think the argument (i.e. "hocr") is actually the name of the config file to use, and I'm guessing it only works if a config file of that name is in the right place (maybe /somewhere/tessdata/configs/). The documentation isn't especially clear. The hocr config file is defined here: http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/configs/hocr and the whole set of default configs is available here: http://code.google.com/p/tesseract-ocr/source/browse/#svn/trunk/tessdata/configs
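
To make the point about the trailing argument concrete, here is a minimal sketch of how the command line might be assembled. The helper name tesseract_command and its parameters are hypothetical and are not taken from docsplit or from this patch; the only thing assumed about tesseract is its documented invocation form, where config file names follow the image and the output base.

```ruby
# Hypothetical helper for illustration only; not docsplit's actual code.
# Builds an argument list of the form:
#   tesseract <image> <output_base> -l <language> [config_name]
# The trailing word is treated by tesseract as a config file name,
# which is how "hocr" selects hOCR output.
def tesseract_command(image, output_base, language: "eng", config: nil)
  cmd = ["tesseract", image, output_base, "-l", language]
  cmd << config if config
  cmd
end

# Example usage (not executed here, since it needs tesseract installed):
#   system(*tesseract_command("page.tif", "page", config: "hocr"))
```

Passing the arguments as an array to system avoids shell quoting problems with file names that contain spaces.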

For the sake of argument, would it make sense for the patch to just give the option of specifying a path to a config file? That way a more complex config file could be used, and it wouldn't be explicitly dependent on the tesseract library shipping with the default configs.
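
A rough sketch of what that could look like, under stated assumptions: resolve_config, the hocr flag, and the config_path option are all hypothetical names, and the sketch assumes tesseract will accept a file path in the position where it accepts a config name, which I have not verified against every version.

```ruby
# Hypothetical sketch; none of these names exist in docsplit.
# Returns the value to append to the tesseract command line:
#   - an absolute path when the caller supplies a config file,
#   - the bundled config name "hocr" when only the hocr flag is set,
#   - nil when neither is requested (plain text output).
def resolve_config(hocr: false, config_path: nil)
  if config_path
    unless File.file?(config_path)
      raise ArgumentError, "config file not found: #{config_path}"
    end
    File.expand_path(config_path)
  elsif hocr
    "hocr"
  end
end
```

An explicit path takes precedence over the flag here, so a custom config would not depend on the default configs shipping with the installed tesseract.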

jhosteny commented 11 years ago

Closing in favor of #92.