Closed jhosteny closed 11 years ago
Hey @jhosteny, have you tested this patch? As far as I'm aware, you have to pass in a config file, which this pull request doesn't supply.
@knowtheory, sorry for the late reply. Yes, I am using my fork with this change in a project, and no additional configuration is necessary. I'm running the latest tesseract on Ubuntu Raring. Here are the details:
tesseract 3.02.01
leptonica-1.69
libgif 4.1.6 : libjpeg 8b : libpng 1.2.49 : libtiff 4.0.2 : zlib 1.2.7
I may have missed something, but it didn't look like there was a test that runs tesseract. If you'd rather wait until one is there, I can work on that as part of a new patch.
@knowtheory: This works for me while running "Tesseract Open Source OCR Engine v3.02.02" on Ubuntu 12.04, w/ leptonica 1.69. I think that the argument (i.e. "hocr") is actually the name of the config file to use, and I'm guessing it only works if a config file of that name is in the right place (maybe /somewhere/tessdata/configs/). The documentation isn't especially clear. The hocr file used is defined here: http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/configs/hocr and the whole set of default configs is available here: http://code.google.com/p/tesseract-ocr/source/browse/#svn/trunk/tessdata/configs
For the sake of argument, would it make sense for the patch to just give the option of specifying a path to a config file? That way a more complex config file could be used, and it wouldn't be explicitly dependent on the tesseract library shipping with the default configs.
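To make the idea concrete, here is a minimal sketch of how a wrapper might pass either a config name or a config path through to tesseract. It assumes the tesseract 3.02 command-line form `tesseract imagename outputbase [-l lang] [configfile...]`. The helper name `build_tesseract_command` and its parameters are hypothetical and not part of this patch, and whether tesseract accepts a full path in the config position is an assumption that should be checked against the installed version.

```python
from typing import List, Optional


def build_tesseract_command(image: str, output_base: str,
                            lang: Optional[str] = None,
                            config: Optional[str] = None) -> List[str]:
    """Build an argv list for the tesseract CLI (hypothetical helper).

    `config` may be a bare name such as "hocr", which tesseract looks up
    among its shipped configs, or (assumption) a path to a custom file.
    """
    cmd = ["tesseract", image, output_base]
    if lang:
        cmd += ["-l", lang]
    if config:
        # Config file arguments come last, after the options.
        cmd.append(config)
    return cmd


if __name__ == "__main__":
    # Show the command that would be run; nothing is executed here.
    print(" ".join(build_tesseract_command("page.tif", "page",
                                           lang="eng", config="hocr")))
```

Returning a list rather than a single string means the result can be handed to `subprocess.run` without shell quoting issues, and leaving `config` unset keeps the default plain-text output.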
Closing in favor of #92
This patch forces tesseract to generate hOCR output when the --hocr option is added. It also suppresses text cleaning. This addresses issue #80.