Open intellisense opened 9 years ago
First Q, are you OCRing english or non-english docs?
If the latter, you can set the --no-clean flag (if you're using Docsplit from the command line). If you upgrade to 0.7.6, setting the --language flag will automatically set --no-clean.
If you're OCRing an english language doc, we'd be interested in seeing a sample doc (as our TextCleaner isn't doing the right thing if that's the case).
Thanks for the quick reply. Yes, we are OCRing the docs, in whatever language they happen to be. This doc failed with the same error as above, although it is in English with some handwriting in it. I am using docsplit 0.7.5.
Alrighty, mind letting us know what tesseract version you're using? We're on docsplit 0.7.6 and tesseract 3.03 (and succeeded in processing the doc linked above). Looks like you're on Ubuntu?
Here are the full environment details:
Ubuntu 14.04 (trusty)
docsplit 0.7.5
tesseract 3.03
Let me know if you want any more information. Thanks
The TextCleaner strips out character sequences that look like garbage in English (lots of consonants in a row, for example). So if your input is clean-ish, turning it off won't change much.
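To make the "lots of consonants in a row" idea concrete, here is a minimal Ruby sketch of that kind of heuristic. It is not Docsplit's actual TextCleaner: the method names and the threshold of six consonants are assumptions chosen for illustration, and a rule this crude will also flag a few real English words.

```ruby
# Illustrative only: NOT Docsplit's TextCleaner implementation.
# A word is treated as probable OCR garbage when it contains a long
# run of consonants. The threshold (6) is an arbitrary assumption.
CONSONANT_RUN = /[bcdfghjklmnpqrstvwxz]{6,}/i

# Hypothetical helper: true when the word contains such a run.
def looks_like_garbage?(word)
  !!(word =~ CONSONANT_RUN)
end

# Hypothetical helper: drops the words flagged above, keeps the rest.
def strip_garbage(text)
  text.split(/\s+/).reject { |word| looks_like_garbage?(word) }.join(' ')
end

puts strip_garbage("The quick xkcdqrtz brown fox")
```

A rule like this only makes sense for English-like text, which is why cleaning is skipped for other languages.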
So text extraction only works on English? Is there any handy tool you can recommend that can easily extract plain text from non-English PDFs?
Text cleaning only works in english. Docsplit'll OCR in non-english languages if you specify the input language.
@intellisense: My environment is pretty close to yours and I'm able to extract your documents successfully.
Can you tell me what docsplit command you are running? I ran: docsplit text <pdf_file>
Can you also provide the Ruby version from ruby --version? Mine is ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]
@nathanstitt I am using this command: docsplit text --output /output/path/abc.txt /input/path/abc.pdf
The ruby version is exactly the same as yours: ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux].
I just ran the command with the --no-clean flag and it works. But without this flag I am having the trouble mentioned above.
Hm. Since our commands and ruby versions are the same, I'm thinking that the culprit may be Tesseract. Perhaps your version is generating some sequence of UTF characters that Docsplit/Ruby doesn't like.
My tesseract --version reports:
tesseract 3.03 leptonica-1.70 libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
Do your versions differ?
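One way to test the byte-sequence hypothesis is to check whether the OCR output is valid UTF-8 before any cleaning runs. The sketch below is a hypothetical helper, not part of Docsplit. It assumes the OCR text is meant to be UTF-8 and relies on String#scrub, which needs Ruby 2.1 or later (so it would not run as-is on the 1.9.3 mentioned in this thread).

```ruby
# Hypothetical helper, not part of Docsplit.
# Assumes the OCR output is supposed to be UTF-8.
def sanitize_ocr_text(raw)
  # Work on a copy so the caller's string is left untouched.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?
  # String#scrub (Ruby 2.1+) replaces each invalid byte sequence.
  text.scrub('?')
end

# "\xC3" is a UTF-8 lead byte with no continuation byte after it.
broken = "caf\xC3 ok".b
puts sanitize_ocr_text(broken)
```

If the raw Tesseract output fails valid_encoding?, that would support the idea that the cleaner is choking on malformed bytes and not on the document's language.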
I should also note that docsplit/tesseract didn't do a very good job on the second document you linked above. Since the scans were blurry, the text is pretty garbled. The text cleaner attempted to clean it, but the difference between using --no-clean and the normal command line wasn't very large. I think you'll be fine using the --no-clean flag if we can't get to the bottom of the issue.
The tesseract version is exactly the same as yours, including every image library you mentioned; no difference whatsoever. I think I should go with the --no-clean flag, but it's not an optimal solution as I want to support text extraction from non-English documents as well. You can close this if you want to. Thanks for the help, I highly appreciate it.
Hey @intellisense. Sorry for the confusion, but you absolutely can extract text from non-English documents with or without the --no-clean flag. In fact, if you are extracting from non-English documents, no-clean is set to true internally, so whether you pass the flag makes no difference.
All the option does is disable running the TextCleaner (which removes non-valid characters) on the OCR'ed text. Since the TextCleaner only knows how to recognize garbage in English text, English is the only language it's effective on.
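The interaction between the two flags described above can be summarized in a few lines of Ruby. This is a hypothetical sketch, not code from Docsplit's source: the method name and option keys are invented for illustration, and it assumes that an unspecified language, or Tesseract's 'eng' code, counts as English.

```ruby
# Hypothetical sketch of the decision described in this thread.
# Names are illustrative and not taken from Docsplit's source.
def run_cleaner?(options)
  # An explicit no-clean request always disables the cleaner.
  return false if options[:no_clean]
  # Assumption: no language given, or 'eng', means English text.
  language = options[:language]
  language.nil? || language == 'eng'
end

puts run_cleaner?({})                     # English by default
puts run_cleaner?({ :language => 'deu' }) # non-English input
puts run_cleaner?({ :no_clean => true })  # cleaner disabled by hand
```

Under these assumptions the cleaner only ever runs on English input, so passing --no-clean alongside a non-English --language changes nothing.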
I am getting several errors like these. Any workaround? Thanks!