documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.com/docsplit/

Encoding issue - invalid byte sequence in US-ASCII (ArgumentError) #121

Open intellisense opened 9 years ago

intellisense commented 9 years ago

I am getting several errors like these. Any workaround? Thanks!

Exception("ErrorCode 1: /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:49:in `scan': invalid byte sequence in US-ASCII (ArgumentError)
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:49:in `block in clean'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:48:in `loop'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:48:in `clean'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit.rb:79:in `clean_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:92:in `block in clean_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:88:in `open'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:88:in `clean_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:78:in `extract_from_ocr'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:36:in `block in extract'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:32:in `each'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:32:in `extract'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit.rb:45:in `extract_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/command_line.rb:46:in `run'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/command_line.rb:37:in `initialize'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/bin/docsplit:5:in `new'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/bin/docsplit:5:in `<top (required)>'
from /usr/bin/docsplit:23:in `load'
from /usr/bin/docsplit:23:in `<main>')
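The failure comes from Ruby's `String#scan` being called on text whose bytes aren't valid in the string's assumed encoding (US-ASCII here, the default external encoding on many minimal Linux setups). A minimal stand-alone reproduction, with a sketch of a workaround (reinterpreting the bytes as UTF-8) — this is not docsplit's code, just the mechanism:

```ruby
# Minimal reproduction of the error above (not docsplit's code).
# OCR output containing UTF-8 bytes, mislabeled as US-ASCII -- which is
# what happens when Ruby's default external encoding is ASCII:
raw = "r\xC3\xA9sum\xC3\xA9".dup.force_encoding("US-ASCII")

begin
  raw.scan(/\S+/)           # the same kind of call as text_cleaner.rb:49
rescue ArgumentError => e
  puts e.message            # "invalid byte sequence in US-ASCII"
end

# Reinterpreting the bytes as UTF-8 (and scrubbing anything still
# invalid) makes the string safe to scan:
safe = raw.force_encoding("UTF-8").scrub("")
p safe.scan(/\S+/)          # ["résumé"]
```

The thread below settles on the `--no-clean` / `--language` route rather than patching the encoding.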
knowtheory commented 9 years ago

First question: are you OCRing English or non-English docs?

If the latter, you can set the --no-clean flag (if you're using Docsplit from the command line). If you upgrade to 0.7.6, setting the --language flag will automatically set --no-clean.

If you're OCRing an English-language doc, we'd be interested in seeing a sample doc (our TextCleaner isn't doing the right thing if that's the case).

intellisense commented 9 years ago

Thanks for the quick reply. Yes, we are OCRing the docs, no matter what language they're in. This doc failed with the same error as above, although it is English with some handwriting in it. And I am using docsplit 0.7.5.

knowtheory commented 9 years ago

Alrighty, mind letting us know which Tesseract version you're using? We're on docsplit 0.7.6 and tesseract 3.03 (which succeeded in processing the doc linked above). Looks like you're on Ubuntu?

intellisense commented 9 years ago

Here are the full environment details:

Ubuntu 14.04 (trusty)
docsplit 0.7.5
tesseract 3.03

Let me know if you want any more information. Thanks

intellisense commented 9 years ago

Here are some more files to test on: doc1 and doc2. Please tell me the solution for this issue, as we are in production :( Also, what are the consequences of using the --no-clean flag?

Thanks!

knowtheory commented 9 years ago

The TextCleaner will strip out character sequences that look like garbage in English (lots of consonants in a row, for example). So if your input is clean-ish, turning it off won't do much.
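For illustration, a toy heuristic in that spirit (an invented threshold and regex, not docsplit's actual TextCleaner) shows why legitimate non-English words can be misclassified as garbage:

```ruby
# Toy garbage detector -- an invented heuristic, not docsplit's TextCleaner.
# Flags "words" containing a long run of consonants.
CONSONANT_RUN = /[bcdfghjklmnpqrstvwxz]{6,}/i

def looks_like_garbage?(word)
  word.match?(CONSONANT_RUN)
end

# A real OCR mistake is caught, but so is a real German word:
p looks_like_garbage?("tjqkxbnf")      # true  -- OCR noise
p looks_like_garbage?("report")        # false -- plain English
p looks_like_garbage?("Angstschweiß")  # true  -- false positive ("ngstschw")
```

That kind of false positive is why, per the earlier comment, --language implies --no-clean from 0.7.6 on.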

intellisense commented 9 years ago

So text extraction only works on English? Is there a handy tool you can recommend that can easily extract plain text from non-English PDFs?

knowtheory commented 9 years ago

Text cleaning only works in English. Docsplit will OCR in non-English languages if you specify the input language.

nathanstitt commented 9 years ago

@intellisense: My environment is pretty close to yours and I'm able to extract your documents successfully.

Can you tell me what docsplit command you are running? I ran: docsplit text <pdf_file>

Can you also provide the Ruby version from ruby --version? Mine is ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]

intellisense commented 9 years ago

@nathanstitt I am using this command: docsplit text --output /output/path/abc.txt /input/path/abc.pdf. The Ruby version is exactly the same as yours: ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux].

I just ran the command with --no-clean flag and it works. But without this flag I am having trouble as mentioned above.

nathanstitt commented 9 years ago

Hm. Since our commands and Ruby versions are the same, I'm thinking the culprit may be Tesseract. Perhaps your version is generating some sequence of UTF-8 characters that Docsplit/Ruby doesn't like.

My tesseract --version reports: tesseract 3.03 leptonica-1.70 libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

Do your versions differ?

I should also note that docsplit/tesseract didn't do a very good job on the second document you linked above. Since the scans were blurry, the text is pretty garbled. The text cleaner attempted to clean it, but the difference between using --no-clean and the normal command line wasn't very large. I think you'll be fine using the --no-clean flag if we can't get to the bottom of the issue.
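If Tesseract's output is the suspect, one quick diagnostic (a sketch, using a simulated OCR file rather than a real one) is to check whether the file reads back as valid text under the encoding Ruby assumes:

```ruby
require "tempfile"

# Simulate an OCR output file containing UTF-8 bytes, then read it back
# the way a US-ASCII default external encoding would see it.
Tempfile.create("ocr") do |f|
  f.binmode
  f.write("caf\xC3\xA9 cost \xE2\x82\xAC3")
  f.flush

  ascii = File.read(f.path, encoding: "US-ASCII")
  p ascii.valid_encoding?                          # false -- scan would raise here
  p ascii.force_encoding("UTF-8").valid_encoding?  # true  -- the bytes are fine UTF-8
end
```

If the second check is true, the bytes themselves are valid UTF-8 and the problem is only the encoding Ruby assigns when reading them.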

intellisense commented 9 years ago

The Tesseract version is exactly the same as yours, with every image library you mentioned; no difference whatsoever. I think I should go with the --no-clean flag, but it's not an optimal solution, as I want to support text extraction from non-English documents as well. You can close this if you want to. Thanks for the help; I highly appreciate it.

nathanstitt commented 9 years ago

Hey @intellisense. Sorry for the confusion, but you absolutely can extract text from non-English documents, with or without the --no-clean flag. In fact, if you are extracting from non-English documents, the no-clean flag is set to true internally and its usage is ignored.

All the option does is disable running the TextCleaner (which removes invalid-looking characters) on the OCR'ed text. Since the TextCleaner only knows what valid English text looks like, that's the only language it's effective on.