documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

rails invalid byte sequence in UTF-8 #135

Open ghost opened 9 years ago

ghost commented 9 years ago

Hello, I got this error while trying to OCR this PDF document: https://www.dropbox.com/s/ko76kalp5p59hwc/contrato%20de%20fianza%20prueba%2010.pdf?dl=0

The code that fails is: Docsplit.extract_text(attachment.path, :output => output_dir, :language => 'spa')

I have tried using:

but none of the above helps; it still fails. Many other PDF documents work fine.

My environment: Rails 4.2, Ruby 2.2, Docsplit 0.7.6, tesseract-ocr 3.03, tesseract-ocr-spa 3.02

Any help please?

tbk303 commented 8 years ago

Check PR #134; it might fix your problem.
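
For context, Ruby raises "invalid byte sequence in UTF-8" when a string tagged as UTF-8 contains bytes that are not valid UTF-8 and is then used in something like a regexp match. Below is a minimal sketch of one common way to clean such a string with String#scrub (Ruby 2.1+). The helper name scrub_to_utf8 and the sample string are assumptions made up for illustration; this is not part of Docsplit and not necessarily what PR #134 does.

```ruby
# Hypothetical helper (not part of Docsplit): returns a copy of the input
# that is valid UTF-8, with each invalid byte sequence replaced.
def scrub_to_utf8(raw, replacement = "?")
  # Work on a copy tagged as UTF-8 so the caller's string is left untouched.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  # String#scrub swaps every invalid byte sequence for the replacement.
  text.valid_encoding? ? text : text.scrub(replacement)
end

# Made-up sample: a lone Latin-1 byte (0xE1), which is invalid as UTF-8.
broken = "fianza \xE1 prueba".b
clean  = scrub_to_utf8(broken)

puts clean.valid_encoding?
puts clean
```

Something like this could be applied to text read back from the output directory. If the error is raised inside Docsplit.extract_text itself, cleaning strings in application code may not be enough.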