documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

Clean pdffonts output to avoid invalid UTF-8 characters #134

Open tbk303 opened 9 years ago

tbk303 commented 9 years ago

I came across some unusual PDF files for which pdffonts outputs invalid UTF-8 characters. This results in an "invalid UTF-8 ..." exception when matching against NO_TEXT_DETECTED.
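
To illustrate the idea, here is a minimal sketch of cleaning the output before matching. It is not the actual diff: the method name `text_free?`, the sample output, and the default pattern (a stand-in for NO_TEXT_DETECTED) are assumptions made for the example. It relies on `String#scrub`, which is available from Ruby 2.1.

```ruby
# Illustrative helper, not docsplit's real method. The default pattern is an
# assumed stand-in for NO_TEXT_DETECTED: output that ends right after the
# dashed separator line, i.e. no fonts were listed.
def text_free?(pdffonts_output, pattern = /-{9}\n\z/)
  # String#scrub (Ruby >= 2.1) replaces invalid byte sequences; passing ''
  # simply drops them, so the regex match can no longer raise.
  clean = pdffonts_output.scrub('')
  !!(clean =~ pattern)
end

# Simulated pdffonts output containing \xFF, a byte that is never valid UTF-8.
raw = "name\xFF type\n---------\n"

puts raw.valid_encoding?            # false: the raw output is not valid UTF-8
puts raw.scrub('').valid_encoding?  # true once the bad byte is dropped
puts text_free?(raw)                # matching now works on the cleaned string
```

Dropping the invalid bytes (rather than substituting a replacement character) is one possible choice; it is harmless here because the pattern only looks at the dashed separator.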

If Ruby 1.9/2.0 compatibility is required, I can also extend this pull request with a scrub polyfill.
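
For reference, such a polyfill could look roughly like the sketch below. The helper names `scrub_compat` and `utf16_round_trip` are hypothetical and not part of docsplit. The fallback uses a round trip through UTF-16BE, since on older Rubies transcoding UTF-8 to UTF-8 may leave invalid bytes untouched.

```ruby
# Hypothetical fallback for Rubies without String#scrub (1.9 / 2.0).
# Transcoding to UTF-16BE forces every byte to be validated; invalid
# sequences are replaced with '' (dropped), then we convert back to UTF-8.
def utf16_round_trip(str)
  str.encode('UTF-16BE', 'UTF-8',
             invalid: :replace, undef: :replace, replace: '')
     .encode('UTF-8')
end

# Hypothetical wrapper: prefer the built-in scrub where it exists.
def scrub_compat(str)
  return str.scrub('') if str.respond_to?(:scrub)
  utf16_round_trip(str)
end

dirty = "Times\xFF-Roman"
puts scrub_compat(dirty).valid_encoding?      # cleaned string is valid UTF-8
puts utf16_round_trip(dirty).valid_encoding?  # same for the fallback path
```

Valid multibyte characters survive the round trip unchanged; only the invalid bytes are removed.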