documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.com/docsplit/

"Invalid byte sequence error" on master. #106

Closed KurtPreston closed 10 years ago

KurtPreston commented 10 years ago

We are getting the following error whenever converting MS Office docs:

.../lib/docsplit/transparent_pdfs.rb:12:in `block in ensure_pdfs': invalid byte sequence in UTF-8 (ArgumentError)

We have tested that conversion works on the latest official release (v0.7.4, commit 9172e30), but it is broken on master.
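For readers unfamiliar with this class of error, here is a minimal sketch of how it can arise in Ruby. It assumes the failing line runs a regex match against a string tagged as UTF-8 that contains non-UTF-8 bytes; the sample bytes and the `match_error` helper are illustrative and are not taken from docsplit.

```ruby
# Illustrative bytes only: the leading bytes of a legacy MS Office file,
# which do not form valid UTF-8.
OFFICE_HEADER = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b

# Hypothetical helper: returns the class name of the ArgumentError raised by
# matching +line+ against a regex, or nil when the match runs without raising.
def match_error(line)
  line =~ /\A%PDF/
  nil
rescue ArgumentError => e
  e.class.name
end

# The same bytes, tagged as UTF-8 the way a text-mode read might tag them.
utf8_line = OFFICE_HEADER.dup.force_encoding(Encoding::UTF_8)

puts utf8_line.valid_encoding?        # are the bytes valid in the tagged encoding?
puts match_error(utf8_line).inspect   # what the regex match raises, if anything
puts match_error("%PDF-1.4\n").inspect
```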

luccasmaso commented 10 years ago

Same here. Did you find a solution?

knowtheory commented 10 years ago

Hey guys, I missed this earlier. It's almost certainly this commit, then: https://github.com/documentcloud/docsplit/commit/929a42638999aba5e11883c1a5adad9436f03223

I would be interested to know if you have sample docs you could share that this is failing on.

luccasmaso commented 10 years ago

Just figured it out here too! I am attempting to solve it this way: File.open(doc, &:readline).force_encoding("BINARY") =~ /\A\%PDF-\d+(\.\d+)?$/
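The idea in this comment can be sketched as a standalone check. This is only a rough illustration, not the code that was eventually merged: the `pdf_header?` helper name, the binary-mode open, and the use of `gets` (so an empty file yields no match instead of an `EOFError`) are all assumptions added here.

```ruby
require "tempfile"

PDF_HEADER = /\A%PDF-\d+(\.\d+)?$/

# Hypothetical helper: true when the first line of the file at +path+ looks
# like a PDF header. The line is treated as raw bytes, so non-UTF-8 content
# cannot trigger an encoding error during the match.
def pdf_header?(path)
  first_line = File.open(path, "rb") { |f| f.gets } || ""
  first_line.force_encoding(Encoding::BINARY).match?(PDF_HEADER)
end

# Usage with two throwaway files: one PDF-like, one with Office-like bytes.
samples = {
  "pdf"    => "%PDF-1.4\n%\xE2\xE3\xCF\xD3\n".b,
  "office" => "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b
}

samples.each do |label, bytes|
  Tempfile.create(label) do |file|
    file.binmode
    file.write(bytes)
    file.flush
    puts "#{label}: #{pdf_header?(file.path)}"
  end
end
```

Matching against bytes works here because the pattern contains only ASCII characters, so it is compatible with a binary-encoded string.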

knowtheory commented 10 years ago

Hey guys, there was actually a pull request that handled this, which I have merged and cut a release for.