documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

making magic number-based detection of PDFs encoding-friendly, with tests #108

Closed: jonoterc closed 10 years ago

jonoterc commented 10 years ago

Magic number-based PDF detection is vulnerable to encoding issues. Forcing the first line to be read in binary mode should work across encodings, since the "%PDF" marker consists of 7-bit ASCII characters in any case. This change also removes the end-of-line anchor from the PDF version regex, to avoid issues with non-printing characters in certain cases.
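For readers following along, here is a minimal sketch of the approach described above. It is not the code from this pull request: the method name `pdf_magic_number?` and the constant `PDF_HEADER` are hypothetical, and the regex is only an assumption about what a version pattern without an end-of-line anchor might look like.

```ruby
require 'tempfile'

# Hypothetical helper for illustration; not docsplit's actual method.
# The pattern is anchored at the start only, with no trailing "$".
PDF_HEADER = /\A%PDF-\d+(\.\d+)?/

def pdf_magic_number?(path)
  # 'rb' reads raw bytes (ASCII-8BIT), so bytes after the marker that are
  # not valid in the default external encoding do not break the match.
  first_line = File.open(path, 'rb') { |f| f.gets }
  !first_line.nil? && first_line.match?(PDF_HEADER)
end

# Usage: a header followed by a carriage return and high-bit bytes.
Tempfile.create('sample') do |f|
  f.binmode
  f.write("%PDF-1.4\r%\xE2\xE3\xCF\xD3\n".b)
  f.flush
  puts pdf_magic_number?(f.path)
end
```

Using `gets` rather than `readline` means an empty file yields `nil` instead of raising `EOFError`, which the sketch treats as "not a PDF".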

Added tests, details:

jashkenas commented 10 years ago

Nice!

knowtheory commented 10 years ago

Thanks for fixing this @jonoterc and sorry for the delay in merging it (and :heart: the additional tests). Just cut a release for it: https://rubygems.org/gems/docsplit/versions/0.7.5

jonoterc commented 10 years ago

Great, thanks!