making magic number-based detection of PDFs encoding-friendly, with tests

jonoterc commented 10 years ago

Magic number-based PDF detection is vulnerable to encoding issues. Forcing the first line to be read in binary mode should work across encodings, since the "%PDF" marker consists of 7-bit ASCII characters in any case. This change also removes the end-of-line anchor from the PDF version regex, to avoid issues with non-printing characters in certain cases.
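To illustrate the idea, here is a minimal sketch, not the actual diff. The helper name `pdf_by_magic_number?`, the constant `PDF_MAGIC`, the exact regex, and the 1024-byte read limit are all assumptions made for this example.

```ruby
require 'tempfile'

# Hypothetical pattern: anchored at the start of the data only, with no
# end-of-line anchor, so trailing non-printing bytes cannot break the match.
PDF_MAGIC = /\A%PDF-\d+(\.\d+)?/

# Hypothetical helper (not docsplit's real method name).
# Opening with 'rb' yields a binary (ASCII-8BIT) string, so high bytes in
# the header never raise encoding errors when the regex is applied.
def pdf_by_magic_number?(path)
  File.open(path, 'rb') do |f|
    first_line = f.gets("\n", 1024) # read at most 1024 bytes of the first line
    !first_line.nil? && first_line.match?(PDF_MAGIC)
  end
end

# Usage: a file with no extension, a CR before the LF, and binary bytes
# on the second line.
Tempfile.create('no_extension') do |f|
  f.binmode
  f.write("%PDF-1.7\r\n%\xE2\xE3\xCF\xD3\n".b)
  f.flush
  puts pdf_by_magic_number?(f.path) # prints "true"
end
```

A carriage return before the newline is one example of a non-printing character that would defeat a `$` anchor placed right after the version number, since `$` only matches immediately before a newline or at the end of the string.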

Added tests, details:

- extracted PDF detection from the ensure_pdfs method for more direct testing
- new unit tests verify that PDFs without the ".pdf" extension are detected as PDFs
  - samples are included under "test/fixtures/without_pdf_extension"
  - add further PDFs (without extensions) to that directory and they will be automatically included in the tests
  - PDFs generated by Adobe InDesign previously raised encoding errors; samples (one for each recent PDF version) are included
- new unit tests verify that files with the ".pdf" extension are always detected as PDFs (regardless of file contents)
  - samples are included under "test/fixtures/with_pdf_extension"
  - sample file names reflect the actual contents
  - not sure that this is the correct behavior, but not looking to change the current approach
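The fixture-driven checks described above could be sketched roughly as follows. This is an illustration only: `detected_as_pdf?` and `undetected_fixtures` are hypothetical names, the case-insensitive extension check is an assumption, and the real tests live in the project's own test suite.

```ruby
# Hypothetical detection rule mirroring the behavior described above:
# a ".pdf" extension always counts as a PDF; otherwise fall back to the
# magic number, read in binary mode.
def detected_as_pdf?(path)
  return true if File.extname(path).downcase == '.pdf' # case-insensitivity is an assumption
  File.open(path, 'rb') do |f|
    first_line = f.gets("\n", 1024)
    !first_line.nil? && first_line.match?(/\A%PDF-\d+(\.\d+)?/)
  end
end

# Returns every regular file in a fixture directory that is NOT detected
# as a PDF. Globbing the directory means newly added samples are picked
# up without editing the test itself.
def undetected_fixtures(dir)
  Dir.glob(File.join(dir, '*'))
     .select { |path| File.file?(path) }
     .reject { |path| detected_as_pdf?(path) }
     .sort
end
```

A test would then assert that `undetected_fixtures` returns an empty list for each of the two fixture directories.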

jashkenas commented 10 years ago

Nice!

knowtheory commented 10 years ago

Thanks for fixing this @jonoterc and sorry for the delay in merging it (and :heart: the additional tests). Just cut a release for it: https://rubygems.org/gems/docsplit/versions/0.7.5

jonoterc commented 10 years ago

Great, thanks!

documentcloud / docsplit

making magic number-based detection of PDFs encoding-friendly, with tests #108