Magic-number-based PDF detection is vulnerable to encoding issues. Forcing the first line to be read in binary mode should work across encodings, since the "%PDF" marker consists of 7-bit ASCII characters in any case. This change also removes the end-of-line anchor from the PDF version regex, to avoid issues with non-printing characters in certain cases.
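A minimal sketch of the idea described above, not the actual patch: the helper name `pdf_header?` and the constant `PDF_HEADER` are hypothetical, and the exact regex used in the project may differ. It reads the first line as raw bytes and matches an unanchored version pattern, so bytes that follow the version number do not cause a failure.

```ruby
require 'tempfile'

# Hypothetical pattern (an assumption, not the project's regex): there is no
# end-of-line anchor, so bytes following the version number are ignored.
PDF_HEADER = /%PDF-\d+(\.\d+)?/

# Hypothetical helper: opens the file in binary mode ('rb') so the first line
# is returned as raw bytes and is never interpreted in a text encoding.
def pdf_header?(path)
  first_line = File.open(path, 'rb') { |f| f.gets }
  return false if first_line.nil? # empty file

  !!(first_line =~ PDF_HEADER)
end

# Usage: a header followed by a carriage return and non-ASCII bytes,
# the kind of first line that can trip up text-mode reads.
Tempfile.create('sample') do |f|
  f.binmode
  f.write("%PDF-1.7\r\xE2\xE3\xCF\xD3\n".b)
  f.flush
  puts pdf_header?(f.path)
end
```

Because the pattern contains only ASCII characters, it can be matched against a binary string without an encoding compatibility error.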
Added tests. Details:
- Extracted PDF detection from the ensure_pdfs method for more direct testing.
- New unit tests verify that PDFs without the ".pdf" extension are detected as PDFs.
  - Samples are included under "test/fixtures/without_pdf_extension".
  - Any further PDFs (without extensions) added there will be automatically included in the tests.
  - PDFs generated by Adobe InDesign previously raised encoding errors; samples (one for each recent PDF version) are included.
- New unit tests verify that files with the ".pdf" extension are always detected as PDFs (regardless of file contents).
  - Samples are included under "test/fixtures/with_pdf_extension".
  - Sample file names reflect the actual contents.
  - Not sure that this is the correct behavior, but not looking to change the current approach.
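The fixture-driven checks described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's test code: `looks_like_pdf?` and `undetected_fixtures` are hypothetical names, the case-insensitive extension check is an assumption, and a temporary directory stands in for the real fixture folders.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the extracted detection method: a ".pdf" extension
# is accepted regardless of contents; otherwise the first line is read in
# binary mode and checked for the "%PDF-x.y" marker.
def looks_like_pdf?(path)
  return true if File.extname(path).downcase == '.pdf'

  first_line = File.open(path, 'rb') { |f| f.gets }
  !first_line.nil? && !!(first_line =~ /%PDF-\d+(\.\d+)?/)
end

# Hypothetical helper: returns every file in a fixture directory that is NOT
# detected as a PDF. Globbing the directory means newly added samples are
# covered without editing the test itself.
def undetected_fixtures(dir)
  Dir.glob(File.join(dir, '*'))
     .select { |path| File.file?(path) }
     .reject { |path| looks_like_pdf?(path) }
end

# Usage with throwaway fixtures in a temporary directory.
Dir.mktmpdir do |dir|
  File.binwrite(File.join(dir, 'indesign_sample'), "%PDF-1.6\r\xE2\xE3\xCF\xD3\n".b)
  File.binwrite(File.join(dir, 'plain_text.pdf'), "not really a PDF\n")
  puts undetected_fixtures(dir).empty?
end
```

A test over "test/fixtures/without_pdf_extension" would then assert that the returned list is empty.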