cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

PDF files are not always detected #94

Open peterekepeter opened 1 month ago

peterekepeter commented 1 month ago

From my testing, the %PDF- sequence does not necessarily have to be at offset 0. It can be located anywhere in the file. For example, I can type some junk at the beginning of the file and it still opens.

I received multiple files like this from people, so there is something or someone out in the wild that adds extra characters in front of the magic sequence.

A detector could instead search for the magic sequence as a substring within a search window, something like this:

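def is_pdf(file_path):
    """Return True if b"%PDF-" appears in the first 1024 bytes of the file."""
    # open() and read() may raise OSError (IOError is an alias in Python 3)
    with open(file_path, "rb") as file:
        header = file.read(1024)
    return b"%PDF-" in header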

From what I can see, the library is currently not built to handle this kind of situation, so I'm leaving this ticket here with this code snippet in case more advanced detection is implemented.
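If it helps, here is a slightly more general sketch of the same idea that works on any binary stream and reports where the magic sequence was found. The names `find_pdf_header`, `is_pdf_stream`, `PDF_MAGIC` and `SEARCH_WINDOW` are made up for illustration and are not part of puremagic; the 1024-byte window is the same assumption as in the snippet above.

```python
import io

PDF_MAGIC = b"%PDF-"
SEARCH_WINDOW = 1024  # assumed window size, same value as in the snippet above


def find_pdf_header(stream, window=SEARCH_WINDOW):
    """Return the offset of b"%PDF-" if it starts within the first
    `window` bytes of a binary stream, otherwise -1.

    Hypothetical helper, not a puremagic API.
    """
    # Read a few extra bytes so a magic sequence that starts on the last
    # byte of the window is not cut in half.
    head = stream.read(window + len(PDF_MAGIC) - 1)
    return head.find(PDF_MAGIC)


def is_pdf_stream(stream, window=SEARCH_WINDOW):
    """Return True if the magic sequence starts inside the search window."""
    return find_pdf_header(stream, window) >= 0


# Usage: junk bytes in front of the header are tolerated.
sample = io.BytesIO(b"junk\n%PDF-1.7\n")
offset = find_pdf_header(sample)  # 5, since b"junk\n" is 5 bytes long
```

Returning the offset instead of a plain boolean would let a caller tell a clean file (offset 0) apart from one with leading junk, which might be useful for reporting a lower confidence.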