cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

2024 05 04 Experimental regex support (No rush to merge, proof of concept/feasibility discussion) #65

Closed NebularNerd closed 3 months ago

NebularNerd commented 3 months ago

Could close #12, but likely not (skip to conclusion for why)

Based on trying to think of a way to help improve matches further (it's a great cure for my insomnia at night), I wanted to try adding a regex matcher to PureMagic. This should allow for higher-confidence hits on files, especially those that share common markers such as PK and PAK. This may not be the definitive solution (I'll discuss why below), but it's a start towards making PureMagic more powerful.

How it works:

  1. Scan for regular magic bytes as normal
  2. Find a matching entry inside the multi-part data
  3. Scan either a defined block size from the start of the file where we can be certain it's somewhere in that region, or scan the whole file (where we have no idea of a fixed point).
  4. Find a match and add to results list
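The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `MULTI_PART` table, the `identify` function and the shortened `504b0304` header are hypothetical, and the sub-entry patterns are assumed to be matched as literal bytes (hence `re.escape`).

```python
import re

# Hypothetical, simplified table: header magic (hex) -> list of sub-entries.
# Each sub-entry is (pattern as hex, block size, extension, description).
MULTI_PART = {
    "504b0304": [
        ("776f72642f646f63756d656e742e786d6c", 3000, ".docx", "Word document"),
        ("786c2f776f726b626f6f6b2e786d6c", 3000, ".xlsx", "Excel workbook"),
        ("786c2f76626150726f6a6563742e62696e", 0, ".xlsm", "Macro-enabled workbook"),
    ],
}


def identify(data: bytes) -> list:
    """Return (extension, description) pairs for every sub-entry that matches."""
    results = []
    for magic_hex, entries in MULTI_PART.items():
        # Step 1: regular magic-byte check at the start of the file.
        if not data.startswith(bytes.fromhex(magic_hex)):
            continue
        # Step 2: walk the multi-part entries registered for that magic number.
        for pattern_hex, block_size, ext, name in entries:
            # Step 3: scan a fixed block from byte zero, or the whole file if 0.
            region = data[:block_size] if block_size else data
            pattern = re.escape(bytes.fromhex(pattern_hex))
            # Step 4: record the hit.
            if re.search(pattern, region):
                results.append((ext, name))
    return results
```

A file whose marker sits beyond the block size is simply not reported for that entry, which is why an entry with block size 0 (whole file) is the fallback when the position is unknown.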

Example test entries in the .json:

    "504b030414000600" : [
      ["776f72642f646f63756d656e742e786d6c", 3000, ".docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "###REGEX### MS Office Open XML Format Word Document"],
      ["786c2f776f726b626f6f6b2e786d6c", 3000, ".xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "###REGEX### Microsoft Office 2007+ Open XML Format Excel Document file"],
      ["786c2f76626150726f6a6563742e62696e", 0, ".xlsm", "application/vnd.ms-excel.sheet.macroEnabled.12","###REGEX### Microsoft Excel - Macro-Enabled Workbook"]
    ]
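The hex patterns in these entries are plain ASCII archive member paths. A small sketch to decode them, and to check that such a path appears near the start of a zip container, might look like this. The `build_zip` helper is hypothetical and uses Python's `zipfile` module, so the archive it writes only shares the generic `PK\x03\x04` signature with a real Office file, not necessarily the longer `504b030414000600` header.

```python
import io
import zipfile

# Patterns copied from the example entries, mapped to their extensions.
PATTERNS = {
    "776f72642f646f63756d656e742e786d6c": ".docx",
    "786c2f776f726b626f6f6b2e786d6c": ".xlsx",
    "786c2f76626150726f6a6563742e62696e": ".xlsm",
}

# Decode each hex string to the raw bytes that will be searched for.
decoded = {bytes.fromhex(hex_str): ext for hex_str, ext in PATTERNS.items()}


def build_zip(names):
    """Hypothetical helper: build a small in-memory zip with the given members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "<xml/>")
    return buf.getvalue()


raw = build_zip(["[Content_Types].xml", "word/document.xml"])

for pattern, ext in decoded.items():
    # Member names are stored uncompressed in the local file headers,
    # so a plain byte search over the first block can find them.
    print(pattern.decode("ascii"), ext, pattern in raw[:3000])
```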

Both .docx and .xlsx have 3000 in the offset field because we can be 99% certain the matching bytes will be within the first 3000 bytes of the file. However, .xlsm has a 0: we know what we are looking for, but it could be anywhere in the file. As all three of these examples are essentially .zip files, we can cheat and just use the path/filenames we expect in the archive. As the structure is mostly rigid, we can assume (and my tests show):

Implementation:

To save reinventing the wheel I have leveraged the existing multi-part system; it works for this concept and only required minimal changes to the code. In the .json, entries are treated as before, the only difference being that the offset becomes a block size to scan from byte zero, or 0 if unknown. Additionally, we prefix the name with ###REGEX### (with a trailing space) as a nice clear trigger that would not have any real-world use.

In the code, if the trigger is found in the name it will run a regex search; otherwise it will perform the normal string matching as before. I ran the code through Black, so it looks a bit wacky in the layout, but that's how it wants it to be.
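As a rough sketch of that dispatch, assuming the trigger is checked as a prefix of the name and the pattern is matched as literal bytes (the names `REGEX_TRIGGER`, `entry_matches` and `display_name` are hypothetical, not taken from the PR):

```python
import re

# The trigger described above, including its trailing space.
REGEX_TRIGGER = "###REGEX### "


def entry_matches(data: bytes, pattern_hex: str, offset: int, name: str) -> bool:
    """Check one multi-part entry against the file contents."""
    pattern = bytes.fromhex(pattern_hex)
    if name.startswith(REGEX_TRIGGER):
        # Regex mode: "offset" is a block size from byte zero, 0 = whole file.
        region = data[:offset] if offset else data
        return re.search(re.escape(pattern), region) is not None
    # Normal mode: the pattern must sit exactly at the given offset.
    return data[offset:offset + len(pattern)] == pattern


def display_name(name: str) -> str:
    """Strip the trigger so it does not leak into reported names."""
    if name.startswith(REGEX_TRIGGER):
        return name[len(REGEX_TRIGGER):]
    return name
```

Keeping the trigger in the name field means the existing entry layout stays unchanged, at the cost of giving the offset field two different meanings depending on the entry.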

PROS:

This should improve detection more generally: while I used .docx, .xlsx and .xlsm as examples, we could fingerprint similar PK-based files the same way.

CONS:

While this will bring better results, there are some downsides/considerations:

CONCLUSION:

This means this solution is likely not the definitive one but more a starting point; a rule-based system like @cdgriffith proposes is still the better path for this. Along those lines a better solution could be:

Wow that was a lot of writing, especially as we'll likely not use this in the long run 🤣 Thoughts and suggestions?

NebularNerd commented 3 months ago

Sorry about the million commits, still learning BLACK nuances

cdgriffith commented 3 months ago

Thank you for all your work towards this!

I agree with the goals of this for improved detection, but think the current JSON file is getting too limited for these more advanced techniques.

As part of the 2.0 push (spurred by you, so thank you!) I am working on switching from reading from the JSON file to putting the data in Python itself, possibly inside a graph instead of lists, so I'm unsure how that will change everything as of this moment.

Don't have any straight-up answers at the moment, just wanted to actually reply for now to all your hard work. Thank you again!

NebularNerd commented 3 months ago

Thanks @cdgriffith 🙂

I think if we both agree we can close this for now, I'll keep my branch open so we can borrow/steal some of the code later if we find a new home for it.

I'm glad you like the idea, but I agree we're asking a lot from the .json and it would make adding data trickier later having the mixed implementations all jumbled up.

My recent ideas in #68 and #69 regarding naming and the amount of matching we could do in a more advanced system could help push PureMagic to provide very robust and detailed confidences, it will be interesting to see how far we can go. 🙂

NebularNerd commented 3 months ago

I shall close this for now as it's not something we are going to use. I'll keep the branch alive on my fork so we can re-use some aspects in v2.0 if needed.