2024 05 04 Experimental regex support (No rush to merge, proof of concept/feasibility discussion)

NebularNerd commented 3 months ago

Could close #12, but likely not (skip to conclusion for why)

While trying to think of a way to improve matches further (it's a great cure for my insomnia at night), I wanted to try adding a REGEX matcher into PureMagic. This should allow for higher confidence hits on files, especially those that share common markers such as PK and PAK. This may not be the definitive solution (I'll discuss why below) but it's a start towards making PureMagic more powerful.

How it works:

Scan for regular magic bytes as normal
Find a matching entry inside the multi-part data
Scan either a defined block size from the start of the file where we can be certain it's somewhere in that region, or scan the whole file (where we have no idea of a fixed point).
Find a match and add to results list

Example test entries in the .json:

    "504b030414000600" : [
      ["776f72642f646f63756d656e742e786d6c", 3000, ".docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "###REGEX### MS Office Open XML Format Word Document"],
      ["786c2f776f726b626f6f6b2e786d6c", 3000, ".xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "###REGEX### Microsoft Office 2007+ Open XML Format Excel Document file"],
      ["786c2f76626150726f6a6563742e62696e", 0, ".xlsm", "application/vnd.ms-excel.sheet.macroEnabled.12","###REGEX### Microsoft Excel - Macro-Enabled Workbook"]
    ]

Both .docx and .xlsx have 3000 in the offset field because we can be 99% certain the matching bytes will be within the first 3000 bytes of the file. However, .xlsm has a 0: we know what we are looking for, but it could be anywhere in the file. As all three of these examples are essentially .zip files, we can cheat and just use the path/filenames we are expecting in the archive. As the structure is mostly rigid, we can assume (and my tests show):

0x776f72642f646f63756d656e742e786d6c / word/document.xml will be in the first 3000 bytes for .docx
0x786c2f776f726b626f6f6b2e786d6c / xl/workbook.xml will be in the first 3000 bytes for .xlsx
0x786c2f76626150726f6a6563742e62696e / xl/vbaProject.bin will be somewhere in a .xlsm; as it comes after the spreadsheet data, we cannot predict a scan area.
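
The hex strings above are just the archive member paths encoded as ASCII. A quick sanity check (not part of the PR, only the standard library) confirms the pairs line up:

```python
# Hex fingerprints from the example entries, mapped to the paths they should decode to.
fingerprints = {
    "776f72642f646f63756d656e742e786d6c": "word/document.xml",
    "786c2f776f726b626f6f6b2e786d6c": "xl/workbook.xml",
    "786c2f76626150726f6a6563742e62696e": "xl/vbaProject.bin",
}

# Decode each hex string back into text.
decoded = {hex_str: bytes.fromhex(hex_str).decode("ascii") for hex_str in fingerprints}

for hex_str, expected in fingerprints.items():
    print(f"{decoded[hex_str]:<20} matches: {decoded[hex_str] == expected}")
```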

Implementation:

To save reinventing the wheel I have leveraged the existing multi-part system; it works for this concept and only required minimal changes to the code. In the .json, entries are treated as before, with the only difference being that the offset becomes a block size to scan from byte zero, or 0 if unknown. Additionally, we prefix the name with ###REGEX### (followed by a space) as a nice clear trigger that would not have any real world use.

In the code if the trigger is found in the name it will REGEX, otherwise it will perform the normal string matching as before. I ran the code through BLACK so it looks a bit wacky in the layout but that's how it wants it to be.
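
As a rough illustration of that branch (a sketch under assumptions, not the code in this PR): the helper name `match_entry`, its parameters and the assumption that normal matching is a fixed, non-negative offset comparison are all hypothetical. The entry layout simply mirrors the order of the .json fields shown above.

```python
import re

REGEX_TRIGGER = "###REGEX### "  # name prefix from the example .json entries


def match_entry(data: bytes, entry: list) -> bool:
    """Return True if a multi-part entry matches the file contents.

    entry is [hex_bytes, offset_or_block_size, extension, mime, name].
    Hypothetical helper for illustration only.
    """
    hex_bytes, offset, _ext, _mime, name = entry
    needle = bytes.fromhex(hex_bytes)

    if name.startswith(REGEX_TRIGGER):
        # Regex mode: offset is a block size from byte zero, 0 means whole file.
        region = data[:offset] if offset else data
        return re.search(re.escape(needle), region) is not None

    # Normal mode (assumed): the bytes must sit exactly at the given offset.
    return data[offset:offset + len(needle)] == needle


# Usage with the .docx entry from above and a fake file header.
docx_entry = [
    "776f72642f646f63756d656e742e786d6c",
    3000,
    ".docx",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "###REGEX### MS Office Open XML Format Word Document",
]
sample = bytes.fromhex("504b030414000600") + b"\x00" * 22 + b"word/document.xml"
print(match_entry(sample, docx_entry))  # True
```

Escaping the needle keeps bytes such as `.` and `/` literal, so this behaves like a substring search; a real implementation could store genuine patterns instead.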

PROS:

This should improve detection across the board: while I'm using .docx, .xlsx and .xlsm as examples, we could fingerprint similar PK based files.

.jar = 0x4d4554412d494e462f4d414e49464553542e4d46 / META-INF/MANIFEST.MF
.apk = 0x416e64726f69644d616e69666573742e786d6c / AndroidManifest.xml
.odt = 0x6d696d65747970656170706c69636174696f6e2f766e642e6f617369732e6f70656e646f63756d656e742e74657874 / mimetypeapplication/vnd.oasis.opendocument.text (This may be a fixed string, need to investigate further but for the purpose of conversation I'll include it for now)

CONS:

While this will bring better results, there are some downsides/considerations:

Memory size / Speed: Trying to regex a whole file could be an issue on low-powered systems; equally, if the file is too big to read into memory in one go it will break something. This could be mitigated with some clever maths to read the file in overlapping chunks that fit within memory; how to do that without relying on external libraries might need some sideways logic.
Confidence scores: The issue we now face is how to ensure we have a clear winner, especially for PK based files. In testing I can now generate a lot of 0.8's but we need to possibly change the logic so the longest match always wins and is presented as first match, see examples in https://github.com/cdgriffith/puremagic/issues/12#issuecomment-2094122612. (Should be fixed by #66)
Confidence clashes: This is slightly separate from the above, a lot of the PK based files have common roots and therefore share common files. .jar and .apk both contain META-INF/MANIFEST.MF so an .apk would likely give an equal score to a .jar. The same applies for .xlsx and .xlsm, they both have xl/workbook.xml in their file structure.
Casing: This again applies mostly to PK based files. While it should be safe to assume that all files will always use the same case for filenames, it's entirely possible for them not to. For matching purposes, we may need to look at better fuzzy logic in the regexes to ensure that META-INF/MANIFEST.MF, Meta-Inf/Manifest.Mf and meta-inf/manifest.mf are all matchable.
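
For the memory and casing points, one possible approach is sketched below using only the standard library. The function name `scan_in_chunks` and its parameters are hypothetical and not part of PureMagic. It reads a fixed-size chunk at a time and carries over the last `len(needle) - 1` bytes, so a match split across two reads is still found. `re.IGNORECASE` on a bytes pattern handles ASCII case differences.

```python
import io
import re


def scan_in_chunks(stream, needle: bytes, chunk_size: int = 1024 * 1024,
                   ignore_case: bool = True) -> bool:
    """Search a binary stream for needle without loading it all into memory."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(re.escape(needle), flags)
    # Bytes to keep from the previous window so split matches are not missed.
    overlap = max(len(needle) - 1, 0)
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False  # reached end of stream without a match
        window = tail + chunk
        if pattern.search(window):
            return True
        tail = window[-overlap:] if overlap else b""


# Usage: a mixed-case manifest path, read 16 bytes at a time.
fake_jar = b"PK\x03\x04" + b"\x00" * 50 + b"Meta-Inf/Manifest.Mf" + b"\x00" * 50
print(scan_in_chunks(io.BytesIO(fake_jar), b"META-INF/MANIFEST.MF", chunk_size=16))
```

Memory use stays bounded by roughly `chunk_size + len(needle)`, at the cost of rescanning a few overlapping bytes per chunk.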

CONCLUSION:

This leads to this solution likely not being the definitive one but more of a starting point; a rule-based system like @cdgriffith proposes is still the better path for this. Along those lines, a better solution could be:

Perform initial magic.json match
Check if a matching 504b030414000600.py (PK) file lives inside a definitions folder

Process the rules inside such as in this crude example:

If "META-INF/MANIFEST.MF" and "AndroidManifest.xml" it's an .apk
If "META-INF/MANIFEST.MF" and not "AndroidManifest.xml" it's a .jar
etc...
Collate results and send back to main.py
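
The crude rules above could look something like this inside such a definitions file. This is purely a sketch: `classify_pk` is a hypothetical name, and a plain byte-substring check stands in for whatever matching the real rules would use.

```python
def classify_pk(data: bytes) -> list:
    """Apply crude presence/absence rules to the bytes of a PK based file."""
    has_manifest = b"META-INF/MANIFEST.MF" in data
    has_android = b"AndroidManifest.xml" in data

    results = []
    if has_manifest and has_android:
        results.append(".apk")  # both markers present
    elif has_manifest:
        results.append(".jar")  # manifest only, no Android marker
    # etc... further rules would go here
    return results


# Usage with fake archive contents.
print(classify_pk(b"PK\x03\x04...META-INF/MANIFEST.MF...AndroidManifest.xml"))
print(classify_pk(b"PK\x03\x04...META-INF/MANIFEST.MF..."))
```

Because the rules test for absence as well as presence, the .jar/.apk clash described in the CONS does not arise here.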

Take those results, add them to the string matches and sort the confidence list in order of longest byte match first. The .apk match would have three sets of bytes (PK, MANIFEST and ANDROID), which would be a very long match; this should ensure we have a clearer winner.
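
A minimal sketch of that ordering, assuming each candidate carries the list of byte strings it matched (the `candidates` structure and `total_match_length` helper are made up for illustration):

```python
# (extension, matched byte strings) for three hypothetical candidates.
candidates = [
    (".zip", [b"PK\x03\x04"]),
    (".jar", [b"PK\x03\x04", b"META-INF/MANIFEST.MF"]),
    (".apk", [b"PK\x03\x04", b"META-INF/MANIFEST.MF", b"AndroidManifest.xml"]),
]


def total_match_length(candidate) -> int:
    """Sum the lengths of every byte string the candidate matched."""
    _ext, matches = candidate
    return sum(len(m) for m in matches)


# Longest combined match first.
ranked = sorted(candidates, key=total_match_length, reverse=True)
print([ext for ext, _ in ranked])  # ['.apk', '.jar', '.zip']
```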

Wow that was a lot of writing, especially as we'll likely not use this in the long run 🤣 Thoughts and suggestions?

NebularNerd commented 3 months ago

Sorry about the million commits, still learning BLACK nuances

cdgriffith commented 3 months ago

Thank you for all your work towards this!

I agree with the goals of this for improved detection, but think the current JSON file is getting too limited for these more advanced techniques.

As part of the 2.0 push (spurred by you, so thank you!) I am working on switching from reading the JSON file to putting the data in Python itself, possibly inside a graph instead of lists, so I'm unsure how that will change everything as of this moment.

Don't have any straight up answers at the moment, just wanted to actually reply for now to all your hard work, thank you again!

NebularNerd commented 3 months ago

Thanks @cdgriffith 🙂

I think if we both agree we can close this for now, I'll keep my branch open so we can borrow/steal some of the code later if we find a new home for it.

I'm glad you like the idea, but I agree we're asking a lot from the .json and it would make adding data trickier later having the mixed implementations all jumbled up.

My recent ideas in #68 and #69 regarding naming and the amount of matching we could do in a more advanced system could help push PureMagic to provide very robust and detailed confidences, it will be interesting to see how far we can go. 🙂

NebularNerd commented 3 months ago

I shall close this for now as it's not something we are going to use. I'll keep the branch alive on my fork so we can re-use some aspects in v2.0 if needed.
