Closed NebularNerd closed 6 months ago
Sorry about the million commits, still learning BLACK nuances
Thank you for all your work towards this!
I agree with the goals of this for improved detection, but think the current JSON file is getting too limited for these more advanced techniques.
As part of the 2.0 push (spurred by you, so thank you!) I am working on switching away from reading the JSON file and putting the data in Python itself instead, possibly inside a graph rather than lists, so I'm unsure how that will change everything as of this moment.
Don't have any straight up answers at the moment, just wanted to actually reply for now to all your hard work, thank you again!
Thanks @cdgriffith 🙂
I think if we both agree we can close this for now, I'll keep my branch open so we can borrow/steal some of the code later if we find a new home for it.
I'm glad you like the idea, but I agree we're asking a lot from the .json and it would make adding data trickier later having the mixed implementations all jumbled up.
My recent ideas in #68 and #69 regarding naming and the amount of matching we could do in a more advanced system could help push PureMagic to provide very robust and detailed confidences, it will be interesting to see how far we can go. 🙂
I shall close this for now as it's not something we are going to use. I'll keep the branch alive on my fork so we can re-use some aspects in v2.0 if needed.
Could close #12, but likely not (skip to conclusion for why)
Based on trying to think of a way to help improve matches further (it's a great cure for my insomnia at night) I wanted to try adding a REGEX matcher into PureMagic. This should allow for higher confidence hits on files, especially those that share common markers such as PK and PAK. This may not be the definitive solution (I'll discuss why below) but it's a start towards making PureMagic more powerful.
How it works:
multi-part data
Example test entries in the .json:
Both .docx and .xlsx have 3000 in the offset field; this is because we can be 99% certain the matching bytes will be within the first 3000 bytes of the file. However, .xlsm has a 0, as we know what we are looking for but it could be anywhere in the file. As all three of these examples are essentially .zip files we can cheat and just use path/filenames we are expecting in the archive. As the structure is mostly rigid, we can assume (and my tests show):
- 0x776f72642f646f63756d656e742e786d6c / word/document.xml will be in the first 3000 bytes for .docx
- 0x786c2f776f726b626f6f6b2e786d6c / xl/workbook.xml will be in the first 3000 bytes for .xlsx
- 0x786c2f76626150726f6a6563742e62696e / xl/vbaProject.bin will be somewhere in a .xlsm; as it comes after the spreadsheet data we cannot predict a scan area.
Implementation:
To save reinventing the wheel I have leveraged the existing multi-part system; it works for this concept and only required minimal changes to the code. In the .json, entries are treated as before, with the only difference being that the offset becomes a block size to scan from byte zero, or 0 if unknown. Additionally, we prefix the name with ###REGEX### (with a space) as a nice clear trigger that would not have any real world use.
In the code, if the trigger is found in the name it will REGEX, otherwise it will perform the normal string matching as before. I ran the code through BLACK so it looks a bit wacky in the layout but that's how it wants it to be.
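As a rough illustration of the behaviour described above (not the actual code in this PR), the trigger check and block-size scan could look something like the sketch below. The function name `entry_matches` and its argument layout are hypothetical; only the `###REGEX### ` prefix, the "offset becomes a block size" rule, and the "0 means scan everything" rule come from the description.

```python
import re

# Name prefix used as the trigger (note the trailing space).
REGEX_TRIGGER = "###REGEX### "


def entry_matches(data: bytes, hex_signature: str, offset: int, name: str) -> bool:
    """Hypothetical helper: check one multi-part entry against file data."""
    # Accept signatures written with or without a leading "0x".
    if hex_signature.startswith("0x"):
        hex_signature = hex_signature[2:]
    needle = bytes.fromhex(hex_signature)

    if not name.startswith(REGEX_TRIGGER):
        # Normal entry: the number is a fixed offset for a plain byte comparison.
        return data[offset:offset + len(needle)] == needle

    # REGEX entry: the number is a block size from byte zero, or 0 for "anywhere".
    window = data[:offset] if offset else data
    return re.search(re.escape(needle), window) is not None


# Fake file: a PK header, word/document.xml early on, xl/vbaProject.bin far later.
sample = (
    b"PK\x03\x04" + b"\x00" * 100 + b"word/document.xml"
    + b"\x00" * 5000 + b"xl/vbaProject.bin"
)
docx_hit = entry_matches(sample, "0x776f72642f646f63756d656e742e786d6c", 3000, "###REGEX### docx")
xlsm_hit = entry_matches(sample, "0x786c2f76626150726f6a6563742e62696e", 0, "###REGEX### xlsm")
```

Using `re.escape` keeps the signature bytes literal, so the regex engine is only being used here for its search behaviour; real patterns could of course be richer than a fixed string.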
PROS:
This should improve detection generally; while I used .docx, .xlsx and .xlsm as examples, we could fingerprint similar PK based files:
- 0x4d4554412d494e462f4d414e49464553542e4d46 / META-INF/MANIFEST.MF
- 0x416e64726f69644d616e69666573742e786d6c / AndroidManifest.xml
- 0x6d696d65747970656170706c69636174696f6e2f766e642e6f617369732e6f70656e646f63756d656e742e74657874 / mimetypeapplication/vnd.oasis.opendocument.text (This may be a fixed string, need to investigate further but for the purpose of conversation I'll include it for now)
CONS:
While this will bring better results, there are some downsides/considerations:
- PK based files. In testing I can now generate a lot of 0.8's, but we possibly need to change the logic so the longest match always wins and is presented as the first match; see examples in https://github.com/cdgriffith/puremagic/issues/12#issuecomment-2094122612. (Should be fixed by #66)
- PK based files have common roots and therefore share common files. .jar and .apk both contain META-INF/MANIFEST.MF, so an .apk would likely give an equal score to a .jar. The same applies for .xlsx and .xlsm; they both have xl/workbook.xml in their file structure.
- PK based files: while it should be safe to assume that all files will always use the same case for filenames, it's entirely possible for them not to. For matching purposes, we may need to look at better fuzzy logic in the regexes to ensure that META-INF/MANIFEST.MF, Meta-Inf/Manifest.Mf and meta-inf/manifest.mf are all matchable.
CONCLUSION:
This means this solution is likely not the definitive one but more of a starting point; a rule-based system like the one @cdgriffith proposes is still the better path for this. Along those lines, a better solution could be:
- 504b030414000600.py (PK) file lives inside a definitions folder
Wow that was a lot of writing, especially as we'll likely not use this in the long run 🤣 Thoughts and suggestions?
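For the case-sensitivity and "longest match wins" points raised under CONS, here is a minimal sketch of one possible approach. Both helper names (`find_path_any_case`, `rank_matches`) and the `(name, signature, confidence)` tuple layout are made up for illustration and are not part of PureMagic.

```python
import re


def find_path_any_case(data: bytes, path: bytes) -> bool:
    """Hypothetical helper: look for an archive path, ignoring ASCII case."""
    # re.escape keeps characters such as "." literal; IGNORECASE on a bytes
    # pattern folds ASCII letters only, which is enough for these filenames.
    return re.search(re.escape(path), data, re.IGNORECASE) is not None


def rank_matches(matches):
    """Hypothetical helper: order (name, signature, confidence) tuples.

    Highest confidence first; ties are broken by the longest signature.
    """
    return sorted(matches, key=lambda m: (m[2], len(m[1])), reverse=True)


manifest = b"META-INF/MANIFEST.MF"
variants = [b"META-INF/MANIFEST.MF", b"Meta-Inf/Manifest.Mf", b"meta-inf/manifest.mf"]
all_found = all(find_path_any_case(b"PK\x03\x04" + v, manifest) for v in variants)

ranked = rank_matches([
    ("zip", b"PK\x03\x04", 0.8),
    ("docx", b"word/document.xml", 0.8),
])
```

With equal confidences, the longer `word/document.xml` signature sorts ahead of the bare PK header, which is the ordering the first CON asks for. It does not help with the second CON, where two formats share the same file and would still tie.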