cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

2024-05-06 Fix Confidence sorting #66

Closed NebularNerd closed 3 months ago

NebularNerd commented 3 months ago

While adding #65, one thing that came up was that the winner was not the longest match, no matter how long its byte_match was. This quick one-liner addresses that issue: matches are now sorted by confidence, then by byte_match.
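As a rough illustration of that ordering (a minimal sketch, not the actual puremagic code): the `Match` tuple and `sort_matches` helper below are hypothetical stand-ins, and the sort key is assumed from the description above.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    # Hypothetical stand-in for one puremagic result row.
    name: str
    extension: str
    confidence: float
    byte_match: bytes


def sort_matches(matches: List[Match]) -> List[Match]:
    # Highest confidence first; ties go to the byte_match that compares greater.
    return sorted(matches, key=lambda m: (m.confidence, m.byte_match), reverse=True)


PK = b"PK\x03\x04\x14\x00\x06\x00"
matches = [
    Match("MS Office Open XML Format Document", ".docx", 0.8, PK),
    Match("MS Office Open XML Format Word Document", ".docx", 0.8, PK + b"word/document.xml"),
    Match("MS Office Open XML Format Document", ".pptx", 0.4, b"PK\x03\x04"),
]

ranked = sort_matches(matches)
print(ranked[0].name)
```

Python compares bytes lexicographically, so a longer match always beats its own prefix, but two unrelated matches are ordered by content rather than by length. If strict length ordering is wanted, `len(m.byte_match)` could be used in the key instead.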

Before the change, Alternate match #2 should have been the winner:

China tax introduction(edited1).docx
Most likely match:
Format:        MS Office Open XML Format Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #1
Format:        Microsoft Office 2007+ Open XML Format Document file
Confidence:    80.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #2
Format:        MS Office Open XML Format Word Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00word/document.xml'
Hex:           504b 0304 1400 0600 776f 7264 2f64 6f63 756d 656e 742e 786d 6c
String:        PKword/document.xml

Alternate match #3
Format:        MS Office Open XML Format Document
Confidence:    40.0%
Extension:     .pptx
MIME:          application/vnd.openxmlformats-officedocument.presentationml.presentation
Offset:        0
Bytes Matched: b'PK\x03\x04'
Hex:           504b 0304
String:        PK

Alternate match #4
Format:        MS Office Open XML Format Document
Confidence:    40.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Offset:        0
Bytes Matched: b'PK\x03\x04'
Hex:           504b 0304
String:        PK

Omitting other 20+ matches

Now it is:

China tax introduction(edited1).docx
Most likely match:
Format:        MS Office Open XML Format Word Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        3000
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00word/document.xml'
Hex:           504b 0304 1400 0600 776f 7264 2f64 6f63 756d 656e 742e 786d 6c
String:        PKword/document.xml

Alternate match #1
Format:        MS Office Open XML Format Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #2
Format:        Microsoft Excel - Macro-Enabled Workbook
Confidence:    80.0%
Extension:     .xlsm
MIME:          application/vnd.ms-excel.sheet.macroEnabled.12
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #3
Format:        Microsoft Office 2007+ Open XML Format Document file
Confidence:    80.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #4
Format:        MS Office Open XML Format Document
Confidence:    40.0%
Extension:     .pptx
MIME:          application/vnd.openxmlformats-officedocument.presentationml.presentation
Offset:        0
Bytes Matched: b'PK\x03\x04'
Hex:           504b 0304
String:        PK

Omitting other 20+ matches

This will help in the future with really long matches if PureMagic adopts rules. For example, for this .xlsm file, Alternate match #1 would be the better choice because the file matches the VBA aspect. However, regular Excel wins because it has a slightly longer match, and the macro-flavored version comes second. With a rules-based system you could combine the two into a MEGA MATCH! of b'PK\x03\x04\x14\x00\x06\x00xl/workbook.xmlxl/vbaProject.bin', which should be unbeatable 😎:

ID-Generator-MACRO.xlsm
Most likely match:
Format:        Microsoft Office 2007+ Open XML Format Excel Document file
Confidence:    80.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        3000
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00xl/workbook.xml'
Hex:           504b 0304 1400 0600 786c 2f77 6f72 6b62 6f6f 6b2e 786d 6c
String:        PKxl/workbook.xml

Alternate match #1
Format:        Microsoft Excel - Macro-Enabled Workbook
Confidence:    80.0%
Extension:     .xlsm
MIME:          application/vnd.ms-excel.sheet.macroEnabled.12
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00xl/vbaProject.bin'
Hex:           504b 0304 1400 0600 786c 2f76 6261 5072 6f6a 6563 742e 6269 6e
String:        PKxl/vbaProject.bin

Alternate match #2
Format:        MS Office Open XML Format Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #3
Format:        Microsoft Excel - Macro-Enabled Workbook
Confidence:    80.0%
Extension:     .xlsm
MIME:          application/vnd.ms-excel.sheet.macroEnabled.12
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #4
Format:        Microsoft Office 2007+ Open XML Format Document file
Confidence:    80.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Omitting other 20+ matches
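
The combined-match idea could be sketched roughly as follows. This is purely illustrative: PureMagic has no such rules system in this PR, and `combined_match` and the sample data are hypothetical names made up for the example.

```python
from typing import List, Optional


def combined_match(data: bytes, header: bytes, fragments: List[bytes]) -> Optional[bytes]:
    """Return header plus every fragment if data starts with header
    and contains all fragments; otherwise return None."""
    if not data.startswith(header):
        return None
    if not all(fragment in data for fragment in fragments):
        return None
    return header + b"".join(fragments)


PK = b"PK\x03\x04\x14\x00\x06\x00"
# Fake file contents, only for demonstration.
fake_xlsm = PK + b"...xl/workbook.xml...xl/vbaProject.bin..."

mega = combined_match(fake_xlsm, PK, [b"xl/workbook.xml", b"xl/vbaProject.bin"])
print(mega)
```

A combined byte_match like this is longer than either single match, so it would win a confidence tie under the new ordering.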

cdgriffith commented 3 months ago

Thank you for this fix!