cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

Some common filetypes are not detected #12

Open victordomingos opened 6 years ago

victordomingos commented 6 years ago

Pure magic seems to be failing to detect some very common file types, like text files (.py, .txt, .md).

$ file changelog.txt
changelog.txt: ASCII English text

$ python3.6 -m puremagic ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified

$ python3.6 -m puremagic -m ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified

cdgriffith commented 6 years ago

You are correct: it cannot detect these, because those file types do not have magic numbers to detect. They require additional analysis for a best guess, which I have not written.

For example, it does support Python files whose first line is formatted as '#!/usr/bin/env python', but it would be better to upgrade this module to do some looser matching or some analysis to give more and better results. (I already tried to capture this idea in https://github.com/cdgriffith/puremagic/issues/3, but it is better spelled out with your example.)
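As a rough illustration of what "looser matching" of the first line could look like, here is a minimal sketch. The helper name `looks_like_python_script` and the regular expression are assumptions made for this example; they are not part of puremagic, and a shebang check alone cannot identify Python files that have no shebang.

```python
import re

# Hypothetical pattern, not taken from puremagic: accepts any shebang line that
# names a python interpreter, e.g. "#!/usr/bin/env python3" or "#!/usr/bin/python3.6 -u".
_PYTHON_SHEBANG = re.compile(rb"^#![^\r\n]*\bpython(\d+(\.\d+)?)?\b")


def looks_like_python_script(head: bytes) -> bool:
    """Return True if the first line of `head` is a shebang that names Python."""
    first_line = head.split(b"\n", 1)[0]
    return _PYTHON_SHEBANG.match(first_line) is not None
```

Only the first line is inspected, so passing the first few hundred bytes of a file is enough.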

I don't currently have the time to work on it, but I do at least remember how I thought about implementing it, which I will capture in this issue:

ionecum commented 5 months ago

For example it does support Python files with their first line formatted as '#!/usr/bin/env python',

This would not work, because a shebang is not mandatory in Python (or in other common script types). A shebang is only relevant to runnable scripts that you wish to execute without explicitly specifying the program to run them through. You wouldn't typically put a shebang in a Python module that only contains function and class definitions meant to be imported from other modules. Therefore lots of Python files do not have a shebang, and it is not enough to identify a Python file. Also, this would not work for Python running on Windows.

cdgriffith commented 5 months ago

Correct, @ionecum. That example is there to show where puremagic could do better with a more in-depth parser, rather than just matching the first lines of a file.

NebularNerd commented 4 months ago

Looking at a few .docx and .xlsx files (I imagine the other formats are the same), they all seem to feature [Content_Types].xml ¢ starting at byte 30. Matching on this should improve on the PK header and extension alone.

image

Will do some further digging.
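The name appears at byte 30 because a ZIP local file header is 30 bytes long and the entry's file name follows it directly. A minimal sketch of reading that name is below; the function name `first_zip_entry_name` is made up for this example and is not a puremagic API.

```python
import struct
from typing import Optional

ZIP_LOCAL_HEADER = b"PK\x03\x04"


def first_zip_entry_name(head: bytes) -> Optional[bytes]:
    """Return the name of the first entry in a ZIP file, read from its local header."""
    if len(head) < 30 or not head.startswith(ZIP_LOCAL_HEADER):
        return None
    # Bytes 26-27 hold the file name length (little-endian); the name starts at byte 30.
    (name_len,) = struct.unpack_from("<H", head, 26)
    name = head[30:30 + name_len]
    # Return None if the buffer was too short to contain the whole name.
    return name if len(name) == name_len else None
```

A check such as `first_zip_entry_name(head) == b"[Content_Types].xml"` only works when that entry happens to be written first in the archive, which is an assumption based on the few files examined above.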

NebularNerd commented 4 months ago

Looking at File Format: DOCX (https://docs.fileformat.com/word-processing/docx/), the best way to match these would be to regex through the file for the REL string, for example:

<Relationship Id#"rId1" Type#"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target#"word/document.xml"/>

This would provide a solid hit every time, but would need some changes to the PureMagic logic. An idea for this might be:

  1. Still use the existing PK match to get the ball rolling. This prevents excessive regexing, because the regex is only used for the secondary match.
  2. Use a modified version of the multi-part match to hold a regex string in hex (this ensures we can safely store the required characters). I left the 0 in as a dummy value for the example; we could of course ditch it, as a regex would not require it:
    "regex": {
      "504b0304": [
        ["3C52656C6174696F6E736869702049642322724964312220547970652322687474703A2F2F736368656D61732E6F70656E786D6C666F726D6174732E6F72672F6F6666696365446F63756D656E742F323030362F72656C6174696F6E73686970732F6F6666696365446F63756D656E7422205461726765742322776F72642F646F63756D656E742E786D6C222F3E", 0, ".docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
         "MS Office Open XML Format Word Document"]
      ]
    }
  3. Process and treat in a similar way to a regular multipart match.

This would obviously have pros and cons. The con is that it would look at anything with a PK header, potentially taking longer to match and needing more memory if you have a huge file. The pro would be, in theory, a solid high-confidence match for a wide variety of PK-based files, provided there is an obvious fingerprint we can identify.
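The two-stage idea can be sketched roughly as follows. Everything here is hypothetical: `SECONDARY_RULES` and `secondary_match` are invented names, not puremagic internals. The sketch also swaps the Relationship string for the archive member name `word/document.xml`, on the assumption that the secondary pattern must appear uncompressed in the raw bytes, which member names do.

```python
import re
from typing import Optional

PK_HEADER = bytes.fromhex("504b0304")

# Hypothetical rule table: each secondary pattern is stored as hex, as proposed,
# and is only tried once the cheap PK header match has succeeded.
SECONDARY_RULES = [
    # hex for b"word/document.xml"
    ("776f72642f646f63756d656e742e786d6c", ".docx"),
]


def secondary_match(data: bytes) -> Optional[str]:
    """Return an extension if a secondary pattern is found in a PK file, else None."""
    if not data.startswith(PK_HEADER):
        return None  # skip the expensive search for anything that is not PK based
    for hex_pattern, extension in SECONDARY_RULES:
        literal = bytes.fromhex(hex_pattern)
        # re.escape keeps bytes such as "." from being read as regex syntax.
        if re.search(re.escape(literal), data):
            return extension
    return None
```

This searches the whole buffer, so it carries the time and memory cost described above for large files.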

@cdgriffith I know you are looking at ways to expand PureMagic's abilities, is this something that would be of interest?

EDIT: That's interesting. My proof-of-concept regex matcher works but still cannot outrank other matches. I even made a longer match for .docx, using the same longer match as .xlsx (0x504b030414000600). Alternate match #2 should be the clear winner here...

China tax introduction(edited1).docx
Most likely match:
Format:        MS Office Open XML Format Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #1
Format:        Microsoft Office 2007+ Open XML Format Document file
Confidence:    80.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #2
Format:        MS Office Open XML Format Word Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00word/document.xml'
Hex:           504b 0304 1400 0600 776f 7264 2f64 6f63 756d 656e 742e 786d 6c
String:        PKword/document.xml

Alternate match #3
Format:        MS Office Open XML Format Document
Confidence:    40.0%
Extension:     .pptx
MIME:          application/vnd.openxmlformats-officedocument.presentationml.presentation
Offset:        0
Bytes Matched: b'PK\x03\x04'
Hex:           504b 0304
String:        PK

Alternate match #4
Format:        MS Office Open XML Format Document
Confidence:    40.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Offset:        0
Bytes Matched: b'PK\x03\x04'
Hex:           504b 0304
String:        PK

Omitting other 20+ matches

ionecum commented 4 months ago

Hello, I see. Some web forms may accept docx files or pdf (much more important), but your solution is too specific to Microsoft. What if a user sends a LibreOffice or ONLYOFFICE document from Linux or Mac? What if the user sends something from Android, which is today the most common case?

We should find a more general approach. Sincerely DR


cdgriffith commented 4 months ago

@NebularNerd yes, I would love to see a rules engine that can be engaged after the initial fast match to do those deeper searches, similar to what is outlined at the start of the issue: https://github.com/cdgriffith/puremagic/issues/12#issuecomment-387574987

NebularNerd commented 4 months ago

Hello, I see, some web forms may accept docx files or pdf (much more important), but your solution is too specific to Microsoft. What if a user sends a Libre Office or Only Office document from Linux or Mac? What if the user sends something from Android, which is today the most common case? We should find a more general approach.

My proof-of-concept is mainly about seeing how we can improve matching for any and all files. PureMagic is pretty awesome and can already match most of the document types you mentioned.

With regard to files that are all essentially a .zip, we need to find other markers that are always present to improve confidence rates. I'm sure we could find similar markers inside the LibreOffice (etc.) formats. I picked on .docx/.xlsx because I have plenty of those to test against, not because I'm Microsoft-centric.
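One candidate marker for the OpenDocument formats: as far as the ODF packaging rules go, a package is expected to store a file named `mimetype` as its first entry, uncompressed, so the MIME string sits in plain bytes right after the first local file header. The sketch below reads it; `read_stored_mimetype` is a made-up name, and files written by tools that ignore that convention would simply return None.

```python
import struct
from typing import Optional


def read_stored_mimetype(head: bytes) -> Optional[str]:
    """Return the MIME string of a ZIP whose first entry is an uncompressed 'mimetype' file."""
    if len(head) < 30 or not head.startswith(b"PK\x03\x04"):
        return None
    # Local file header fields (little-endian): compression method at byte 8,
    # compressed size at 18, uncompressed size at 22, name length at 26, extra length at 28.
    (method,) = struct.unpack_from("<H", head, 8)
    comp_size, _uncomp_size, name_len, extra_len = struct.unpack_from("<IIHH", head, 18)
    name = head[30:30 + name_len]
    if name != b"mimetype" or method != 0:  # method 0 means "stored" (no compression)
        return None
    start = 30 + name_len + extra_len
    payload = head[start:start + comp_size]
    if len(payload) != comp_size:
        return None  # buffer too short to hold the whole MIME string
    return payload.decode("ascii", errors="replace")
```

For an .odt file this would be expected to return `application/vnd.oasis.opendocument.text`, though that should be confirmed against real files before relying on it.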

NebularNerd commented 4 months ago

@NebularNerd yes, I would love to see a rules engine that can be engaged after the initial fast match to do those deeper searches, similar to what is outlined at the start of the issue: #12 (comment)

As a proof-of-concept it works, maybe I'll run up a PR and you can take a look and see what you think (you may even find a better implementation). For the time being I've got a way to drop them into the existing Multi-Match so we don't have to reinvent the wheel.

My main concern is that even with such a large confidence match the lesser matches still 'win'. (The confidence 'winner' is fixed if #66 is approved.)
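One possible way to let the longer match win a tie is to sort on confidence first and matched length second. This is only an illustrative sketch: the `Match` tuple and `rank` function are hypothetical stand-ins, not puremagic's result type, and this is not a description of what #66 changes.

```python
from typing import List, NamedTuple


class Match(NamedTuple):  # hypothetical stand-in for a puremagic match result
    extension: str
    confidence: float
    matched: bytes


def rank(matches: List[Match]) -> List[Match]:
    """Sort best-first: highest confidence, then longest matched byte string."""
    return sorted(matches, key=lambda m: (m.confidence, len(m.matched)), reverse=True)


candidates = [
    Match(".docx", 0.8, b"PK\x03\x04\x14\x00\x06\x00"),
    Match(".xlsx", 0.8, b"PK\x03\x04\x14\x00\x06\x00"),
    Match(".docx", 0.8, b"PK\x03\x04\x14\x00\x06\x00word/document.xml"),
    Match(".pptx", 0.4, b"PK\x03\x04"),
]
best = rank(candidates)[0]
```

With the sample values from the output above, `best` is the entry whose matched bytes include `word/document.xml`, because it ties on confidence but has the longest match.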

NebularNerd commented 3 months ago

I'll add this here for now as it's mostly on topic, but it is more of a 2.0 goal than an immediate solution. I had a random thought regarding PK-based files: rather than trying to reinvent the wheel, why not use Python's built-in zipfile module to get a file listing through ZipFile.namelist()? It's part of the standard library rather than an external module, so it fits with the design goals, and rather than complicated regex-ing we can just get a list of files and match against an expected list. Convert everything with .lower() to save worrying about filename casing, and match to our heart's content.

.apk, .docx, .jar, .xlsx and anything else would be almost instantly matchable. If we know a .docx has x files that would always be present we could test for their presence in the file.

This would be a secondary step to byte matching but it opens up a possible solution for dealing with those files. Scores would still be calculated the same way, just treat the matched file path (preferably paths to allow even longer matches) as if it were a byte string.

Sample file names you could match for:

.docx:

[Content_Types].xml
word/_rels/document.xml.rels

.jar:

META-INF/MANIFEST.MF

.apk:

AndroidManifest.xml
META-INF/MANIFEST.MF
res  <--- folder
lib <--- folder
assets <--- folder

.odt: (Would need to test more files to confirm contents)

META-INF/manifest.xml
content.xml
meta.xml
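A minimal sketch of the namelist idea, using only the standard library. The table `REQUIRED_NAMES` and the function `identify_zip_container` are hypothetical names for this example, and the marker sets are taken from the sample names above without further verification; .odt is left out because its contents still need confirming. Note that ZipFile.namelist() reports member names with forward slashes.

```python
import io
import zipfile
from typing import Optional

# Hypothetical marker table built from the sample names in this comment.
# All names are lowercased so filename casing does not matter.
REQUIRED_NAMES = {
    ".docx": {"[content_types].xml", "word/_rels/document.xml.rels"},
    ".apk": {"androidmanifest.xml", "meta-inf/manifest.mf"},
    ".jar": {"meta-inf/manifest.mf"},
}


def identify_zip_container(data: bytes) -> Optional[str]:
    """Guess a PK-based format from the member names listed in the archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = {name.lower() for name in archive.namelist()}
    except zipfile.BadZipFile:
        return None
    # Treat the matched paths like a byte string: the rule with the most
    # matched characters wins, so .apk outranks .jar when both sets are present.
    best, best_score = None, 0
    for extension, required in REQUIRED_NAMES.items():
        if required <= names:
            score = sum(len(name) for name in required)
            if score > best_score:
                best, best_score = extension, score
    return best
```

Only the archive's central directory is read, so no member is decompressed. Scoring by total matched path length is one way to reuse the existing "longer match wins" idea.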