cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License
158 stars 34 forks

Multi part reverse lookup #59

Closed NebularNerd closed 4 months ago

NebularNerd commented 5 months ago

Closes #57

While digging through files for PR #58 I discovered that we could do with a footer-style reverse lookup on the multi-part match to help boost confidence scores (especially if the footer is small as well). This modifies the multi-part match to allow reverse lookups.

To behave like the normal forward match, we aggregate matched.byte_match and magic_row.byte_match to provide a longer match to the _confidence function. One downside of this is that the match data will report the two matches smooshed together, e.g.:

Also, some unexpected results:

Sample files.zip

Any improvements/comments are welcome, it works but there might be a better, nicer looking way to handle this.
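
For anyone following along, the idea described above can be sketched roughly as below. This is a standalone illustration, not the code in this PR: `Match`, `matches_at`, `combine` and `toy_confidence` are hypothetical names, and the length-based score is an assumption made only to show why a longer aggregated `byte_match` would raise confidence.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """Hypothetical stand-in for a magic-number row (not puremagic's own type)."""
    byte_match: bytes
    offset: int  # a negative offset counts back from the end of the data
    name: str


def matches_at(data: bytes, row: Match) -> bool:
    """Check one row; a negative offset makes this a footer-style reverse lookup."""
    start = row.offset if row.offset >= 0 else len(data) + row.offset
    if start < 0:
        # The data is shorter than the footer offset requires.
        return False
    return data[start:start + len(row.byte_match)] == row.byte_match


def combine(header: Match, part: Match) -> Match:
    """Aggregate both byte_match values so a length-based score sees one longer match."""
    return Match(header.byte_match + part.byte_match, header.offset, part.name)


def toy_confidence(row: Match) -> float:
    """Illustrative score that grows with match length (assumed, not the library formula)."""
    return min(len(row.byte_match), 8) / 10


# Example data: a 4-byte header, some padding, and a 3-byte footer.
data = b"HEAD" + b"\x00" * 10 + b"END"
header = Match(b"HEAD", 0, "generic container")
footer = Match(b"END", -3, "specific format")

if matches_at(data, header) and matches_at(data, footer):
    combined = combine(header, footer)
    # combined.byte_match is b"HEADEND": the two matches joined together.
```

Joining the two byte strings keeps the scoring path identical to the forward match, at the cost of the combined value no longer corresponding to one contiguous run of bytes in the file.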

New entries for magic_data.json:

NebularNerd commented 5 months ago

Not sure why checks are failing, still a bit of a GitHub noob when it comes to certain aspects.

EDIT: Fixed it, ran the code block I modified through Black Playground and all is well. Will have to look more into that later for debugging/prettifying my own code.

NebularNerd commented 4 months ago

Conflicts resolved 🙂