cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

Confidence/Selection logic question #29

Closed CSBaum closed 4 years ago

CSBaum commented 4 years ago

Hi,

I just found PureMagic and am trying to use it to check whether a file my script receives is an ELF binary. With a test ELF binary, instead of getting back "ELF executable" as I would expect, I am getting ".AppImage".

I ran readelf against the file, and here are the results:

$ readelf -h elf_hello/chello
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x80482f0
  Start of program headers:          52 (bytes into file)
  Start of section headers:          1904 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         7
  Size of section headers:           40 (bytes)
  Number of section headers:         27
  Section header string table index: 26

$ python3 -m puremagic elf_hello/chello
'elf_hello/chello' : .AppImage

I also dug into the magic_data.json file and found that the signatures for those two file types overlap: the AppImage signature is the last three bytes of the ELF one, matched one byte further into the file:

  "454c46", 1, ".AppImage"
  "7f454c46", 0, "", "", "ELF executable"

After some more digging, it looks like puremagic finds both options but always returns the AppImage entry.
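
The overlap can be reproduced without puremagic. This is a minimal sketch, assuming a signature is just a hex string plus an offset, as in the two entries above. `signature_matches` is a hypothetical helper for illustration, not part of puremagic's API, and the header bytes are copied from the Magic line of the readelf output:

```python
# Hypothetical helper for illustration; not puremagic's actual matching code.
def signature_matches(header: bytes, hex_signature: str, offset: int) -> bool:
    """Return True if the signature bytes appear in header at the given offset."""
    signature = bytes.fromhex(hex_signature)
    return header[offset:offset + len(signature)] == signature


# First 16 bytes of the test binary, copied from the readelf "Magic" line.
ELF_HEADER = bytes.fromhex("7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00")

appimage_hit = signature_matches(ELF_HEADER, "454c46", 1)  # ".AppImage" entry
elf_hit = signature_matches(ELF_HEADER, "7f454c46", 0)     # "ELF executable" entry
print(appimage_hit, elf_hit)
```

Both checks succeed for this header, and they would for any file that starts with the ELF magic number, so every ELF binary is a candidate for both entries.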

These are the 2 results from the confidence function:

PureMagicWithConfidence(byte_match=b'ELF', offset=1, extension='.AppImage', mime_type='application/x-iso9660-appimage', name='AppImage application bundle', confidence=0.9)
PureMagicWithConfidence(byte_match=b'\x7fELF', offset=0, extension='', mime_type='', name='ELF executable', confidence=0.9)

I admit I may be totally blind and just not seeing where the logic decides which one to choose. I'd understand if it looked at the file extension, saw that there wasn't one, and chose the ELF executable over the AppImage, but it looks like a toss-up when the confidence level is the same...

Thanks in advance for any insight, suggestions, etc :)

cdgriffith commented 4 years ago

Thank you for bringing this up! There was a bug: confidence should have been assigned based on how many identifying bytes were matched, and that was not set properly.
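
The idea behind that fix can be sketched as a tie-break on match length. This is an illustration only, assuming a simplified `Match` tuple with a subset of the fields shown in the results above. `pick_best` is a hypothetical name, and the actual change in puremagic may compute confidence differently:

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    # Simplified stand-in for the PureMagicWithConfidence results shown above.
    byte_match: bytes
    offset: int
    extension: str
    name: str
    confidence: float


def pick_best(matches: List[Match]) -> Match:
    """Prefer higher confidence; on a tie, prefer the longer byte signature."""
    return max(matches, key=lambda m: (m.confidence, len(m.byte_match)))


candidates = [
    Match(b"ELF", 1, ".AppImage", "AppImage application bundle", 0.9),
    Match(b"\x7fELF", 0, "", "ELF executable", 0.9),
]
best = pick_best(candidates)
print(best.name)
```

With equal confidence values, the four-byte ELF signature wins over the three-byte AppImage one regardless of the order in which the candidates were found.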

cdgriffith commented 4 years ago

Please try pip install "puremagic>=1.10" (quoted, so the shell does not treat >= as a redirect) and see if that fixes it for you!

Thanks again for bringing that up, can't believe it wasn't caught before!