bojand / infer

Small crate to infer file and MIME type by checking the magic number signature
MIT License

Invalid Java class file magic byte? #86

Closed criminosis closed 1 year ago

criminosis commented 1 year ago

I was running infer over some Java class files and wasn't getting a hit as Java, but instead as application/x-mach-binary. Setting aside the magic byte collision with Mach-O's definition for the moment, I'm not sure where the Java magic byte is coming from.

According to the spec, Java class files start with 0xCAFEBABE. I tried looking into the history of where Infer's current Java magic byte came from, but it looks like it has been there since the initial commit to this repo, and there didn't seem to be any more context or any other open issues.

This does mean a possible corrected Java matcher would collide with Mach-O's matcher.

@bojand I'd be happy to put up a PR to fix both issues if there's appetite for it. This comment in the Mach-O matcher is already referencing a post detailing how libmagic gets around this magic byte collision.

After the common magic byte, Java devotes 2 bytes to a minor version and then 2 bytes to a major version of the class file. Major versions start at 45; versions less than 45 are pre Java 1.1 and presumably from its prehistoric Oak period.

Mach-O devotes the full 4 bytes to specifying the number of multi-arch entries in the "fat" file. There are 18 defined architectures, AFAIK, that a Mach-O archive could contain at this time.
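To make the overlap concrete, here is a minimal sketch of how the same bytes 4..8 decode under each format. The function and struct names (`class_version`, `fat_arch_count`, `ClassVersion`) are hypothetical and not part of Infer's API; both formats store these fields big-endian.

```rust
/// Bytes 4..8 of a Java class file: minor version, then major version.
/// (Hypothetical helper type, not part of Infer.)
#[derive(Debug, PartialEq)]
pub struct ClassVersion {
    pub minor: u16,
    pub major: u16,
}

/// Interprets bytes 4..8 as Java's big-endian minor/major version pair.
/// Returns None if the buffer is shorter than 8 bytes.
pub fn class_version(buf: &[u8]) -> Option<ClassVersion> {
    let b = buf.get(4..8)?;
    Some(ClassVersion {
        minor: u16::from_be_bytes([b[0], b[1]]),
        major: u16::from_be_bytes([b[2], b[3]]),
    })
}

/// Interprets the same bytes 4..8 as the Mach-O fat header's
/// big-endian architecture count (the `nfat_arch` field).
pub fn fat_arch_count(buf: &[u8]) -> Option<u32> {
    let b = buf.get(4..8)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}
```

For a Java 8 class file header (`CA FE BA BE 00 00 00 34`), `class_version` yields minor 0 and major 52, while `fat_arch_count` reads the same bytes as 52 architectures, which is far more than any real fat binary would carry.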

Given that new widespread CPU architectures are few and far between nowadays, the approach described in the libmagic comment seems like a reasonable "hack" here to discriminate between the two:

  1. Load in the 4 bytes after a matched magic byte of 0xCAFEBABE.
  2. If the 4 bytes, read as a single big-endian number, are less than 45, then it's Mach-O (should be fine until an additional 27 new CPU architectures are added to Mach-O).
  3. If the number is 45 or greater, then it's a Java class file.
    • For extra due diligence: because the minor version occupies the 2 bytes before the major version's 2 bytes, a nonzero minor version makes the combined 4-byte value very large. Once it's confirmed the whole 4 bytes are >= 45, we should narrow down to just the latter 2 bytes and confirm that those 2 bytes alone are also >= 45.
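The steps above could be sketched roughly as follows. This is an illustration only, not the code from the PR or Infer's actual matcher: `classify_cafebabe`, `CafeBabe`, and `MIN_JAVA_MAJOR` are hypothetical names, and the sketch assumes a value of exactly 45 counts as Java, since major versions start at 45.

```rust
/// Shared magic number of Java class files and Mach-O fat binaries.
const MAGIC: [u8; 4] = [0xCA, 0xFE, 0xBA, 0xBE];
/// Lowest Java class file major version (assumed cut-off for the heuristic).
const MIN_JAVA_MAJOR: u32 = 45;

/// Hypothetical result type for the two formats sharing 0xCAFEBABE.
#[derive(Debug, PartialEq)]
pub enum CafeBabe {
    JavaClass,
    MachOFat,
}

/// Returns None when the buffer is too short, lacks the magic number,
/// or fits neither interpretation.
pub fn classify_cafebabe(buf: &[u8]) -> Option<CafeBabe> {
    if buf.len() < 8 || !buf.starts_with(&MAGIC) {
        return None;
    }
    // Step 1: load the 4 bytes after the magic as one big-endian number.
    let word = u32::from_be_bytes([buf[4], buf[5], buf[6], buf[7]]);
    // Step 2: a small value is taken to be a Mach-O architecture count.
    if word < MIN_JAVA_MAJOR {
        return Some(CafeBabe::MachOFat);
    }
    // Step 3: confirm the major version (the last 2 bytes) on its own,
    // because a nonzero minor version inflates the 4-byte value.
    let major = u16::from_be_bytes([buf[6], buf[7]]);
    if u32::from(major) >= MIN_JAVA_MAJOR {
        Some(CafeBabe::JavaClass)
    } else {
        None
    }
}
```

With this sketch, a header of `CA FE BA BE 00 00 00 34` (major version 52) classifies as a Java class file and `CA FE BA BE 00 00 00 02` as a two-architecture Mach-O fat binary. A header whose minor version is nonzero but whose major version is below 45 matches neither.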

Fwiw it seemed like Infer was also lacking a Java class file test case, so I'd add that for extra confirmation in my PR. It looks like it already has some Mach-O samples for testing.

criminosis commented 1 year ago

@bojand Put up my PR for your consideration https://github.com/bojand/infer/pull/87

Again I'm not sure what the original Java matcher was for, but happy to put it back as an additional check if that was known to match a particular case.

If you're good with the change and merge it, would you mind cutting a new release of Infer with it?