bojand / infer

Small crate to infer file and MIME type by checking the magic number signature
MIT License

`infer` incorrectly identifies `mkv` as `webm` #96

Open sobaq opened 1 week ago

sobaq commented 1 week ago

Demo:

$ ffprobe example.mkv
...
Input #0, matroska,webm, from example.mkv:
  Metadata:
    ENCODER         : Lavf60.16.100
  Duration: 00:02:46.10, start: 0.000000, bitrate: 42034 kb/s
  Stream #0:0: Video: hevc (Main), yuv420p(tv, bt709), 1920x1080 [SAR 1:1 DAR 16:9], 50 fps, 50 tbr, 1k tbn
...

$ cargo run -- example.mkv
Inferred: Ok(Some(Type { matcher_type: Video, mime_type: "video/webm", extension: "webm" }))
Code:

```rust
fn main() {
    let inf = std::env::args().nth(1).unwrap();
    println!("Inferred: {:?}", infer::get_from_path(&inf));
}
```

The current code detects two byte patterns. This file doesn't contain the first one:

$ xxd -d example.mkv | head -n1
00000000: 1a45 dfa3 a342 8681 0142 f781 0142 f281  .E...B...B...B..
$ #                 ^ diverges here

It does contain the second one, but at a different offset (bytes 24-31 instead of 31-38):

$ xxd -d example.mkv | grep "6d61 7472 6f73 6b61"
00000016: 0442 f381 0842 8288 6d61 7472 6f73 6b61  .B...B..matroska
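
To double-check the offset, here is a minimal sketch that searches the first 32 bytes of the file (copied from the two `xxd` lines above) for the ASCII string `matroska`. The helper name `find_subslice` and the constant `HEADER` are made up for illustration and are not part of `infer`.

```rust
/// First 32 bytes of example.mkv, copied from the xxd output above.
const HEADER: [u8; 32] = [
    0x1a, 0x45, 0xdf, 0xa3, 0xa3, 0x42, 0x86, 0x81,
    0x01, 0x42, 0xf7, 0x81, 0x01, 0x42, 0xf2, 0x81,
    0x04, 0x42, 0xf3, 0x81, 0x08, 0x42, 0x82, 0x88,
    0x6d, 0x61, 0x74, 0x72, 0x6f, 0x73, 0x6b, 0x61,
];

/// Returns the offset of the first occurrence of `needle` in `haystack`, if any.
/// (Hypothetical helper, not part of `infer`.)
fn find_subslice(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // `windows(0)` panics, so reject an empty needle up front.
    if needle.is_empty() {
        return None;
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    // prints Some(24)
    println!("{:?}", find_subslice(&HEADER, b"matroska"));
}
```

The string starts at byte 24, so a check that only looks at a fixed offset of 31 cannot match this file.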

sobaq commented 1 week ago

Here's an example: example.zip. It's H.264 + Opus and exhibits similar behaviour. I recorded it with OBS.

sobaq commented 1 week ago

It appears `file` works by matching `1a45 dfa3` at the beginning, then searching for the pattern `\x42\x82.matroska` (in regex syntax) anywhere in the first 4K.

4K seems very large. My understanding of the spec is that Matroska files must start with an EBML document, which must start with a header, which must contain their docType (matroska). The header can only be so big, so I think searching within the first ~256 bytes is fair.
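
A rough sketch of that idea, under stated assumptions: this is not the current `infer` code and not necessarily what the PR will look like. The names `has_doctype`, `is_mkv`, `is_webm`, `SEARCH_LIMIT` and `SAMPLE` are hypothetical, the 256-byte limit is just the guess from above, and the byte after `42 82` is accepted unconditionally (a slightly looser match than the regex `.`, which excludes newlines).

```rust
/// Magic number at the start of the file (`1a45 dfa3`).
const EBML_MAGIC: [u8; 4] = [0x1a, 0x45, 0xdf, 0xa3];
/// The two bytes that precede the docType in the pattern (`\x42\x82`).
const DOCTYPE_ID: [u8; 2] = [0x42, 0x82];
/// Assumed search window; the ~256 bytes suggested above, not a spec value.
const SEARCH_LIMIT: usize = 256;

/// First 32 bytes of example.mkv, copied from the xxd output earlier in the thread.
const SAMPLE: [u8; 32] = [
    0x1a, 0x45, 0xdf, 0xa3, 0xa3, 0x42, 0x86, 0x81,
    0x01, 0x42, 0xf7, 0x81, 0x01, 0x42, 0xf2, 0x81,
    0x04, 0x42, 0xf3, 0x81, 0x08, 0x42, 0x82, 0x88,
    0x6d, 0x61, 0x74, 0x72, 0x6f, 0x73, 0x6b, 0x61,
];

/// True if `buf` starts with the magic number and contains
/// `42 82 <any byte> <doctype>` within the first SEARCH_LIMIT bytes.
fn has_doctype(buf: &[u8], doctype: &[u8]) -> bool {
    if !buf.starts_with(&EBML_MAGIC) {
        return false;
    }
    let window = &buf[..buf.len().min(SEARCH_LIMIT)];
    // Two ID bytes, one arbitrary byte, then the docType string.
    let skip = DOCTYPE_ID.len() + 1;
    window
        .windows(skip + doctype.len())
        .any(|w| w.starts_with(&DOCTYPE_ID) && &w[skip..] == doctype)
}

fn is_mkv(buf: &[u8]) -> bool {
    has_doctype(buf, b"matroska")
}

fn is_webm(buf: &[u8]) -> bool {
    has_doctype(buf, b"webm")
}

fn main() {
    println!("mkv: {}, webm: {}", is_mkv(&SAMPLE), is_webm(&SAMPLE));
}
```

Searching a window rather than testing fixed offsets means the match no longer depends on how many header elements come before the docType.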

I'll make a PR for this soon 👍