cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License
158 stars 34 forks source link

For Python 3.13: A drop-in replacement for `imghdr.what()` #76

Closed cclauss closed 3 months ago

cclauss commented 3 months ago

Closes #72

This might be more complete after #75 lands.

Current pytest results: 53 passed in 0.05s

@NebularNerd Your reviews, please.

NebularNerd commented 3 months ago

I've just had a look at the code and really need to spend some more time on it. These may seem obvious questions but I'm lacking caffeine and sleep (long days this week) so they are just to help me understand a little better:

  1. I assume you are converting bytes to hex when pushing them over to PureMagic? Looking at the test strings, 424d will work but BM would not if not first converted to hex.
  2. Which strings/files/tests are failing?
  3. Are you trying to get PureMagic to only say 'it's this' if it exactly matches your test strings? This would definitely cause hiccups.

I can see some improvements for your test strings for PBM, PGM and PPM so I'll add those to my PR later, but that should not affect the basic matches PureMagic already has. UPDATE These improvements are now in my PR. We can detect a SPACE, TAB or Win/*nix newline, additionally if they are immediately followed by a # that will improve confidence as well.

NebularNerd commented 3 months ago

Hi @cclauss, you might need to remove the : from Closes: to make GitHub link this PR to your issue.

Watching the commits with interest 🙂

cclauss commented 3 months ago

That Closes, Fixes, Resolves thing only works on the default branch.

NebularNerd commented 3 months ago

Ah! Did not notice you had the PR against the develop branch, my bad. 🙂

cclauss commented 3 months ago

There are certain cases where imghdr.what(file, h) and puremagic.what(file, h) disagree. My hunch is that puremagic is right and these incompatibilities are part of why the Python core team wants to remove imghdr.

Do you have suggestions for how we can always deliver identical results as imghdr.what(file, h)?

The current definition in this PR is:

def what(file: os.PathLike | str | None], h: str | bytes | None) -> str:

but I think we should make two changes to deliver optional bug-for-bug compatibility:

  1. h: str | bytes | None --> h: bytes | None because imghdr.what() does not accept h = "str"
  2. Add a strict_compat: bool = True option to deliver bug-for-bug results by default.
    def what(file: os.PathLike | str | None], h: bytes | None, strict_compat: bool = True) -> str:
    if strict_compat and answer := imghdr_compatibility(h):
        return answer

    Thoughts on naming? strict_compat, strict_imghdr, imghdr_compat, bug_for_bug, others?

The imghdr_compatibility() (name is also up for debate) function would deal with cases like this PR's test_what_from_string_todo() function.

if not strict_compat: then we would return puremagic's results unmodified.

NebularNerd commented 3 months ago

What files/cases are causing issues? Without #75 they may not work or behave as expected at present. Feel free to fling a .zip over or link to an image and I'll take a look.

cclauss commented 3 months ago

What files/cases are causing issues?

I was focused on h, not file so the cases are in the parameterization of the test_what_from_string_todo() function.

        ("jpeg", b"______JFIF"),
        ("jpeg", b"______Exif"),
        ("rgb", b"\001\332"),
        ("tiff", b"II"),
        ("tiff", b"MM"),
        ("xbm", b"#define "),

It would be wonderful to add more image files in test/resources/images/ as discussed on test/test_main.py line 17. exr, pbm, pgm, ppm, ras, rgb, xbm

NebularNerd commented 3 months ago

Quick breakdown based on #75, I'm assuming you are converting to hex:

The JFIF/EXIF thing is a would be nice to have rather than critical to matching, a JPG will always have the base marker of 0xffd8ff / b'\xff\xd8\xff' so any real file thrown at PureMagic will match. Adding these would boost confidence if they are present in the file.