For Python 3.13: A drop-in replacement for `imghdr.what()`

cclauss commented 3 months ago

Closes #72

72

This might be more complete after #75 lands.

75

Current pytest results: 53 passed in 0.05s

test_what_from_file is 6 tests
test_what_from_file_none is 1 test
test_what_from_string is 38 tests
test_what_from_string_py311 is 2 tests
- Fail on imghdr only on Python < 3.11
test_what_from_string_todo is 6 tests
- Fail on puremagic

@NebularNerd Your reviews, please.

NebularNerd commented 3 months ago

I've just had a look at the code and really need to spend some more time on it. These may seem obvious questions but I'm lacking caffeine and sleep (long days this week) so they are just to help me understand a little better:

I assume you are converting bytes to hex when pushing them over to PureMagic? Looking at the test strings, 424d will work but BM would not if not first converted to hex.
Which strings/files/tests are failing?
Are you trying to get PureMagic to only say 'it's this' if it exactly matches your test strings? This would definitely cause hiccups.

I can see some improvements for your test strings for PBM, PGM and PPM so I'll add those to my PR later, but that should not affect the basic matches PureMagic already has. UPDATE These improvements are now in my PR. We can detect a SPACE, TAB or Win/*nix newline, additionally if they are immediately followed by a # that will improve confidence as well.

NebularNerd commented 3 months ago

Hi @cclauss, you might need to remove the : from Closes: to make GitHub link this PR to your issue.

Watching the commits with interest 🙂

cclauss commented 3 months ago

That Closes, Fixes, Resolves thing only works on the default branch.

https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue

NebularNerd commented 3 months ago

Ah! Did not notice you had the PR against the develop branch, my bad. 🙂

cclauss commented 3 months ago

There are certain cases where imghdr.what(file, h) and puremagic.what(file, h) disagree. My hunch is that puremagic is right and these incompatibilities are part of why the Python core team wants to remove imghdr.

Do you have suggestions for how we can always deliver identical results as imghdr.what(file, h)?

The current definition in this PR is:

def what(file: os.PathLike | str | None], h: str | bytes | None) -> str:

but I think we should make two changes to deliver optional bug-for-bug compatibility:

h: str | bytes | None --> h: bytes | None because imghdr.what() does not accept h = "str"

Add a strict_compat: bool = True option to deliver bug-for-bug results by default.

def what(file: os.PathLike | str | None], h: bytes | None, strict_compat: bool = True) -> str:
if strict_compat and answer := imghdr_compatibility(h):
    return answer

Thoughts on naming? strict_compat, strict_imghdr, imghdr_compat, bug_for_bug, others?

The imghdr_compatibility() (name is also up for debate) function would deal with cases like this PR's test_what_from_string_todo() function.

if not strict_compat: then we would return puremagic's results unmodified.

NebularNerd commented 3 months ago

What files/cases are causing issues? Without #75 they may not work or behave as expected at present. Feel free to fling a .zip over or link to an image and I'll take a look.

cclauss commented 3 months ago

What files/cases are causing issues?

I was focused on h, not file so the cases are in the parameterization of the test_what_from_string_todo() function.

        ("jpeg", b"______JFIF"),
        ("jpeg", b"______Exif"),
        ("rgb", b"\001\332"),
        ("tiff", b"II"),
        ("tiff", b"MM"),
        ("xbm", b"#define "),

It would be wonderful to add more image files in test/resources/images/ as discussed on test/test_main.py line 17. exr, pbm, pgm, ppm, ras, rgb, xbm

NebularNerd commented 3 months ago

Quick breakdown based on #75, I'm assuming you are converting to hex:

("jpeg", b"__JFIF") = Not tested for (yet), Not present at fixed location in every file, or in some cases not present at all. As mentioned in PR we need a REGEX to improve confidence IF JFIF/EXIF is present. This would be among the first improvements the 2.0 goals should address. Existing JPEG matching works well enough at present
("jpeg", b"__Exif") = As above
("rgb", b"\001\332") = Per my PR, the existing match is too specific (literally one specific file variant), when implemented new .json will match this short header for starters, then build confidence by adding variant info, see notes in PR for more details.
("tiff", b"II") = Poor match for real files would give very low confidence, Existing match follows real published file specs of b'II*\x00' for Intel / b'MM\x00*' for Motorola, more detailed notes in my PR about this.
("tiff", b"MM") = As above
("xbm", b"#define ") = New format needs my PR merged to be matched.

The JFIF/EXIF thing is a would be nice to have rather than critical to matching, a JPG will always have the base marker of 0xffd8ff / b'\xff\xd8\xff' so any real file thrown at PureMagic will match. Adding these would boost confidence if they are present in the file.

cdgriffith / puremagic

For Python 3.13: A drop-in replacement for `imghdr.what()` #76

72

75