JohannesBuchner / imagehash

A Python Perceptual Image Hashing Module
BSD 2-Clause "Simplified" License

Hash size doesn't match hash_size parameter for Daubechies wavelets hashing #149

Open jonemo opened 2 years ago

jonemo commented 2 years ago

I am surprised that the size of the computed hash is not always equal to the hash_size parameter that is available for all hashing methods. Specifically, imagehash.whash(img, hash_size=16, mode="db4") yields a hash of size 22 x 22.

While the readme does not make any explicit promises about the hash size, the naming of the parameter makes this outcome quite unexpected. Of course, my being surprised is not an issue in itself, and unless this is a bug, it would be unreasonable to break backward compatibility with a change in API or behavior. However, maybe it's worth clarifying in the documentation/readme that hash_size does not always match the resulting hash size?

The readme currently covers hash_size in this paragraph:

Each algorithm can also have its hash size adjusted (or in the case of colorhash, its binbits). Increasing the hash size allows an algorithm to store more detail in its hash, increasing its sensitivity to changes in detail.

Sample code:

    from PIL import Image
    import imagehash

    img = Image.open(path)  # path points to the example image attached below

    hash = imagehash.average_hash(img, hash_size=16)
    print(f"average_hash: {len(hash.hash)} x {len(hash.hash[0])}")
    hash = imagehash.dhash(img, hash_size=16)
    print(f"dhash: {len(hash.hash)} x {len(hash.hash[0])}")
    hash = imagehash.phash(img, hash_size=16)
    print(f"phash: {len(hash.hash)} x {len(hash.hash[0])}")
    hash = imagehash.whash(img, hash_size=16, mode="haar")
    print(f"whash haar: {len(hash.hash)} x {len(hash.hash[0])}")
    hash = imagehash.whash(img, hash_size=16, mode="db4")
    print(f"whash db4: {len(hash.hash)} x {len(hash.hash[0])}")

Output:

average_hash: 16 x 16
dhash: 16 x 16
phash: 16 x 16
whash haar: 16 x 16
whash db4: 22 x 22

Example image:

[attached image: tl-20210924-185242]

JohannesBuchner commented 2 years ago

Huh. Do you know why db4 does that?

jonemo commented 2 years ago

Sorry, I am the wrong person to ask this question. I used imagehash precisely because I have no clue about any of these algorithms. (And that was a year ago, now I know even less.)

JohannesBuchner commented 2 years ago

The two wavelet shapes: http://wavelets.pybytes.com/wavelet/db4/ http://wavelets.pybytes.com/wavelet/haar/ (nothing obvious there)

The whash function calls this: https://pywavelets.readthedocs.io/en/latest/ref/2d-dwt-and-idwt.html#d-multilevel-decomposition-using-wavedec2 with the default mode ('symmetric')

Maybe have a look at the input and output of this call https://github.com/JohannesBuchner/imagehash/blob/master/imagehash/__init__.py#L385
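
For reference, here is a minimal standalone sketch of that wavedec2 call (not imagehash's own code). It assumes the image has been rescaled to 512 x 512, so five decomposition levels take it down toward the 16 x 16 target, and uses random pixels as a stand-in. With the default 'symmetric' padding, each level maps n samples per axis to floor((n + filter_len - 1) / 2) coefficients; haar's filter length is 2, db4's is 8, which is where the extra width comes from.

    # Standalone sketch, not imagehash's code: compare approximation-coefficient
    # shapes for haar vs. db4 under pywt's default 'symmetric' padding.
    # Assumes a 512 x 512 input and 5 decomposition levels (512 / 2**5 = 16).
    import numpy as np
    import pywt

    # Stand-in for the rescaled grayscale image; the real scale depends on the input.
    pixels = np.random.rand(512, 512)

    for wavelet in ("haar", "db4"):
        coeffs = pywt.wavedec2(pixels, wavelet, mode="symmetric", level=5)
        print(wavelet, pywt.Wavelet(wavelet).dec_len, coeffs[0].shape)

    # haar (filter length 2): 512 -> 256 -> 128 -> 64 -> 32 -> 16  => (16, 16)
    # db4  (filter length 8): 512 -> 259 -> 133 -> 70 -> 38 -> 22  => (22, 22)

If the real call behaves the same way on this image, the 22 x 22 result is just the boundary padding from the longer db4 filter accumulating across the levels, rather than a bug.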

JohannesBuchner commented 2 years ago

In any case, given how differently the various methods work, no, hash_size does not necessarily have a consistent meaning across all methods.
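
As a practical workaround, callers can read the dimensions off the computed hash itself rather than assuming they equal hash_size; the sample code above already relies on hash.hash, which is a numpy boolean array. A minimal sketch, reusing the path variable from the sample code:

    # Small sketch: inspect the actual hash dimensions at runtime. hash.hash is
    # the underlying numpy bool array (as used in the issue's sample code);
    # .shape and .size are plain numpy attributes.
    import imagehash
    from PIL import Image

    h = imagehash.whash(Image.open(path), hash_size=16, mode="db4")
    rows, cols = h.hash.shape   # e.g. (22, 22) for db4 in this issue
    total_bits = h.hash.size    # e.g. 484, relevant when normalizing Hamming distances
    print(rows, cols, total_bits)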