JohannesBuchner / imagehash

A Python Perceptual Image Hashing Module
BSD 2-Clause "Simplified" License

Many images are mistakenly identified as identical. #150

Closed razrabotkajava closed 2 years ago

razrabotkajava commented 3 years ago

Hello, I have a small database that contains hashes of photos

When I try to find photos similar to the one below, for which the following hash was calculated: "0f3f2764ecc482c2", using imagehash.average_hash(Image.open(imagePath)):

The system finds a very large number of collisions, below is an example of photos that were identified as completely identical:

The table in which I store hashes of photos:

CREATE TABLE IF NOT EXISTS photos(id_photo BIGINT PRIMARY KEY, photo_hash TEXT, FOREIGN KEY (id_photo) REFERENCES users (id));

Adding a hash:

cur.execute('''INSERT INTO photos(id_photo, photo_hash) VALUES(%s, %s)''', (id, str(result_image_recognition['photo_hash'])))

The SQL query with which I compute the Hamming distance between my hash and the stored hashes:

SELECT id_photo, BIT_COUNT(photo_hash ^ '0f3f2764ecc482c2') FROM photos;

The photos table contains 7889 photos, and 959 of them are mistakenly reported by this query as completely identical (a distance of 0). I have been stuck on this problem for about a week; please, can someone help me?

JohannesBuchner commented 3 years ago

Have a look at the images after image.convert("L").resize((hash_size, hash_size), Image.ANTIALIAS), which is what average_hash operates on. hash_size is 8 by default.
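To see why that downscaled view can collide so often, here is a minimal sketch of the thresholding step average_hash performs on the 8×8 grayscale grid. The helper names and the synthetic gradient "images" are mine for illustration; the bit packing mirrors the idea of the library's default, not its exact code.

```python
# Sketch of average_hash's core step: threshold each of the 64
# downscaled pixels against their mean. Any two images that shrink
# to the same above/below-mean pattern get the same hash.

def average_hash_bits(pixels):
    """pixels: flat list of 64 grayscale values (one 8x8 image)."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def to_hex(bits):
    return "%0*x" % (len(bits) // 4, int("".join(map(str, bits)), 2))

# Two visually different gradients that reduce to the same 8x8
# above/below-mean pattern -- hence identical hashes (a collision).
img_a = [10 * (i // 8) for i in range(64)]      # gentle vertical gradient
img_b = [20 * (i // 8) + 5 for i in range(64)]  # steeper, brighter gradient
print(to_hex(average_hash_bits(img_a)))  # -> 00000000ffffffff
print(to_hex(average_hash_bits(img_b)))  # -> 00000000ffffffff (same)
```

With only 64 bits derived from a mean threshold, smooth or low-contrast photos easily collapse onto the same pattern, which is consistent with the collision counts reported above.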

JohannesBuchner commented 3 years ago

Recompute the hashes of the two examples to verify they really are the same. We want to rule out a SQL problem as the cause.

For example, do you need to write x'0f3f2764ecc482c2'? https://dev.mysql.com/doc/refman/8.0/en/hexadecimal-literals.html
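As a sanity check, the distance the SQL query is meant to compute can be reproduced in plain Python on the hex digests. If MySQL receives the hashes as ordinary strings rather than x'...' hexadecimal literals, it casts them numerically before the XOR, so every distance can silently come out as 0; the function below shows the intended bit-level comparison. The helper name is mine.

```python
# Hamming distance between two hex digests, treating them as the
# 64-bit integers BIT_COUNT(a ^ b) is supposed to operate on.

def hamming_hex(a: str, b: str) -> int:
    return bin(int(a, 16) ^ int(b, 16)).count("1")

print(hamming_hex("0f3f2764ecc482c2", "0f3f2764ecc482c2"))  # 0: identical
print(hamming_hex("0f3f2764ecc482c2", "0f3f2764ecc482c3"))  # 1: one bit differs
```

If Python reports nonzero distances where the SQL query reports 0, the problem is in how the hashes reach MySQL, not in the hashing itself.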

dsbferris commented 2 years ago
I ran into the same issue with average_hash. My workaround is to increase the hash size. As an experiment with my photo collection (4.5 GB, 3500 files; mostly random subjects: people, landscapes, memes, etc.), I ran a loop over hash sizes from 8 to 1024 in powers of 2 and counted the number of duplicates. These are my results:

| hash_size | duplicates |
| --- | --- |
| 8 | 213 |
| 16 | 187 |
| 32 | 182 |
| 64 | 179 |
| 128 | 179 |
| 256 | 179 |
| 512 | 179 |
| 1024 | 179 |

Generating the hashes for all images took 970 seconds. Generating hashes larger than 1024, up to something like 8192, takes an eternity, so I'm not going to try that. As you can see, the count doesn't drop any further after 64. So let's see how it behaves from 32 to 64 in steps of 4.

| hash_size | duplicates |
| --- | --- |
| 32 | 182 |
| 36 | 180 |
| 40 | 181 |
| 44 | 180 |
| 48 | 180 |
| 52 | 180 |
| 56 | 181 |
| 60 | 180 |
| 64 | 179 |

Generating the hashes for all images took 207 seconds. Funny how 40 and 56 got more duplicates than the neighbouring sizes.

So I guess I will stick with 64 as the hash size.
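The duplicate tally in the experiment above can be reproduced with a simple grouping over the digests. The exact counting convention isn't stated in the comment; this sketch (with made-up digests) counts every file whose digest is shared with at least one other file.

```python
from collections import Counter

def count_duplicates(hashes):
    """Count files that share a digest with at least one other file."""
    counts = Counter(hashes)
    return sum(n for n in counts.values() if n > 1)

# Hypothetical digests standing in for a photo collection: three
# files share "aa11", so three files land in duplicate groups.
hashes = ["aa11", "aa11", "bb22", "cc33", "aa11"]
print(count_duplicates(hashes))  # -> 3
```

Running this over digests computed at increasing hash_size values would reproduce the sweep: as the hash grows, fewer distinct images collapse onto the same digest, until only genuinely near-identical files remain grouped.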

JohannesBuchner commented 2 years ago

I'm closing this, as it is a property of the implemented algorithms. However, blog posts guiding users on how best to choose a hash algorithm and its settings would be appreciated, and I could link to them from the README.