JohannesBuchner / imagehash

A Python Perceptual Image Hashing Module
BSD 2-Clause "Simplified" License

Many images are mistakenly identified as identical. #150

Closed razrabotkajava closed 2 years ago

razrabotkajava commented 3 years ago

Hello, I have a small database that contains hashes of photos

When I try to find photos similar to the one below, for which the following hash was calculated: "0f3f2764ecc482c2", using imagehash.average_hash(Image.open(imagePath)):

The system finds a very large number of collisions, below is an example of photos that were identified as completely identical:

The table in which I store hashes of photos:

CREATE TABLE IF NOT EXISTS photos(id_photo BIGINT PRIMARY KEY, photo_hash TEXT, FOREIGN KEY (id_photo) REFERENCES users (id));

Adding a hash:

cur.execute('''INSERT INTO photos(id_photo, photo_hash) VALUES(%s, %s)''', (id, str(result_image_recognition['photo_hash'])))

The SQL query with which I compute the Hamming distance between my hash and the stored hashes:

SELECT id_photo, BIT_COUNT(photo_hash ^ '0f3f2764ecc482c2') FROM photos;

The photos table contains 7889 photos, and 959 of them are mistakenly reported by this query as completely identical (a distance of 0). I have been stuck on this problem for about a week; please, can someone help me?

JohannesBuchner commented 3 years ago

Have a look at the images after image.convert("L").resize((hash_size, hash_size), Image.ANTIALIAS), which is what average_hash operates on. hash_size is 8 by default.
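To see why that downscaled view can collide so often, here is a minimal sketch of the thresholding step average_hash performs on the 8×8 grayscale grid. The helper names and the synthetic gradient "images" are mine for illustration; the bit packing mirrors the idea of the library's default, not its exact code.

```python
# Sketch of average_hash's core step: threshold each of the 64
# downscaled pixels against their mean. Any two images that shrink
# to the same above/below-mean pattern get the same hash.

def average_hash_bits(pixels):
    """pixels: flat list of 64 grayscale values (one 8x8 image)."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def to_hex(bits):
    return "%0*x" % (len(bits) // 4, int("".join(map(str, bits)), 2))

# Two visually different gradients that reduce to the same 8x8
# above/below-mean pattern -- hence identical hashes (a collision).
img_a = [10 * (i // 8) for i in range(64)]      # gentle vertical gradient
img_b = [20 * (i // 8) + 5 for i in range(64)]  # steeper, brighter gradient
print(to_hex(average_hash_bits(img_a)))  # -> 00000000ffffffff
print(to_hex(average_hash_bits(img_b)))  # -> 00000000ffffffff (same)
```

With only 64 bits derived from a mean threshold, smooth or low-contrast photos easily collapse onto the same pattern, which is consistent with the collision counts reported above.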

JohannesBuchner commented 3 years ago

Recompute the hashes of the two examples to verify they really are the same. We want to rule out a SQL problem as the cause.

For example, do you need to write x'0f3f2764ecc482c2'? https://dev.mysql.com/doc/refman/8.0/en/hexadecimal-literals.html
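As a sanity check, the distance the SQL query is meant to compute can be reproduced in plain Python on the hex digests. If MySQL receives the hashes as ordinary strings rather than x'...' hexadecimal literals, it casts them numerically before the XOR, so every distance can silently come out as 0; the function below shows the intended bit-level comparison. The helper name is mine.

```python
# Hamming distance between two hex digests, treating them as the
# 64-bit integers BIT_COUNT(a ^ b) is supposed to operate on.

def hamming_hex(a: str, b: str) -> int:
    return bin(int(a, 16) ^ int(b, 16)).count("1")

print(hamming_hex("0f3f2764ecc482c2", "0f3f2764ecc482c2"))  # 0: identical
print(hamming_hex("0f3f2764ecc482c2", "0f3f2764ecc482c3"))  # 1: one bit differs
```

If Python reports nonzero distances where the SQL query reports 0, the problem is in how the hashes reach MySQL, not in the hashing itself.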

dsbferris commented 2 years ago
I ran into the same issue with average_hash. My workaround is to increase the hash size. As an experiment with my photo collection (4.5 GB, 3500 files; mostly random subjects: people, landscapes, memes, etc.), I ran a loop over hash sizes from 8 to 1024 in powers of 2 and counted the number of duplicates. These are my results:

| hash_size | duplicates |
| --- | --- |
| 8 | 213 |
| 16 | 187 |
| 32 | 182 |
| 64 | 179 |
| 128 | 179 |
| 256 | 179 |
| 512 | 179 |
| 1024 | 179 |

Generating the hashes for all images took 970 seconds. Generating hashes larger than 1024, up to something like 8192, takes an eternity, so I'm not going to try that. As you can see, the count doesn't drop any further after 64. So let's see how it behaves from 32 to 64 in steps of 4.

| hash_size | duplicates |
| --- | --- |
| 32 | 182 |
| 36 | 180 |
| 40 | 181 |
| 44 | 180 |
| 48 | 180 |
| 52 | 180 |
| 56 | 181 |
| 60 | 180 |
| 64 | 179 |

Generating the hashes for all images took 207 seconds. Funny how 40 and 56 got more duplicates than the neighbouring sizes.

So I guess I will stick with 64 as the hash size.
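The duplicate tally in the experiment above can be reproduced with a simple grouping over the digests. The exact counting convention isn't stated in the comment; this sketch (with made-up digests) counts every file whose digest is shared with at least one other file.

```python
from collections import Counter

def count_duplicates(hashes):
    """Count files that share a digest with at least one other file."""
    counts = Counter(hashes)
    return sum(n for n in counts.values() if n > 1)

# Hypothetical digests standing in for a photo collection: three
# files share "aa11", so three files land in duplicate groups.
hashes = ["aa11", "aa11", "bb22", "cc33", "aa11"]
print(count_duplicates(hashes))  # -> 3
```

Running this over digests computed at increasing hash_size values would reproduce the sweep: as the hash grows, fewer distinct images collapse onto the same digest, until only genuinely near-identical files remain grouped.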

JohannesBuchner commented 2 years ago

I'm closing this, as it is a property of the implemented algorithms. However, blog posts guiding users on how best to choose a hash algorithm and its settings would be appreciated, and I could link to them from the README.