Closed razrabotkajava closed 2 years ago
Have a look at the images after `image.convert("L").resize((hash_size, hash_size), Image.ANTIALIAS)`, which is what `average_hash` operates on. `hash_size` is 8 by default.
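To recap what happens after that resize, the average-hash step is just a per-pixel comparison against the mean brightness. A stdlib-only sketch of the idea (not the library's actual code, which operates on a numpy array):

```python
def average_hash_bits(pixels):
    # pixels: flat list of grayscale values of the resized
    # hash_size x hash_size image
    mean = sum(pixels) / len(pixels)
    # one bit per pixel: 1 if brighter than the mean, else 0
    return [1 if p > mean else 0 for p in pixels]

# toy 2x2 "image": two dark and two bright pixels
print(average_hash_bits([10, 200, 30, 220]))  # [0, 1, 0, 1]
```

The hash therefore has hash_size² bits, which is why the default 8×8 gives the 16-hex-digit strings seen in this issue.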
Recompute the hash of the two examples to verify it is the same; we want to rule out an SQL problem as the cause.
For example, do you need to write x'0f3f2764ecc482c2'? https://dev.mysql.com/doc/refman/8.0/en/hexadecimal-literals.html
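To illustrate why the literal matters: if the stored hash is a plain (non-binary) string, MySQL coerces it to a number in a numeric context by reading only a leading decimal prefix, so a string like '0f3f…' silently becomes 0. A stdlib-only Python simulation of that coercion (an assumption about the failure mode, not MySQL itself):

```python
import re

def mysql_numeric_coercion(s):
    # simulate MySQL's lenient string-to-number conversion:
    # parse the longest leading run of decimal digits, default 0
    m = re.match(r"\d+", s)
    return int(m.group()) if m else 0

h1 = "0f3f2764ecc482c2"  # the hash from this issue
h2 = "0a1b2c3d4e5f6071"  # a hypothetical, different hash
# both coerce to 0, so BIT_COUNT(h1 ^ h2) would report distance 0
print(mysql_numeric_coercion(h1) ^ mysql_numeric_coercion(h2))  # 0
```

If that is what is happening, every pair of hashes whose leading decimal prefixes match would be reported as identical, regardless of the actual bits.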
I found the same issue with average_hash. My way around it is to tune the hash size. As an experiment with my collection of photos (4.5 GB, 3500 files; mostly random things: people, landscapes, memes, etc.) I ran a loop over hash sizes from 8 to 1024 in steps of powers of 2, counting the number of duplicates. These are my results:

| hashsize | #duplicates |
|---|---|
| 8 | 213 |
| 16 | 187 |
| 32 | 182 |
| 64 | 179 |
| 128 | 179 |
| 256 | 179 |
| 512 | 179 |
| 1024 | 179 |
Generating the hashes for each image took 970 secs. Generating hashes larger than 1024 (up to, say, 8192) takes an eternity, so I'm not going to try that. As you can see, the count doesn't drop any more after 64. So let's see how it behaves from 32 to 64 in steps of 4.
| hashsize | #duplicates |
|---|---|
| 32 | 182 |
| 36 | 180 |
| 40 | 181 |
| 44 | 180 |
| 48 | 180 |
| 52 | 180 |
| 56 | 181 |
| 60 | 180 |
| 64 | 179 |
Generating the hashes for each image took 207 secs. Funny how 40 and 56 got more duplicates than some of the lower sizes.
So I guess I will stick with 64 as hashsize.
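A duplicate count like the ones in the tables above can be obtained by grouping identical hashes. A sketch of one way such a count might be computed (the hash values here are made up):

```python
from collections import Counter

def count_duplicates(hashes):
    # every file beyond the first in a group sharing a hash is a duplicate
    return sum(n - 1 for n in Counter(hashes).values() if n > 1)

# six files, two groups of repeats: two extra "a1", one extra "b2"
print(count_duplicates(["a1", "b2", "a1", "a1", "c3", "b2"]))  # 3
```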
I'm closing this, as it is a property of the implemented algorithms. However, blog posts guiding users on how best to choose a hash algorithm and settings would be appreciated, and I could link to them from the README.
Hello, I have a small database that contains hashes of photos.
When I try to find photos similar to the one below,
for which the hash "0f3f2764ecc482c2" was calculated using the average_hash() method:
imagehash.average_hash(Image.open(imagePath))
the system finds a very large number of collisions; below is an example of photos that were identified as completely identical:
The table in which I store hashes of photos:
Adding hash:
cur.execute('''INSERT INTO photos(id_photo, photo_hash) VALUES(%s, %s)''', (id, str(result_image_recognition['photo_hash'])))
The SQL query with which I calculate the Hamming distance between my hash and the stored hashes:
SELECT id_photo, BIT_COUNT(photo_hash ^ '0f3f2764ecc482c2') FROM photos;
The photos table contains 7889 photos, 959 of which are mistakenly reported by this query as completely identical (the computed distance is 0). I have not been able to solve this problem for about a week; please, someone help me.
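For reference, BIT_COUNT of the XOR is the Hamming distance between the two 64-bit hashes, and it can also be computed client-side from the stored hex strings, which sidesteps the string-versus-number question in SQL entirely. A minimal sketch:

```python
def hamming_hex(a, b):
    # Hamming distance between two 64-bit hashes given as hex strings
    return bin(int(a, 16) ^ int(b, 16)).count("1")

print(hamming_hex("0f3f2764ecc482c2", "0f3f2764ecc482c2"))  # 0 (identical)
print(hamming_hex("0f3f2764ecc482c2", "0f3f2764ecc482c3"))  # 1 (last bit differs)
```

Alternatively, storing the hash in a numeric column such as BIGINT UNSIGNED, or comparing against an x'…' hexadecimal literal as suggested above, should make the SQL query itself behave as intended.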