akamhy / videohash

Near Duplicate Video Detection (Perceptual Video Hashing) - Get a 64-bit comparable hash-value for any video.
https://pypi.org/project/videohash
MIT License
264 stars 41 forks

Video hashes on vastly different videos yield is_similar() True #100

Open christopherwingert opened 1 year ago

christopherwingert commented 1 year ago

Would modifying similar_percentage help? If so, which direction should I go?

96jaco96 commented 1 year ago

I'm having this problem too, and I've spent all day today debugging why it happens.

So far I've discovered this:

The "is_similar" function in videohash.py does this check:

if self - other <= ceil((self.similar_percentage / 100) * self.bits_in_hash)

BUT videohash.py also defines these two values: self.bits_in_hash = 64 and self.similar_percentage = 15

So with the defaults, the previous check ALWAYS boils down to: if self - other <= ceil((15 / 100) * 64), and ceil(9.6) is ALWAYS 10.

Basically, changing the check in the "is_similar" function from if self - other <= ceil((self.similar_percentage / 100) * self.bits_in_hash) to if self - other <= 10 returns the same results. I've tested this with a sample of 1000 videos: the results are identical both with the default check and when using "if self - other <= 10".
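To make the arithmetic concrete, here is a minimal sketch that recomputes the threshold from the formula quoted above for a few percentages. The helper name similarity_threshold is hypothetical and is not part of videohash; only the formula and the two default values (64 and 15) come from videohash.py as quoted in this comment.

```python
from math import ceil

BITS_IN_HASH = 64  # default quoted above from videohash.py


def similarity_threshold(similar_percentage: float, bits_in_hash: int = BITS_IN_HASH) -> int:
    """Hypothetical helper: the right-hand side of the quoted is_similar check."""
    return ceil((similar_percentage / 100) * bits_in_hash)


# Show how the cutoff moves as similar_percentage changes.
for pct in (5, 10, 15, 20, 25):
    print(f"similar_percentage={pct:>2} -> self - other must be <= {similarity_threshold(pct)}")
```

If the check really is what is quoted above, a lower similar_percentage gives a smaller cutoff, so fewer pairs would pass as similar, and a higher one gives a looser cutoff.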

Correct me if I'm wrong, I'm quite noob-ish here and just making some observations... in fact, I'm not even sure, mathematically speaking, what exactly this check is doing.
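My best guess at what the check means, as a rough sketch: I'm assuming that "self - other" counts how many of the 64 bits differ between the two hashes (the Hamming distance). I haven't confirmed that this is what the subtraction in videohash.py does, and the function names and hash values below are made up for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two non-negative integers differ."""
    return bin(a ^ b).count("1")


def looks_similar(a: int, b: int, threshold: int = 10) -> bool:
    """Stand-in for the quoted check, ASSUMING 'self - other' is a Hamming distance."""
    return hamming_distance(a, b) <= threshold


# Made-up 64-bit hashes, not produced by videohash.
base = 0xFFFF0000FFFF0000
ten_bits_off = base ^ 0x3FF     # flips the 10 lowest bits
eleven_bits_off = base ^ 0x7FF  # flips the 11 lowest bits

print(hamming_distance(base, ten_bits_off), looks_similar(base, ten_bits_off))
print(hamming_distance(base, eleven_bits_off), looks_similar(base, eleven_bits_off))
```

Under that assumption, the default check treats two videos as similar whenever their hashes differ in at most 10 of the 64 bits.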

ALSO, I think this may be somewhat related to issue #94 "Hash Collision", if that helps...