akamhy / videohash

Near Duplicate Video Detection (Perceptual Video Hashing) - Get a 64-bit comparable hash-value for any video.
https://pypi.org/project/videohash
MIT License
264 stars 41 forks

Hash Collision #94

Open MikPisula opened 1 year ago

MikPisula commented 1 year ago

Describe the bug

Hash collision occurs with videos of the same length and similar colour schemes.

To Reproduce

from videohash import VideoHash

v1 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008752-da1f09c7-a177-4a46-9c64-230744e998c1.mp4')
v2 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008748-b8922142-37cc-48a0-bad9-1385ba016587.mov')
print(v1 == v2)

Expected behavior The hashes of the videos should be different.


Demmenie commented 1 year ago

I have noticed this too. Any idea what's causing it?

dale-wahl commented 1 year ago

Just learning this library myself, but if you check out the collages, you can see the collage images are virtually identical (located at v1.collage_path and v2.collage_path). Basically the scenes are too short, and as far as the video hash is concerned the collage consists of white pixels at the exact same two points and black everywhere else. My guess is that this will not be an effective tool with short videos such as the two in the example. I have been trying to find recommendations on minimum scene lengths.
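A quick way to see how close two hashes actually are is to count the bits in which they differ; a minimal sketch, assuming the hashes are available as '0b'-prefixed 64-bit strings (the form videohash's hash attribute uses):

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two equal-length bit strings."""
    a = hash_a[2:] if hash_a.startswith("0b") else hash_a
    b = hash_b[2:] if hash_b.startswith("0b") else hash_b
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Two hypothetical 64-bit hashes differing in exactly two positions:
h1 = "0b" + "0" * 64
h2 = "0b" + "0" * 62 + "11"
print(hamming_distance(h1, h2))  # → 2
print(hamming_distance(h1, h1))  # → 0, i.e. what == reports as equal
```

A distance of 0 is exactly the collision reported above; near-duplicate detection usually also treats small non-zero distances as a match, which makes the all-black collage problem worse.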

Just did some testing, and you can increase the number of frames per second. Check out the results of this:

v1 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008752-da1f09c7-a177-4a46-9c64-230744e998c1.mp4',frame_interval=5)
v2 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008748-b8922142-37cc-48a0-bad9-1385ba016587.mov',frame_interval=5)
print(v1 == v2)
# and compare their collages to the ones you created without using frame_interval
print(v1.collage_path)
print(v2.collage_path)

Demmenie commented 1 year ago

I'll have to look at that example later. I've also had the opposite problem, where the same video produces different hashes. On top of that, hashing always takes a few seconds, which is quite long for real-world applications these days.

I think I'll either have to fork this and see if I can improve it, or switch to something else. I'd also like to see if I can add partial fingerprinting, where a video that is part of another one can be recognised as such.
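As a rough illustration of the partial-fingerprinting idea (nothing like this exists in videohash today): hash fixed-length segments of both videos, then slide the clip's hash sequence over the full video's and look for the window with the smallest bit distance. All names and hash values below are hypothetical:

```python
def hamming(a: int, b: int) -> int:
    """Differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def best_match(clip: list[int], full: list[int]):
    """Slide the clip's per-segment hashes over the full video's and
    return (start_index, mean_bit_distance) of the closest window."""
    best = (None, float("inf"))
    for start in range(len(full) - len(clip) + 1):
        window = full[start:start + len(clip)]
        dist = sum(hamming(c, w) for c, w in zip(clip, window)) / len(clip)
        if dist < best[1]:
            best = (start, dist)
    return best

# Hypothetical per-segment hashes: the clip appears at offset 2.
full = [0xAAAA, 0xBBBB, 0x1234, 0x5678, 0xCCCC]
clip = [0x1234, 0x5678]
print(best_match(clip, full))  # → (2, 0.0)
```

A real implementation would need segment boundaries that survive re-encoding (e.g. scene cuts rather than fixed timestamps), which is the hard part.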

MikPisula commented 1 year ago

> Just learning this library myself, but if you check out the collages, you can see the collage images are virtually identical (located at v1.collage_path and v2.collage_path). Basically the scenes are too short and as far as the video hash is concerned consist of white pixels at the exact same two points and black everywhere else. My guess is that this will not be an effective tool with short videos such as the two in the example. I have been trying to find recommendations on minimum scene lengths.
>
> Just did some testing, and you can increase the number of frames per second. Check out the results of this:
>
> v1 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008752-da1f09c7-a177-4a46-9c64-230744e998c1.mp4', frame_interval=5)
> v2 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008748-b8922142-37cc-48a0-bad9-1385ba016587.mov', frame_interval=5)
> print(v1 == v2)
> # and compare their collages to the ones you created without using frame_interval
> print(v1.collage_path)
> print(v2.collage_path)

The issue of collages for short videos being almost entirely black seems to stem from the fact that the collage width is hard-coded to 1024px no matter what. Instead, I tried editing collagemaker.py so that it calculates the collage width from the already-existing variable self.images_per_row_in_collage, and it results in much nicer collages, although I have not tested it extensively. From my limited testing it produces the same hash for a video when:

  1. it is converted to a different format (tested on .mov)
  2. it is compressed
  3. it is downscaled (by 50%)

And, more importantly, it produces different hashes for the two videos I uploaded in the original issue.

Link: https://github.com/MikPisula/videohash/commit/b4b8f320c839790e94d11ace2cc850ce9cd450ae
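For reference, the fix described above amounts to deriving the collage width from the per-row frame count instead of a fixed 1024px. A rough sketch of the arithmetic (illustrative only, not the actual collagemaker.py code):

```python
import math

def collage_size(num_frames: int, frame_w: int, frame_h: int):
    """Size a collage from the number of frames instead of a
    hard-coded 1024px width (illustrative, not videohash's code)."""
    per_row = math.ceil(math.sqrt(num_frames))  # square-ish layout
    rows = math.ceil(num_frames / per_row)
    return per_row * frame_w, rows * frame_h

# A short video with only 6 extracted frames of 144x144 pixels:
print(collage_size(6, 144, 144))  # → (432, 288)
```

With a fixed 1024px width, those 6 frames would occupy a small corner of the canvas and the rest stays black, which is exactly what collapses the hashes of different short videos together.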

MikPisula commented 1 year ago

When it comes to performance, perhaps Python's multiprocessing library could be used to speed up the image-manipulation part?
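A minimal sketch of that idea, with a stand-in function in place of the real per-frame image work (an actual integration would map over frame file paths instead):

```python
from multiprocessing import Pool

def process_frame(frame_id: int) -> int:
    # Stand-in for per-frame image work (decode, resize, grayscale...).
    return frame_id * frame_id

if __name__ == "__main__":
    frame_ids = list(range(8))
    with Pool() as pool:  # defaults to os.cpu_count() workers
        results = pool.map(process_frame, frame_ids)
    print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49]
```

The __main__ guard is required on platforms that spawn rather than fork, which is part of the "works across devices" concern raised below; for I/O-bound frame extraction a thread pool may actually be the better fit.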

Demmenie commented 1 year ago

It could, but it has to be done in a way that works across devices. I think an algorithm with decent time complexity would be best. I'm also thinking it might be better to start over than to fork; I'd like to see if video fingerprinting is possible.

Edit: I just found this: https://pypi.org/project/videofingerprint/ Looks like @akamhy was working on it, but the repo doesn't exist anymore. (I'll start a separate issue for speed.)