barrycarey / RedditRepostSleuth

A high performance repost detection and administration bot for Reddit.
https://repostsleuth.com
GNU General Public License v3.0
175 stars, 10 forks

Add video post duplication detection support with videohash #303

Open akamhy opened 2 years ago

akamhy commented 2 years ago

Nice bot! I came across one of its comments on a subreddit and noticed that it lacks video support.

I am @akamhy, the creator of videohash, a near-duplicate video detection library for Python. Would you be interested in adding duplicate detection for video posts using the videohash library?

barrycarey commented 2 years ago

Hello,

I hadn't seen your library before, but it looks like it would work really well. In the past I put together a solution that generated hashes for a set of frames. However, it didn't scale well.

How does video hash do the comparison for lookup? The database of hashes would likely be over 100 million videos. I'm sure I could plug it into my solution for images but would be interested in another approach.

akamhy commented 2 years ago

> How does video hash do the comparison for lookup? The database of hashes would likely be over 100 million videos. I'm sure I could plug it into my solution for images but would be interested in another approach.

Similar to ImageHash, videohash produces a 64-bit hash and uses the Hamming distance between two hashes to tell videos apart. So the time required to query a videohash and an imagehash should be similar, and the lookup should be identical to what you are doing with ImageHash.
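To make the comparison concrete, here is a minimal sketch of a Hamming-distance lookup over 64-bit hashes stored as plain integers. It uses only the standard library and does not call videohash or ImageHash; the names `hamming_distance`, `nearest`, the post IDs, and the `max_distance` threshold of 10 are illustrative assumptions, not part of either library.

```python
from __future__ import annotations


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two 64-bit hashes."""
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")


def nearest(query: int, known: dict[str, int], max_distance: int = 10):
    """Linear scan: (post_id, distance) pairs within max_distance, closest first."""
    hits = [(post_id, hamming_distance(query, h)) for post_id, h in known.items()]
    return sorted((hit for hit in hits if hit[1] <= max_distance), key=lambda hit: hit[1])


# Hypothetical hashes for two already-indexed posts.
known = {"post_a": 0xF0F0F0F0F0F0F0F0, "post_b": 0x0123456789ABCDEF}
query = 0xF0F0F0F0F0F0F0F1  # differs from post_a in a single bit
matches = nearest(query, known)
```

A scan like this touches every stored hash on each query, so the cost grows linearly with the size of the database.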

Possible areas to check before using it in production are the hashing time (too slow for your usage?) and collisions (too many?). Also, I am ready to make changes to the library to make it more suitable for this particular use case.

Maybe you could try it out on some sample videos and suggest changes to the library if required.

barrycarey commented 2 years ago

I'm only using imagehash to create the hashes. I'm using a different solution for comparison, since computing Hamming distance directly didn't scale. However, it looks like I can do exactly the same thing with videohash.

I should be able to test it out in the next couple of weeks. I'm pretty limited on time right now.

I appreciate the heads up, I had no idea this existed.
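On the scaling point: the thread does not say which lookup method the bot uses, so the following is only one possible way to avoid scanning every hash, not the project's actual solution. It splits each 64-bit hash into four 16-bit bands. If two hashes differ in at most 3 bits, at least one band must match exactly, so only hashes sharing a band need a full comparison. `BandIndex` and all other names here are hypothetical.

```python
from collections import defaultdict

CHUNKS = 4        # a 64-bit hash is split into 4 bands
CHUNK_BITS = 16   # each band is 16 bits wide
MASK = (1 << CHUNK_BITS) - 1


def bands(h: int):
    """Yield (band_index, 16-bit value) pairs for a 64-bit hash."""
    for i in range(CHUNKS):
        yield i, (h >> (i * CHUNK_BITS)) & MASK


class BandIndex:
    """Exact-match tables per band, used to shortlist candidates."""

    def __init__(self):
        self.tables = defaultdict(set)  # (band_index, value) -> set of hashes

    def add(self, h: int) -> None:
        for key in bands(h):
            self.tables[key].add(h)

    def query(self, h: int, max_distance: int = 3):
        # Pigeonhole argument only holds when differing bits < number of bands.
        if max_distance >= CHUNKS:
            raise ValueError("max_distance must be smaller than the number of bands")
        candidates = set()
        for key in bands(h):
            candidates |= self.tables.get(key, set())
        # Verify each shortlisted candidate with a full Hamming distance check.
        return sorted(c for c in candidates if bin(c ^ h).count("1") <= max_distance)


index = BandIndex()
index.add(0xF0F0F0F0F0F0F0F0)
index.add(0x0123456789ABCDEF)
found = index.query(0xF0F0F0F0F0F0F0F3)  # two bits away from the first hash
```

The trade-off is that this only guarantees matches up to a small distance; a larger threshold needs more bands or a different structure.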