arsenetar / dupeguru

Find duplicate files
https://dupeguru.voltaicideas.net
GNU General Public License v3.0
5.41k stars, 415 forks

fuzzy movie file/segment matching #675

Open mummifiedclown opened 4 years ago

mummifiedclown commented 4 years ago

I know this might be a big ask, but I think I have a potentially good algorithm for it.

I'm looking for the ability to match movie files (or segments that match part of an existing larger movie file) saved with a different encoding or resolution. I think it could be done by leveraging DupeGuru's existing fuzzy image matching.

What I have in mind would require incorporating ffmpeg into DG and using it to generate a digest of significant thumbnails (one per shot change) for each video (https://superuser.com/questions/538112/meaningful-thumbnails-for-a-video-using-ffmpeg), and maybe a record of the elapsed time between them as well. Then, if the first thumbnail matches a thumbnail from another movie file, check the subsequent thumbnails and timestamps and base the match on those. This would match movies of equal length and content, as well as movies that are cuts from others.
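The matching step described above could be sketched roughly as follows. This is only an illustration, not dupeGuru code: it assumes each video has already been reduced to a list of `(timestamp, hash)` pairs, one per detected shot change, where the hash is a 64-bit perceptual hash of the thumbnail. The names `Shot`, `hamming`, `find_segment`, `max_bits` and `max_drift` are all hypothetical, and the thresholds are placeholder values.

```python
from typing import List, Optional, Tuple

# Hypothetical digest entry: (timestamp in seconds, perceptual hash of the thumbnail)
Shot = Tuple[float, int]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def find_segment(clip: List[Shot], movie: List[Shot],
                 max_bits: int = 6, max_drift: float = 0.5) -> Optional[int]:
    """Return the index in `movie` where the shot sequence of `clip` starts, or None.

    A candidate position matches when every thumbnail hash is within
    `max_bits` bits of its counterpart and the elapsed time since the
    first matched shot agrees to within `max_drift` seconds.
    """
    if not clip or len(clip) > len(movie):
        return None
    for start in range(len(movie) - len(clip) + 1):
        matched = True
        for i, (t, h) in enumerate(clip):
            movie_t, movie_h = movie[start + i]
            if hamming(h, movie_h) > max_bits:
                matched = False
                break
            # Compare elapsed time relative to the first shot, so a cut
            # taken from the middle of a longer movie still lines up.
            clip_elapsed = t - clip[0][0]
            movie_elapsed = movie_t - movie[start][0]
            if abs(clip_elapsed - movie_elapsed) > max_drift:
                matched = False
                break
        if matched:
            return start
    return None


movie = [(0.0, 0x000F), (4.0, 0xF0F0), (9.5, 0x1234), (15.0, 0xABCD), (21.0, 0x0F0F)]
clip = [(1.0, 0xF0F1), (6.5, 0x1234), (12.0, 0xABCF)]  # re-encoded cut of the movie
start_index = find_segment(clip, movie)
```

Two videos of equal length and content are the special case where the clip and the movie have the same number of shots and the match starts at index 0. A real implementation would also need to cope with missed or spurious shot detections, which this sketch does not attempt.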

glubsy commented 4 years ago

I had a similar idea, but generating thumbnails only for the first few frames, chosen depending on the video length. If two videos have the same length, the thumbnails would be generated from the same frames, and those thumbnails would then be compared.
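One possible reading of this, as a minimal sketch: derive the sample positions purely from the duration, so that two files of the same length always yield thumbnails from the same points in time. The function name `sample_timestamps` and the `count` and `window` parameters are made up for illustration, and the "first 10% of the video" window is an assumption, not something specified above.

```python
from typing import List


def sample_timestamps(duration: float, count: int = 3, window: float = 0.1) -> List[float]:
    """Return `count` evenly spaced timestamps (in seconds) inside the
    first `window` fraction of a video that is `duration` seconds long.

    The result depends only on the duration, so videos of equal length
    are always sampled at the same positions.
    """
    span = duration * window
    return [round(span * (i + 1) / (count + 1), 3) for i in range(count)]


same_a = sample_timestamps(120.0)
same_b = sample_timestamps(120.0)
other = sample_timestamps(95.0)
```

The thumbnails taken at these timestamps could then be fed to the existing fuzzy image comparison. Videos whose durations differ would be sampled at different positions, so this approach only covers the equal-length case.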

It's been on my TODO list for a while, so if nobody does it in the next few years, perhaps I'll have a go at it.

strazto commented 3 years ago

Check out https://github.com/matthewstrasiotto/videoduplicatefinder

dupeguru has a better interface, but this does exactly what you want.

I'm actually going to raise a request here to see if the .dupeguru format could be modified to allow for thumbnail indirection, so that a separate thumbnail preview could be nominated for a given result. dupeguru has really excellent features for selecting, filtering and prioritizing dupes that I don't really want to reimplement in my project; I'd rather just export to dupeguru and lean on it to do the work.