0x90d / videoduplicatefinder

Video Duplicate Finder - Crossplatform

[Question] Change the number of 'thumbnails' multiple times - I have to re-generate the thumbnails each time. It seems not logical #354

Open jeffward01 opened 1 year ago

jeffward01 commented 1 year ago

Issue

Each time a 'scan' runs after the number of thumbnails has changed, the thumbnails are not 'stored' or 'remembered'. This results in a very long file scan time.

Please consider the following scenario

Scenario

Action 1

Action 2

Action 3 (this is the important step)

Action 3 (this is the important step) Alternate version

Expected behavior


Question

Does this functionality exist?

Context --> Why I suggest this feature

0x90d commented 1 year ago

Graybyte values are saved. The thumbnails of the found duplicates are not saved between multiple scans. This is by design as these thumbnails can take a lot of space. But these thumbnails shouldn't affect scan speed as they're generated after scan is done.

jeffward01 commented 1 year ago

> Graybyte values are saved. The thumbnails of the found duplicates are not saved between multiple scans. This is by design as these thumbnails can take a lot of space. But these thumbnails shouldn't affect scan speed as they're generated after scan is done.

Let me do some testing to see, because in my experience if I repeat the above steps - it takes days to complete a scan.

Perhaps I am mixing up some settings and 're-scanning' the database, so that the database is dumped and then rescanned.

I will test and verify this, then report back to you on my findings either way 🙌


Question

> The thumbnails of the found duplicates are not saved between multiple scans.

  1. Any interest in making these files optionally persist to a target location, and perhaps add a flush feature?

For example, I have at least 159,142 video files in my library haha. So this means each time I run a scan, it will need to generate 159,142 * thumbnailCount thumbnails.

  2. Just for estimation reasons, do you happen to know the rough size of a thumbnail in KB? I'm just trying to estimate how large the cache could grow.
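To show the kind of estimate I mean, here is a minimal sketch. The per-thumbnail sizes are placeholders I made up, not measured values from VDF, and `estimate_cache_size_gb` is just a hypothetical helper. It also assumes every file keeps all of its thumbnails, which is the worst case.

```python
def estimate_cache_size_gb(file_count, thumbnails_per_file, kb_per_thumbnail):
    """Worst-case cache size in GB, taking 1 GB = 1024 * 1024 KB."""
    total_kb = file_count * thumbnails_per_file * kb_per_thumbnail
    return total_kb / (1024 * 1024)


FILE_COUNT = 159_142  # number of video files in my library

# Hypothetical per-thumbnail sizes in KB; the real size is what I am asking about.
for kb in (5, 20, 50):
    size_gb = estimate_cache_size_gb(FILE_COUNT, 3, kb)
    print(f"{kb} KB per thumbnail, 3 per file: {size_gb:.2f} GB")
```

Once the real per-thumbnail size is known, it can replace the placeholder values.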

Honestly tho, it can't be worse than JetBrains cache size 🙃

Maltragor commented 1 year ago

I have not tried the latest versions of VDF ..., but in principle it has been so far:

jeffward01 commented 1 year ago

Thank you @Maltragor for explaining that. It makes a lot of sense how it is “averaged out” to an even ratio, like in the example you gave.

If I made a pull request, do you think it would be helpful if the algorithm had a “memory” and made some adjustments so that it does not “re-scan” entries?

The adjustment would be to refactor how it selects where the scans take place, essentially by pre-setting slots.

For example, if you have 3 thumbnails, let’s say the thumbnails are at these positions:

3 thumbnails:
• 5% mark
• 50% mark
• 75% mark

4 thumbnails:
• 5% mark
• 33% mark
• 50% mark
• 75% mark

2 thumbnails:
• 5% mark
• 50% mark

1 thumbnail:
• 5% mark

As an example ^^.

I don’t see why it is necessary to re-pick the marks “by ratio” in an “even way” for each number of thumbnails. If it had pre-determined slots like in my example, it could be a lot faster.
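A minimal sketch of the idea, not VDF's actual code. The slot order comes from my example percentages above. `even_positions` is only my assumed stand-in for the current “by ratio” placement, and all function names are hypothetical.

```python
# Hypothetical slot order taken from the example above: each larger
# thumbnail count only adds one new slot to the previous set.
SLOT_PRIORITY = [5, 50, 75, 33]


def fixed_positions(count):
    """Percent marks for `count` thumbnails using the pre-set slots."""
    if not 1 <= count <= len(SLOT_PRIORITY):
        raise ValueError("count must be between 1 and %d" % len(SLOT_PRIORITY))
    return sorted(SLOT_PRIORITY[:count])


def even_positions(count):
    """Assumed stand-in for 'by ratio' placement: evenly spaced marks."""
    return [round(100 * (i + 1) / (count + 1), 1) for i in range(count)]


def missing_positions(cached, wanted):
    """Marks that still have to be generated, given what is already cached."""
    return [p for p in wanted if p not in cached]


# Going from 3 to 4 thumbnails:
print(missing_positions(fixed_positions(3), fixed_positions(4)))  # [33]
print(missing_positions(even_positions(3), even_positions(4)))    # [20.0, 40.0, 60.0, 80.0]
```

With the pre-set slots only the 33% mark is new when going from 3 to 4 thumbnails. With the evenly spaced stand-in none of the cached marks can be reused.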

Questions:

1.) If I made a PR for this, would it be something that would be accepted, or does it logically break something?

2.) Given the example of two identical 60-minute movies, where each movie has a different intro of exactly 10 seconds: would the current algorithm detect a duplicate? My assumption is no, because it would not see the gray pixels in the first 10 seconds. Is this correct?
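To make question 2 concrete, here is a quick arithmetic sketch. It assumes the samples are taken at the percentage marks from my example (5%, 50%, 75%), which may not be what VDF really does, and `sample_times` is a hypothetical helper.

```python
def sample_times(duration_s, percents):
    """Timestamps in seconds for the given percent marks of a video."""
    return [duration_s * p / 100 for p in percents]


INTRO_S = 10          # length of the differing intro
DURATION_S = 60 * 60  # 60 minute movie

times = sample_times(DURATION_S, [5, 50, 75])  # assumed marks, not confirmed
inside_intro = [t for t in times if t < INTRO_S]

print(times)
print(inside_intro)
```

Under that assumption the earliest sample is 3 minutes in, so no sample falls inside the 10 second intro. Whether the real algorithm samples this way is part of what I am asking.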