arsenetar / dupeguru

Find duplicate files
https://dupeguru.voltaicideas.net
GNU General Public License v3.0
5.48k stars 418 forks source link

Dupeguru PE: Improve algorithm speed #293

Open ghost opened 9 years ago

ghost commented 9 years ago

As described in the documentation, the matching algorithm has scope for significant improvements. The analysis phase produces 675 (= 3x15x15) 8-bit numbers for each image. By sorting the analysed pictures, some very bad matches can be discarded almost immediately. After that, the comparisons only have to take place over a limited range of the ordered set. And of course, by doing the comparisons in order, half the comparisons can be skipped anyway.

ghost commented 9 years ago

Pull requests to this effect are welcome.

patrickatamaniuk commented 8 years ago

@john005 can you elaborate the sorting criteria for me? I am interested in this.

silvavlis commented 8 years ago

Since it doesn't look as if @john005 will elaborate the sorting criteria as @patrickatamaniuk is asking for, let me try to analyze the proposal.

Sorting criterion

The key point of the whole idea is the less elaborated part of the proposal :disappointed: Which alternatives do I see as sorting criterion? In fact only one:

Add the values of all 675 bytes to a single number and use it for comparisons. Think about it, the comparison score being used to test against the threshold results from adding all single differences. But the result is the same if you add all the 675 bytes of a picture to a single number and then compare! That's basic arithmetic (although I needed some minutes to see it myself): (x1-y1)+(x2-y2) = (x1+x2)-(y1+y2).

Algorithm

Once the sorting criterion is clear, the bright of the proposal comes to shine. Let's see how it would work based on this documentation:

  1. Blockification phase: no caching of 675 bytes per picture required, just the addition of all of them. Entries in the cache are sorted. Therefore this phase will take longer than before, but the comparison phase will become much faster.
  2. Comparison phase: No matter where the algorithm starts, it compares each picture with each picture after it until reaching the threshold and then it stops, since the comparison with each of all following pictures will result in a higher score.

Since the number resulting from the addition for each picture gives an idea of the light present in the picture (the higher the number -maximum is 172800-, the lighter the picture), I will call the value brightness.

Example

Let me show just a simplified example: Picture P1 has a brightness of 10 Picture P2 has a brightness of 15 Picture P3 has a brightness of 5 Picture P4 has a brightness of 12

Here they are sorted:

P3 5
P1 10
P4 12
P2 15

Using a threshold of 3:

Score P3 against P1 (5) > threshold (3) => :x: no match Score against all others will be bigger than threshold, no more comparisons required for P3.

No comparison of P1 against P3 required, since already done. Score P1 against P4 (2) < threshold (3) => :white_check_mark: match Score P1 against P2 (5) > threshold (3) => :x: no match

No comparisons of P4 against P3 and P1 required, since already done. Score P4 against P2 (3) = threshold (3) => :white_check_mark: match

No comparisons of P2 against P3, P1 and P3 required, since already done.

Done! With only 4 comparisons instead of N*N=16 :astonished: BTW, the worst case is 6 comparisons (P3 against P1, P4 and P2, then P1 against P4 and P2 and finally P4 against P2).

The credits for the great idea go to @john005, I'll accept the credits for the elaboration :stuck_out_tongue_winking_eye:

ghost commented 8 years ago

I think that the logic behind that is sound, interesting! It also has the merit of being relatively simple, thus easy to implement.

If someone is interested in implementing it, I'd be glad to assist that person in clearing hurdles.