arsenetar / dupeguru

Find duplicate files
https://dupeguru.voltaicideas.net
GNU General Public License v3.0

decrease comparison complexity N^2 -> N*logN #273

Open lambder opened 9 years ago

lambder commented 9 years ago

I'd like to run the application on a huge picture collection (1 million images). That would require 10^12 comparisons. I'd like to propose an improvement which would require roughly 10^7 (N log N) comparisons.

The sketch of the algorithm:

  1. for every picture, when "blockifying" it, calculate a SHA1 or MD5 hash (complexity N)
  2. sort the database by hash (I suppose the database can have an index already) (complexity N log N)
  3. scan the database in hash order; whenever two adjacent hashes are identical, do a raw comparison (that should rarely be needed, only to rule out hash collisions)

I'd be happy to implement it, but could you point me to the code responsible for storing the "blockified" images in the database and to the code which runs the comparison phase?
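A minimal sketch of the proposed hash-sort-scan approach. `blockify` here is only a placeholder standing in for dupeGuru's block extraction (it just reads raw bytes so the sketch runs end to end), not the actual API in core_pe:

```python
import hashlib
from itertools import groupby

def blockify(path):
    # Placeholder for the real block extraction; reads raw bytes
    # so this sketch is self-contained.
    with open(path, 'rb') as f:
        return f.read()

def block_hash(path):
    # 1. one hash per picture: O(N)
    return hashlib.sha1(blockify(path)).hexdigest()

def find_duplicate_groups(paths):
    hashed = sorted((block_hash(p), p) for p in paths)   # 2. sort: O(N log N)
    groups = []
    # 3. scan in hash order: identical neighbours are candidate duplicates
    for _, grp in groupby(hashed, key=lambda t: t[0]):
        files = [path for _, path in grp]
        if len(files) > 1:
            # a raw block comparison here would rule out hash collisions
            groups.append(files)
    return groups
```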

ghost commented 9 years ago

That would be great, but we would lose "fuzziness" in the comparison, so only blocks that match exactly could be matched (no threshold possible).

It would certainly be useful in cases like yours, but it couldn't replace the current algorithm. We'd have to create a new scan type.

The logic is at https://github.com/hsoft/dupeguru/blob/master/core_pe/matchblock.py

ghost commented 9 years ago

I might not have expressed my enthusiasm properly in my previous comment: it would be awesome :)

lambder commented 9 years ago

Ah, I see. How about this: given a fuzziness threshold, calculate the blocks so that the colors are rounded down to some level. E.g. no fuzziness = the colors in a block are unchanged; fuzziness = 10% means each RGB channel is rounded down to one of 10 levels (256/10 ≈ 25), so the color (123, 11, 210) would become (100, 0, 200). This way we would push similar images together. In practice it would be better to flatten the colors in HSV rather than RGB space.
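A tiny sketch of the rounding-down idea; the number of levels is just the 10% example from above:

```python
def quantize_channel(value, levels=10):
    # Round an 8-bit channel value down to one of `levels` buckets.
    step = 256 // levels            # 25 for 10 levels
    return (value // step) * step

def quantize_color(rgb, levels=10):
    return tuple(quantize_channel(c, levels) for c in rgb)

print(quantize_color((123, 11, 210)))   # -> (100, 0, 200)
```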

lambder commented 9 years ago

Since we are pixelating the images (blocks), the differences can be calculated locally (at the pixel level) as opposed to at the whole-image level. So rather than calculating avgdiff in the comparison phase, we reduce the value domain size on cache insert. Or is avgdiff something we really need? What is the real-life use case? I imagine a picture which is slightly shifted or rotated should be considered a duplicate, as opposed to the same picture but with a black bird in the top right corner.

lambder commented 9 years ago

This approach would have the additional benefit of better parallelisation, as one would only have to compare very short (local) runs of identical hashes.
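For illustration, a rough sketch of how the sorted hash list splits into small independent groups that can be compared in parallel; `compare_group` is a placeholder for whatever pairwise comparison is wanted:

```python
from itertools import groupby
from multiprocessing import Pool

def compare_group(files):
    # Placeholder for the expensive pairwise (avgdiff-style) comparison;
    # it only ever sees one short group of files sharing a hash.
    return files

def parallel_candidates(hashed_files):
    # hashed_files: (hash, path) tuples already sorted by hash
    groups = [
        [path for _, path in grp]
        for _, grp in groupby(hashed_files, key=lambda t: t[0])
    ]
    groups = [g for g in groups if len(g) > 1]
    with Pool() as pool:                 # each short group is independent
        return pool.map(compare_group, groups)
```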

ghost commented 9 years ago

I developed this algo without thinking too much about it, but it happens to be rather effective at finding "double shots" (different shots of the same scene).

What you suggest, making the blockifier have a user-defined "color resolution", is a good idea, but I see two major problems upfront:

  1. Our cache gets much larger. We have to cache blocks for every picture/resolution pair.
  2. We lose many duplicates. Let me explain:

Let's say we have two pictures that differ by only one RGB value of one pixel. With your proposal, if we're unlucky enough, it's possible that this one-pixel difference is just enough to put one picture on one "side of the hash fence" and the other picture on the other side of that fence. We would lose those duplicates because the hashes wouldn't be the same.
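For illustration, with a bucket size of 25 (the numbers are only an example), two channel values that differ by a single unit can still straddle a boundary:

```python
def bucket(value, step=25):
    return value // step

print(bucket(124), bucket(125))   # -> 4 5: one unit apart, two different buckets
```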

Of course, when speed is important, we might not care so much about missing those duplicates. That's why I think we're better off offering this new algo as an alternative.

lambder commented 9 years ago

Hi,

@1 The cache can be very small, as we no longer need to hold blocks (hashes are enough). In fact, the cache could be a regular in-memory dict. Only when two hashes are identical (a rare situation) would we rerun the block calculation and additionally run your current implementation to calculate the average color distance.

@2 Since we are fuzzing, that one-pixel difference will be averaged over a larger square. It would be desirable to pixelate with a shift so the division does not stick to the fences.
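Going back to @1, a minimal sketch of such an in-memory dict, with `hash_func` standing in for whatever computes the block hash; the fallback to the existing avgdiff comparison is only indicated by a comment:

```python
from collections import defaultdict

def group_by_hash(paths, hash_func):
    # Hold only hashes in memory; block data never needs to be cached.
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[hash_func(path)].append(path)

    matches = []
    for files in by_hash.values():
        if len(files) > 1:
            # rare case: recompute the blocks here and fall back to the
            # existing avgdiff comparison to confirm (or reject) the match
            matches.append(files)
    return matches
```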

ghost commented 9 years ago

(These users, 1 and 2, must have a lot of notifications on Github...)

About number two: of course, most of the time averaging will group them together. It's just in the nature of a hashing algo that some neighbours, even very close ones, will be split apart (in the same way that natural rounding splits 4 and 5 very far apart).

Maybe having two hashes that use different "rounding" algorithms could do the trick, with a match on either hash counting as a match (is that what you meant by "pixelate with shift"? I didn't understand that part), but otherwise it's inevitable that we'll lose some matches.

Also, w.r.t. number 1, the reason the cache isn't in memory isn't really the size of the blocks, but rather to save us some processing in future scans.

lambder commented 9 years ago

Ah, I see your point. There are two kinds of fuzzing. The first is pixelization (or downscaling the image). By shifting I was thinking of pixelating with an offset, so that pixels are not sensitive to their location relative to the 'fence'. The second type of fuzzing, which is the equivalent of your average image distance, is reduction of the color space. E.g. when we operate in HSV space, we are effectively bucketing color values. If I understand your point, the problem is that we will sometimes classify a pixel (or block) into a different bucket even though the colors are really close. I like the idea of shifting/offsetting the buckets, but that would create not 2 but 8 hash combinations, since there are 2^3 possibilities for how a pixel (block) can be classified into a bucket (lower/higher H boundary, S boundary and V boundary, or any combination of these 3). I'm going to build a prototype in Clojure, as I don't feel fluent in Python, just to experiment with it.
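For illustration (in Python rather than Clojure), a rough sketch of the 8 offset-combination hashes, assuming each block has already been reduced to an average (h, s, v) tuple scaled to 0-255; the step size and half-step offset are arbitrary choices:

```python
import hashlib
from itertools import product

def quantize(value, step, offset):
    # shifting by half a step moves values near a boundary to the other bucket
    return int((value + offset) // step)

def offset_hashes(hsv_blocks, step=25):
    # Return the 2^3 = 8 hashes obtained by offsetting none/any of H, S, V.
    hashes = []
    for offsets in product((0, step // 2), repeat=3):
        digest = hashlib.sha1()
        for h, s, v in hsv_blocks:
            digest.update(bytes((
                quantize(h, step, offsets[0]) & 0xFF,
                quantize(s, step, offsets[1]) & 0xFF,
                quantize(v, step, offsets[2]) & 0xFF,
            )))
        hashes.append(digest.hexdigest())
    return hashes
```

Two pictures would then be treated as hash-equal if any of their 8 signatures coincide.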

ghost commented 9 years ago

Yep, I think we understood each other (although I didn't realize we would need 8 hashes). In any case, once the algo is done, it should be easy to compare its efficiency with the old algo in the "real world".

lambder commented 9 years ago

I was reading on the topic last night. Something like ANN (Approximate Nearest Neighbor) search could be useful; in principle it relies on special, locality-aware hashing. http://web.mit.edu/andoni/www/papers/cSquared.pdf
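Not the scheme from that paper, but a minimal random-hyperplane LSH sketch just to show what "locality-aware hashing" means here; the vector dimension and number of planes are arbitrary:

```python
import random

def lsh_signature(vector, planes):
    # one bit per hyperplane: which side of the plane the vector falls on;
    # nearby vectors tend to produce identical signatures
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vector)) >= 0 else 0
        for plane in planes
    )

random.seed(0)
dim = 48        # e.g. a 4x4 grid of block averages x 3 channels
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
```

Candidate duplicates would then only be the pictures whose signatures land in the same bucket.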

Eucaly commented 8 years ago

For fuzzy matching on audio and image files, the reference below might be useful.

I will add some more references on "fuzzy file name" matching to issue #223.