FineUploader / fine-uploader

Multiple file upload plugin with image previews, drag and drop, progress bars. S3 and Azure support, image scaling, form support, chunking, resume, pause, and tons of other features.
https://fineuploader.com
MIT License
8.19k stars 1.87k forks source link

13 - Duplicate Image Detection #1210

Closed feltnerm closed 10 years ago

feltnerm commented 10 years ago

Attempt to detect duplicate image uploads client-side before upload.

Concerns are performance when iterating over the pixels of large (hi-res) image files, but that may be able to be mitigated by shrinking the image.

rnicholus commented 10 years ago

Let's consider creating a new repo that tackles this problem. We can also blog about using the new library with Fine Uploader on the Fine Uploader blog.

rnicholus commented 10 years ago

We need to define the API for this new module/library. Will schedule a meeting to discuss internally.

rnicholus commented 10 years ago

I think it is prudent to develop this as a new repo outside of Fine Uploader.

Here are my current thoughts on this library (subject to editing):

Use Cases

Features

API

Implementation/Algorithms

Questions

feltnerm commented 10 years ago

Entirely client-side

This should be an end-goal, definitely. Might be an easy win to make a CommonJS exported function that deals with raw binary data (Uint8Array, Buffer, etc.), and then build client-side functionality (canvas->buffer, image tag->buffer, etc.) on top of that.

API

module.exports = function(threshold, image_1, image_2[, ..., image_n]) {

}

Implementation / Algorithms

Image Comparison

Perceptual Hash

A hashing function which will return hashes that are "close" to each other if the features of the hashed object are similar.

Histogram Matching

Not entirely sure what is going on here. Somehow the image data is transformed into a histogram (similar to Fourier transform maybe?). Then you can subtract histograms from each other to detect similarity/difference.

Haar Wavelet

Again, not really sure what this is, but I'm seeing it pop-up a lot on the literature surrounding dup. image detection.

Storing Results

k-d tree

Spinning off of the tree idea proposed by Ray, I found about a k-d tree (a k-dimensional tree). Since each pixel is composed of 4 "vectors" (R,G,B, and A) we would say each vector is a dimension in the k-d tree. Seems to have good efficiency.

Bloom Filter

A bloom filter could be used to store a set of the unique images from the user, and then we can test whether newly added images are within that set (within a certain probability of a false positive).

Pros: efficient Cons: can only detect if something is NOT in the set, or there in the set within a certain probability

/cc'ing @aswenson and @uriahcarpenter to see if they have any ideas :)

feltnerm commented 10 years ago

/cc'ing @nmall as well

feltnerm commented 10 years ago

i put together a simple jsbin that compares two images, pixel by pixel, in the browser. it's fairly speedy, surprisingly. obviously would slow down as more files are added. The comparison (right now) is only between two images, not against the set of all inputted images.

Feel free to edit and/or provide suggestions.

Since the image data is now in a more malleable form, I think next I'd like to create a way to experiment with different comparison algorithms and do some profiling of the different algos.

starting point?

rnicholus commented 10 years ago

It's possible that I'm missing something (I didn't have time to look over the code closely) but it seems like all of this can be done in a very small amount of code, and FileReader shouldn't be necessary.

To compare two images, pixel by pixel, all you need to do is:

  1. Get an object URL for each Blob
  2. Convert those to 2 <img> elements
  3. Draw the <img> elements onto 2 <canvas> elements (you can resize them as part of this process).
  4. Iterate over the ImageData for each image (from the canvas' associated RenderingContext) and stop when you reach the last subpixel or find a difference.
feltnerm commented 10 years ago

Ah yes, you're correct. I knew something was up when I was using FileReader. URL.createObjectUrl is what was needed to Blob -> <img>.

Updated, and removed unused code: http://jsbin.com/nosamu/23

rnicholus commented 10 years ago

Phew. Seems like a lot of work. Maybe in another repo, in another life.