feltnerm closed this issue 10 years ago.
Let's consider creating a new repo that tackles this problem. We can also blog about using the new library with Fine Uploader on the Fine Uploader blog.
We need to define the API for this new module/library. Will schedule a meeting to discuss internally.
I think it is prudent to develop this as a new repo outside of Fine Uploader.
Here are my current thoughts on this library (subject to editing):
Allow `<img>`, `<canvas>`, or image `Blob`s to be passed in for analysis. Promise fulfilled with an array, each element containing an array of detected identical/similar images.
Entirely client-side
This should be an end-goal, definitely. Might be an easy win to make a CommonJS exported function that deals with raw binary data (Uint8Array, Buffer, etc.), and then build client-side functionality (canvas->buffer, image tag->buffer, etc.) on top of that.
```js
module.exports = function (threshold, image_1, image_2 /* , ..., image_n */) {
};
```
A locality-sensitive ("perceptual") hashing function which will return hashes that are "close" to each other if the features of the hashed images are similar.
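To make the "close hashes" idea concrete, here is a minimal average-hash (aHash) sketch, one of the simplest perceptual hashes. It assumes the image has already been shrunk to a tiny fixed size (say 8x8) and converted to a flat array of grayscale values; `averageHash` and `hammingDistance` are hypothetical names, not part of any proposed API.

```js
// aHash sketch: each bit records whether a pixel is brighter than
// the image's mean brightness. Assumes `gray` is a flat array of
// grayscale values (0-255) from an already-downscaled image.
function averageHash(gray) {
  var mean = gray.reduce(function (a, b) { return a + b; }, 0) / gray.length;
  return gray.map(function (v) { return (v > mean) ? 1 : 0; });
}

// Similar images yield hashes with a small Hamming distance
// (number of differing bits); identical images yield distance 0.
function hammingDistance(a, b) {
  var d = 0;
  for (var i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) d++;
  }
  return d;
}
```

A caller would compare `hammingDistance(averageHash(a), averageHash(b))` against the `threshold` argument from the API sketch above.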
Not entirely sure what is going on here. The image data is reduced to a histogram (a distribution of pixel intensities or colors). Then you can subtract one histogram from another to get a measure of similarity/difference.
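One simple version of the histogram idea, sketched here with hypothetical helpers: bucket pixel luminance into a fixed number of bins, normalize, and take the sum of absolute differences as the distance (0 means identical distributions). This assumes RGBA data like the flat array you get from canvas `ImageData`.

```js
// Build a normalized luminance histogram from flat RGBA pixel data
// (4 subpixels per pixel, values 0-255).
function luminanceHistogram(rgba, bins) {
  bins = bins || 16;
  var hist = new Array(bins).fill(0);
  var total = 0;
  for (var i = 0; i < rgba.length; i += 4) {
    // Rough luminance from the R, G, B channels.
    var lum = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    hist[Math.min(bins - 1, Math.floor((lum / 256) * bins))]++;
    total++;
  }
  // Normalize so images of different sizes are comparable.
  return hist.map(function (count) { return count / total; });
}

// "Subtracting" histograms: sum of absolute differences per bin.
function histogramDistance(h1, h2) {
  var d = 0;
  for (var i = 0; i < h1.length; i++) d += Math.abs(h1[i] - h2[i]);
  return d;
}
```

Note this only compares overall color/brightness distribution, so two very different images with similar palettes can look "identical" to it.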
Again, not really sure what this is, but I'm seeing it pop up a lot in the literature surrounding duplicate image detection.
Spinning off of the tree idea proposed by Ray, I found out about the k-d tree (a k-dimensional tree). Since each pixel is composed of 4 components (R, G, B, and A), we could treat each component as a dimension in the k-d tree. Seems to have good efficiency.
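For reference, a minimal k-d tree sketch: build by median splits cycling through the dimensions, then do a nearest-neighbor search that only descends the "far" side of a split when the splitting plane is closer than the best match so far. All names here are hypothetical, and the distance is squared Euclidean.

```js
function buildKdTree(points, depth) {
  depth = depth || 0;
  if (points.length === 0) return null;
  var axis = depth % points[0].length; // cycle through dimensions
  var sorted = points.slice().sort(function (a, b) { return a[axis] - b[axis]; });
  var mid = Math.floor(sorted.length / 2); // median split
  return {
    point: sorted[mid],
    left: buildKdTree(sorted.slice(0, mid), depth + 1),
    right: buildKdTree(sorted.slice(mid + 1), depth + 1)
  };
}

function distSq(a, b) {
  var s = 0;
  for (var i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
  return s;
}

function nearest(node, target, depth, best) {
  if (node === null) return best;
  depth = depth || 0;
  var d = distSq(node.point, target);
  if (best === undefined || d < best.dist) best = { point: node.point, dist: d };
  var axis = depth % target.length;
  var diff = target[axis] - node.point[axis];
  var near = diff < 0 ? node.left : node.right;
  var far = diff < 0 ? node.right : node.left;
  best = nearest(near, target, depth + 1, best);
  // Only search the far side if the splitting plane could hide a closer point.
  if (diff * diff < best.dist) best = nearest(far, target, depth + 1, best);
  return best;
}
```

In practice you'd probably index one feature vector per image (e.g. a hash or histogram) rather than one point per pixel, or the tree gets enormous.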
A bloom filter could be used to store a set of the unique images from the user, and then we can test whether newly added images are within that set (within a certain probability of a false positive).
Pros: efficient.
Cons: can only tell you definitively that something is NOT in the set; a positive result only means the item is probably in the set (false positives are possible).
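A toy Bloom filter sketch to illustrate the pros/cons above. The hash functions here are simplistic seeded FNV-1a variants chosen for brevity; everything (names included) is hypothetical, not production code.

```js
function BloomFilter(numBits, numHashes) {
  this.bits = new Uint8Array(numBits);
  this.numHashes = numHashes;
}

// Seeded FNV-1a-style hash, reduced to a bit index.
BloomFilter.prototype.hash = function (str, seed) {
  var h = 2166136261 ^ seed;
  for (var i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % this.bits.length;
};

BloomFilter.prototype.add = function (str) {
  for (var s = 0; s < this.numHashes; s++) this.bits[this.hash(str, s)] = 1;
};

// false => definitely not in the set;
// true  => probably in the set (false positives possible).
BloomFilter.prototype.mightContain = function (str) {
  for (var s = 0; s < this.numHashes; s++) {
    if (this.bits[this.hash(str, s)] === 0) return false;
  }
  return true;
};
```

For the dedupe use case, the keys added would presumably be image hashes (e.g. the perceptual hashes discussed above) rather than raw image bytes.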
/cc'ing @aswenson and @uriahcarpenter to see if they have any ideas :)
/cc'ing @nmall as well
I put together a simple jsbin that compares two images, pixel by pixel, in the browser. It's surprisingly speedy, though it would obviously slow down as more files are added. The comparison (right now) is only between two images, not against the set of all inputted images.
Feel free to edit and/or provide suggestions.
Since the image data is now in a more malleable form, I think next I'd like to create a way to experiment with different comparison algorithms and do some profiling of the different algos. Sound like a reasonable starting point?
It's possible that I'm missing something (I didn't have time to look over the code closely), but it seems like all of this can be done in a very small amount of code, and `FileReader` shouldn't be necessary.
To compare two images, pixel by pixel, all you need to do is:

1. Convert the `Blob`s to `<img>` elements.
2. Draw the `<img>` elements onto 2 `<canvas>` elements (you can resize them as part of this process).
3. Grab the `ImageData` for each image (from each canvas's associated `RenderingContext`) and iterate, stopping when you reach the last subpixel or find a difference.

Ah yes, you're correct. I knew something was up when I was using `FileReader`. `URL.createObjectURL` is what was needed to go from `Blob` to `<img>`.
Updated, and removed unused code: http://jsbin.com/nosamu/23
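For reference, the comparison step itself boils down to something like the sketch below. It operates on `ImageData`-shaped objects (`{ width, height, data }`, where `data` is the flat RGBA array you get from `getImageData()`), so it can run outside the DOM; the function name is hypothetical.

```js
// Pixel-by-pixel comparison of two ImageData-like objects.
// Bails out at the first differing subpixel.
function sameImage(a, b) {
  if (a.width !== b.width || a.height !== b.height) return false;
  for (var i = 0; i < a.data.length; i++) {
    if (a.data[i] !== b.data[i]) return false;
  }
  return true;
}
```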
Phew. Seems like a lot of work. Maybe in another repo, in another life.
Attempt to detect duplicate image uploads client-side before upload.
The main concern is performance when iterating over the pixels of large (hi-res) image files, but that could be mitigated by shrinking the images first.