AtlasOfLivingAustralia / image-service

Image repository and tiling services
https://images.ala.org.au

Prevent duplicate images #26

Closed nickdos closed 5 years ago

nickdos commented 8 years ago

Suggestion from @sadeghim is to store a checksum on each image and then to warn (flag) or error when a new image is uploaded that has the same checksum.
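A minimal sketch of the checksum idea, using only Python's standard `hashlib`. The function names and the `known_sha1` set are hypothetical and not part of image-service. MD5 and SHA-1 are used because a later comment in this thread says those are the hashes the service already stores.

```python
import hashlib


def image_checksums(data: bytes) -> dict:
    """Return hex MD5 and SHA-1 digests of the raw image bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def is_duplicate(data: bytes, known_sha1: set) -> bool:
    """True if an image with the same SHA-1 digest has been seen before.

    `known_sha1` is a hypothetical in-memory stand-in for a lookup against
    the checksums already stored for existing images.
    """
    return image_checksums(data)["sha1"] in known_sha1


if __name__ == "__main__":
    seen = {image_checksums(b"first upload")["sha1"]}
    print(is_duplicate(b"first upload", seen))   # same bytes, so flagged
    print(is_duplicate(b"another image", seen))  # different bytes
```

Note that a checksum only catches byte-identical files. The same picture re-encoded or resized produces a different digest and would not be flagged.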

It would also be good to have a "find duplicate images" function for existing images. How to handle historical duplicate images is another matter that would require further investigation.

davidbairdala commented 8 years ago

Hi Nick,

There are already two checksums stored for each image: MD5 and SHA1 hashes. There is also some duplicate image detection code, in the form of a duplicate images report (under the admin section, although it is quite expensive to run!). I spoke to Dave M a couple of times about duplicate image handling, but no resolution was reached. It is certainly possible to detect a duplicate on ingest and simply return the existing image instead of storing the image multiple times, although I think you might also want to track the link as a separate noun in the database.
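To make the ingest idea concrete, here is a rough sketch, assuming an in-memory store in place of the real database. `ImageStore`, `ingest`, and the `links` list are hypothetical names for illustration only. They show one way a duplicate upload could return the existing image while the upload itself is still recorded as a separate link.

```python
import hashlib


class ImageStore:
    """Hypothetical in-memory stand-in for the image table and a link table."""

    def __init__(self):
        self.images = {}    # image_id -> raw bytes (stored once per unique image)
        self.by_sha1 = {}   # sha1 hex digest -> image_id
        self.links = []     # (image_id, source) pairs, one per upload
        self._next_id = 1

    def ingest(self, data: bytes, source: str):
        """Store the image unless its SHA-1 is already known.

        Returns (image_id, was_duplicate). The upload is always recorded
        in `links`, so the source of a duplicate is not lost.
        """
        digest = hashlib.sha1(data).hexdigest()
        existing_id = self.by_sha1.get(digest)
        if existing_id is not None:
            self.links.append((existing_id, source))
            return existing_id, True

        image_id = self._next_id
        self._next_id += 1
        self.images[image_id] = data
        self.by_sha1[digest] = image_id
        self.links.append((image_id, source))
        return image_id, False


if __name__ == "__main__":
    store = ImageStore()
    print(store.ingest(b"pixels", "dataset-a"))
    print(store.ingest(b"pixels", "dataset-b"))  # same bytes, reuses the first id
    print(len(store.images), store.links)
```

Keeping the links separate from the stored bytes means each upload keeps its own provenance even though the image itself is stored only once.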

You could also easily extend the image details page to include a duplicates tab that looks up other images with the same MD5 and/or SHA1 hash.

Actually, just thinking about it a bit more: all the image records are indexed by Elasticsearch to enable fast searching, so the duplicate report could probably be rewritten to use the index instead of hitting the Postgres database, which would make it much faster.
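One possible shape for an index-based report is an Elasticsearch `terms` aggregation with `min_doc_count` set to 2, which returns only hash values shared by at least two documents. The sketch below only builds the request body and reads a response. The field name `contentSHA1` and the aggregation name are assumptions, not the actual image-service mapping, and the field would need to be indexed as an exact-value (keyword) field for the aggregation to work.

```python
import json


def duplicate_report_query(hash_field: str = "contentSHA1",
                           max_groups: int = 100) -> dict:
    """Build a search body that buckets documents by hash value.

    `hash_field` is a hypothetical field name. `size: 0` suppresses the
    hits themselves, since only the aggregation buckets are needed.
    """
    return {
        "size": 0,
        "aggs": {
            "duplicate_hashes": {
                "terms": {
                    "field": hash_field,
                    "min_doc_count": 2,   # keep only hashes seen 2+ times
                    "size": max_groups,   # cap on the number of buckets returned
                }
            }
        },
    }


def duplicate_groups(response: dict) -> dict:
    """Map each duplicated hash to its document count from a search response."""
    buckets = response["aggregations"]["duplicate_hashes"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


if __name__ == "__main__":
    print(json.dumps(duplicate_report_query(), indent=2))
```

The body would be sent to the index's search endpoint. Because bucket counts from a `terms` aggregation can be approximate on a multi-shard index, any hash it reports may be worth confirming against the database before acting on it.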

Cheers, David.

nickdos commented 8 years ago

Hi Dave - good to hear from you. That info is really helpful, thanks.

nickdos commented 8 years ago

Issue #19 is related.