elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate or similar images within folders
https://difpy.readthedocs.io
MIT License

Slow to compare #31

Closed bavetta closed 1 year ago

bavetta commented 1 year ago

I tried using this on a folder of 100,000 1280x720 images, and it is exceptionally slow (about 12 seconds per image at the comparison stage). At this rate it's going to run for two weeks! Is there anything you can do to speed things up?
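For reference, the two-week figure follows from simple arithmetic. The snippet below is only a back-of-the-envelope sketch: it assumes every image takes the same 12 seconds, and the function name `estimated_days` is made up for illustration and is not part of difPy.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimated_days(n_images: int, seconds_per_image: float) -> float:
    """Total runtime in days, assuming a constant cost per image."""
    return n_images * seconds_per_image / SECONDS_PER_DAY


# 100,000 images at about 12 seconds each
print(f"{estimated_days(100_000, 12):.1f} days")  # 13.9 days
```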

elisemercury commented 1 year ago

Hi @bavetta,

Thank you very much for your feedback!

Unfortunately, difPy was not initially designed or tested to handle this amount of data, so please excuse any inconvenience. One thing I would like to make you aware of is that, in the single-folder case, the time required per image decreases as the run progresses, since the difPy algorithm compares the current image only to the images that precede it in the directory. The 12 seconds per image might therefore apply at the beginning of the process, but it will decrease over time, and your program might speed up.
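To give a sense of scale, here is a rough sketch of how many comparisons a one-sided pairwise scheme performs in total. It assumes each unordered pair of images is compared exactly once, which is what comparing each image only against the images on one side of it amounts to. The helper name `unique_pairs` is hypothetical and not a difPy function.

```python
from math import comb


def unique_pairs(n_images: int) -> int:
    """Number of unordered image pairs, i.e. n * (n - 1) / 2."""
    return comb(n_images, 2)


# Sanity check on a small folder: 0 + 1 + 2 + 3 + 4 one-sided comparisons
assert unique_pairs(5) == sum(range(5)) == 10

print(unique_pairs(100_000))  # 4999950000
```

The total grows quadratically with the number of images, so the per-image cost is not constant across the run and a single timing measurement may not be representative of the whole job.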

Also, considering the time it would take a human to compare 100k images, I would still consider 2 weeks a fairly acceptable time, though I fully understand your concern.

The difPy algorithm does require a minimum amount of processing time, as it has to convert each image into a tensor and then compress it before doing any comparison. That said, I have two suggestions you could try, which might improve the overall processing time:

I hope this helps! In case you have any ideas on how to make the difPy algorithm faster or more compute-efficient, please feel free to let me know, or to directly apply the changes yourself and open a pull request to the repo. That would be of great help to the community!

Again, thank you for your input, and all the best, Elise