elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate or similar images within folders
https://difpy.readthedocs.io
MIT License
421 stars 65 forks source link

Refactoring - Optional Merge #45

Closed qtc-de closed 1 year ago

qtc-de commented 1 year ago

Hi Elise :wave:

first of all, cool idea! I recently needed to compare a large chunks of images and your approach for comparing them worked pretty well :+1:

That being said, in the current implementation it is rather slow. Comparing larger chunks of images (15000+) takes a while. Moreover, you use a lot of different dependencies where some of them are quite large (e.g. opencv). This makes it difficult to install the tool in specific environments like within a Docker container.

Since I probably need to compare images in future again, I thought of improving these issues. This pull request provides the results. Before talking about the changes, let me apologize for the huge pull request. I actually do not like larger pull requests for my own repos and prevent from doing them to other persons as well. However, the dependency changes and especially the multiprocessing required a larger restructuring of your tool. Therefore, I totally understand if you do not want to merge the changes. In this case, I'm fine with maintaining a fork of your repository that provides an alternative implementation. Just decide as you like :)

Here is a brief summary of the changes I made:

  1. Make a clearer cut between CLI and library. The CLI script is now contained in /bin/difpy, while the code in /difPy/difPy.py only contains the library implementation.
  2. Reduce dependencies. The whole technique you describe can be implemented using numpy and Pillow. This makes it possible to create a Docker container running difPy that has only 161MB. Before, with opencv, we were around 1.2GB.
  3. Add multiprocessing. Work can now be distributed between different cores, which should speed up the operation quite a bit for larger image sets.
  4. Add a fast compare option. When image A is similar to image B, one probably does not want to compare B to other images, but is fine with only comparing A with others from here. Sure, this may misses some edge case duplicates, but in most situations it should be fine and provides a huge speedup for the operation.
  5. Change the command line layout. Feels now more intuitive (at least to me :D)
  6. Change the output format. The output format is still JSON based, but does not include much statistic information now. The regular end user is probably not that interested on when a comparison took place, but more on the actual comparison result. The new reduced output format should be easier to read / parse.
  7. Add a Dockerfile for building a container running difpy.

As I said, many changes. Just think about whether you want to merge or whether we keep these changes in a separate fork. I'm fine with both approaches :wink:

Best Tobias

elisemercury commented 1 year ago

Hi Tobias!

Thank you so much for your feedback and your effort in making difPy better and more efficient! I really appreciate the effort you invested in this. I had a quick look at your changes and they seem very clean at first sight. I also appreciate a lot that you adjusted the README to your changes. Let me go through the code more in detail and test the changes in the next few days. After that, I'll decide whether we can merge and deploy them.

Again, thank you very much for this!

All the best, Elise