elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate or similar images within folders
https://difpy.readthedocs.io
MIT License
421 stars 65 forks

feature request: chunking of source folder #22

Closed: ALCarter2 closed this issue 1 year ago

ALCarter2 commented 2 years ago

Thank you for your library! Just a heads-up that I edited one of your previous versions, adding a parameter that allows the src folder to be split into n chunks for processing. Scenario: I have image folders that contain over 50000 images in sequential time order.

For me, an image file is most likely to be a duplicate of other image files added around the same time. Comparing each image against the entire 50000+ took an enormous amount of time. So I made it possible to split the folder into chunks of 5000 (for example) and evaluate it in sections. This also allowed me to restart from a given position if I had to stop evaluation for some reason. I added a little more to make it more robust (for example, chunk n+1 also includes some files from the previous chunk, so that there is some degree of overlap). Anyway, this worked out well for me, and if you are still adding to this library, I found it to be very useful.
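To make the idea concrete, here is a minimal sketch of chunking with overlap and a restart position. It is not the actual modification described above and not part of difPy's API: the function name `overlapping_chunks` and the parameters `chunk_size`, `overlap`, and `start_chunk` are hypothetical, and the file list is assumed to be already sorted in time order (for example by filename or modification time).

```python
from typing import Iterator, List, Sequence, Tuple


def overlapping_chunks(
    files: Sequence[str],
    chunk_size: int,
    overlap: int = 0,
    start_chunk: int = 0,
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (chunk_index, chunk_files) pairs over a time-ordered file list.

    Every chunk after the first is prefixed with the last `overlap` files
    of the previous chunk, so duplicates that straddle a chunk boundary
    can still be compared. `start_chunk` lets a stopped run resume.
    All names here are hypothetical, not difPy parameters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Number of chunks, rounding up so a short final chunk is included.
    n_chunks = (len(files) + chunk_size - 1) // chunk_size
    for index in range(start_chunk, n_chunks):
        begin = max(0, index * chunk_size - overlap)
        end = min(len(files), (index + 1) * chunk_size)
        yield index, list(files[begin:end])


if __name__ == "__main__":
    # Placeholder filenames standing in for a time-ordered folder listing.
    files = [f"img_{i:03d}.jpg" for i in range(12)]
    for index, chunk in overlapping_chunks(files, chunk_size=5, overlap=2):
        print(index, chunk[0], "...", chunk[-1], f"({len(chunk)} files)")
```

Each yielded chunk would then be passed to the duplicate search on its own. If every image in a set is compared against every other, 50000 images mean 50000 × 49999 / 2 = 1,249,975,000 pairs, while ten chunks of 5000 mean 10 × (5000 × 4999 / 2) = 124,975,000 pairs before overlap, roughly a tenfold reduction.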

The route I took is not going to be as thorough as going through EVERY image each time, but in my personal tests the results were close enough and the time savings were significant! Cheers,

elisemercury commented 2 years ago

Dear @ALCarter2, Thanks a lot for your input and idea! Indeed, I agree that this feature can be very helpful and might significantly increase difPy's performance. Feel free to open a pull request with your version and I will be happy to review it. Again, thanks! All the best, Elise