elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate or similar images within folders
https://difpy.readthedocs.io
MIT License

Erroneous results on particular image set #37

Closed: MarcG2 closed this issue 1 year ago

MarcG2 commented 1 year ago

I've been testing various image sets to isolate a bug, and I got weird results on this one. There are no duplicates or similar images in this set, yet the reported similarity was too high. For example, the first result listed 32 duplicates, with many of the files listed more than once.

difPy output.zip

The image set can be downloaded here, since it's too big to post: https://drive.google.com/file/d/1pbl7SttHF-mB35V1Q5ehj6A5wCb68o3B/view?usp=sharing

elisemercury commented 1 year ago

Hi @MarcG2, thanks a lot for your feedback! v2.4.3 has just been released, and it includes a fix for a major bug found in v2.4.2. Please update to the newest version, as this will likely fix your issue. If the issue persists, please let me know and I'll take a detailed look at it! Again, thanks and all the best, Elise

MarcG2 commented 1 year ago

I just tested out the new version on the same set of images. It appears that the duplicate detection algorithm yields many false positives on grayscale images, such as pages from a manga comic. Is this expected behavior?
I haven't dug into the code since I don't use Python. I can try changing some of the search parameters. What's the default value for px_size?

elisemercury commented 1 year ago

Hi @MarcG2, I double-checked your issue using your data set, and indeed, I found an issue that occurred when decoding certain types of black-and-white images, as you mentioned. This should now be fixed in v2.4.4: I adjusted difPy's decoding algorithm, which now correctly decodes all image types. Please let me know if you run into any other issue and I'll be happy to help! Again, thanks a lot for your input and all the best, Elise
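
For anyone reading along, here is a minimal sketch of how a decoding mismatch on grayscale images could be avoided before comparing. It is not difPy's actual code. It assumes images are compared by mean squared error over pixel values, and the helper names `to_rgb` and `mse` are hypothetical. Images are plain nested lists so the example runs without any image library.

```python
def to_rgb(pixels):
    """Normalize a decoded image to rows of (r, g, b) tuples.

    Hypothetical helper: accepts grayscale rows of ints or
    color rows of tuples (any alpha channel is dropped).
    """
    rgb = []
    for row in pixels:
        new_row = []
        for px in row:
            if isinstance(px, int):
                # Grayscale value: replicate it across three channels.
                new_row.append((px, px, px))
            else:
                new_row.append(tuple(px[:3]))
        rgb.append(new_row)
    return rgb


def mse(img_a, img_b):
    """Mean squared error over all channels of two same-sized images."""
    a, b = to_rgb(img_a), to_rgb(img_b)
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for px_a, px_b in zip(row_a, row_b):
            for ch_a, ch_b in zip(px_a, px_b):
                total += (ch_a - ch_b) ** 2
                count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    return total / count


# A 2x2 grayscale page, the same page decoded as RGB, and its inverse.
gray_page = [[0, 255], [255, 0]]
same_page_rgb = [[(0, 0, 0), (255, 255, 255)], [(255, 255, 255), (0, 0, 0)]]
other_page = [[255, 0], [0, 255]]

print(mse(gray_page, same_page_rgb))  # 0.0
print(mse(gray_page, other_page))     # 65025.0
```

The point of the sketch is that both inputs are brought to the same channel layout before any arithmetic, so a grayscale page and a color page are compared on equal terms.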