elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate or similar images within folders
https://difpy.readthedocs.io
MIT License

Fail to detect pictures compressed to a lower resolution #81

Closed (qiulang closed this issue 7 months ago)

qiulang commented 8 months ago

difPy fails to match a picture that was compressed to a lower resolution with its original.

This is very easy to reproduce. Here is a high-resolution picture (file size 1.5 MB):

14541697714511_ pic_hd

I compressed it to a low-resolution picture (file size 572 KB):

14541697714511_ pic_hd

But difPy fails to detect them as duplicates.

radry commented 8 months ago

Different compression means the image is not an exact duplicate. The "duplicates" setting only detects identical copies.

You have to set similarity to "similar" or to a low MSE value (int or float).
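
To illustrate why a recompressed copy stops being an exact duplicate, here is a minimal sketch of a mean squared error (MSE) comparison in plain Python. It is not difPy's actual code: the `mse` helper and the tiny 2x2 grayscale "images" are made up for illustration, and treating an exact duplicate as an MSE of 0 is an assumption based on the reply above.

```python
def mse(image_a, image_b):
    """Mean squared error between two equally sized grids of pixel values."""
    flat_a = [pixel for row in image_a for pixel in row]
    flat_b = [pixel for row in image_b for pixel in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same dimensions")
    return sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)


# Hypothetical pixel values: recompression nudges each pixel slightly.
original = [[10, 200], [30, 120]]
exact_copy = [[10, 200], [30, 120]]
recompressed = [[12, 198], [29, 121]]

print(mse(original, exact_copy))    # 0.0
print(mse(original, recompressed))  # 2.5
```

Even small per-pixel changes push the MSE above 0, so a strict exact-match check no longer fires and a tolerance is needed.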

qiulang commented 8 months ago

Hi, thanks for the reply. I set similarity to "similar" and got some results, but the output is hard to interpret (see the output below).

What do the numbers at the front, e.g. "167309613050256579586006204721460896834", mean, and what do the mse numbers mean?

Thanks.

{167309613050256579586006204721460896834: {'location': 'img/14541697714511_.pic_hd.jpg',
                                           'matches': {331450324146141965362780397418259729137: {'location': 'img/14541697714511_.pic.jpg',
                                                                                                 'mse': 3.0162666666666667},
                                                       155853290248054212903419878371039499669: {'location': 'img/14541697714511_.pic_hd.jpeg',
                                                                                                 'mse': 2.9806666666666666}}},
 92442402207312713123420905016029999381: {'location': 'img/dugong.jpeg',
                                          'matches': {229799943347428917403049183489235913460: {'location': 'img/22164408-6BA4-4AB6-909E-42F082CE0D9C.jpeg',
                                                                                                'mse': 0.15653333333333333}}},
 269147745042971525953486893492139519471: {'location': 'img/B006E79C-D831-4ED3-9E36-CC5F4F0243E9.jpeg',
                                           'matches': {270150075243259920654568129498975620068: {'location': 'img/1ADE1078-9C3A-4B5D-A810-EEFF28F58B08.jpeg',
                                                                                                 'mse': 0.16146666666666668}}},
 74452427693317413832491932578110103003: {'location': 'img/we3.jpg',
                                          'matches': {17178382869503454841239408487831050153: {'location': 'img/we32.jpg',
                                                                                               'mse': 30.981333333333332}}}}

radry commented 8 months ago

https://difpy.readthedocs.io/en/latest/usage.html#output

The number in front is a random ID.
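
Since the IDs carry no meaning of their own, one way to read the output is to ignore the keys and walk only the values. The sketch below does that on an abbreviated copy of the result posted above (first entry only). The `flatten_matches` helper is hypothetical and not part of difPy.

```python
# Abbreviated copy of the result posted above (first entry only).
result = {
    167309613050256579586006204721460896834: {
        "location": "img/14541697714511_.pic_hd.jpg",
        "matches": {
            331450324146141965362780397418259729137: {
                "location": "img/14541697714511_.pic.jpg",
                "mse": 3.0162666666666667,
            },
            155853290248054212903419878371039499669: {
                "location": "img/14541697714511_.pic_hd.jpeg",
                "mse": 2.9806666666666666,
            },
        },
    },
}


def flatten_matches(result):
    """Return (image, matched_image, mse) tuples, dropping the random IDs."""
    pairs = []
    for entry in result.values():
        for match in entry["matches"].values():
            pairs.append((entry["location"], match["location"], match["mse"]))
    return pairs


pairs = flatten_matches(result)
for image, matched, score in pairs:
    print(f"{image} ~ {matched} (mse={score:.2f})")
```

Each outer entry is one image, and its "matches" dictionary lists the images that were found similar to it, each with its own mse.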

The mse number says how similar the two images are. A lower number means more similar. In your example, only the last pair, with an mse of 30.98, might actually be different images.
You have to experiment and find a good threshold for your data. When you set similarity to "similar", the default mse threshold is 50, but you can set similarity to any float or integer value, which then serves as the threshold for deciding whether two images count as similar.
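
As a rough way to experiment with thresholds, the sketch below applies two different cut-offs to mse values copied from the output posted above. The `within_threshold` helper is hypothetical, the value 10 is an arbitrary example, and whether difPy itself compares with `<=` or `<` is an assumption.

```python
# (image, matched image, mse) values copied from the output posted above.
matches = [
    ("img/14541697714511_.pic_hd.jpg", "img/14541697714511_.pic.jpg", 3.0162666666666667),
    ("img/dugong.jpeg", "img/22164408-6BA4-4AB6-909E-42F082CE0D9C.jpeg", 0.15653333333333333),
    ("img/we3.jpg", "img/we32.jpg", 30.981333333333332),
]


def within_threshold(matches, threshold):
    """Keep only the pairs whose mse does not exceed the threshold."""
    return [pair for pair in matches if pair[2] <= threshold]


# With the default of 50 mentioned above, all three pairs count as similar.
print(len(within_threshold(matches, 50)))  # 3

# A stricter, arbitrarily chosen threshold of 10 drops the we3/we32 pair.
for image, matched, score in within_threshold(matches, 10):
    print(f"{image} ~ {matched} (mse={score:.2f})")
```

Running a broad search once and filtering the saved results like this can be quicker than re-running the search for every candidate threshold.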

elisemercury commented 7 months ago

Thanks @radry for handling this question!