elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate or similar images within folders
https://difpy.readthedocs.io
MIT License

Some images skipped #14

Closed tme-MetaSwitch closed 2 years ago

tme-MetaSwitch commented 2 years ago

I was surprised to find that difPy skipped my entire directory of JPEG images. I tracked this down to the call to imghdr.what(), which looks for the string "JFIF" or "Exif" near the start of the file. Neither is present in my images, so it seems that imghdr.what() cannot be relied upon.
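To illustrate the failure mode, here is a minimal sketch of that kind of marker test next to a looser check on the JPEG start-of-image bytes. Both helper names are hypothetical and are not part of difPy or imghdr. The marker test mirrors how older versions of imghdr recognise JPEG files, as far as I can tell, and the Adobe-style header is only an illustrative example, not necessarily what the affected files contained.

```python
def looks_like_jpeg_by_marker(header: bytes) -> bool:
    """Marker-based test: bytes 6..9 must spell JFIF or Exif (hypothetical helper)."""
    return header[6:10] in (b"JFIF", b"Exif")


def looks_like_jpeg_by_soi(header: bytes) -> bool:
    """Looser test: SOI marker FF D8 followed by the FF that opens the next segment."""
    return header[:3] == b"\xff\xd8\xff"


# A typical JFIF header: SOI, APP0 marker, segment length, then "JFIF".
jfif_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

# An illustrative JPEG whose first segment is APP14 ("Adobe"), with no JFIF or Exif tag.
adobe_header = b"\xff\xd8\xff\xee\x00\x0eAdobe\x00"

for name, header in (("jfif", jfif_header), ("adobe", adobe_header)):
    print(name, looks_like_jpeg_by_marker(header), looks_like_jpeg_by_soi(header))
```

The second header is a plausible JPEG start that the marker test rejects, which is the kind of file that would be skipped.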

I updated my local copy of the code to do this instead. I expect cv2.imdecode is much better at determining whether a file is a valid image or not.

        img_path = os.path.join(directory, filename)
        if not os.path.isdir(img_path):
            try:
                img = cv2.imdecode(np.fromfile(img_path, dtype=np.uint8), cv2.IMREAD_UNCHANGED)
                [...]
            except Exception:
                pass

Again, a suggestion and not a request.

elisemercury commented 2 years ago

Hi tme-MetaSwitch

Thanks a lot for your input! After doing some research myself, it does seem that cv2.imdecode is the more reliable way of checking whether a file is a valid image. This improvement has therefore been taken into account in the new release, v2.2.

Again, thanks and all the best, Elise