AkshDesai04 / PyCompare

This Python project aims to efficiently compare large datasets of images to identify duplicates
MIT License

Image Hashing #27

Closed GrayHat12 closed 1 week ago

GrayHat12 commented 1 month ago

Hey, I came across this repo through Reddit today and it piqued my interest. I had a few ideas after going through a lot of articles on image comparison for duplicates. I'll link the sources below. I have implemented some changes to the codebase, including some refactoring and restructuring.

I have implemented a basic pHash algorithm that creates a hash for each image. The hash can be converted to a string, and the difference between two hashes represents the Hamming distance between them. In simple terms, if (hash1 - hash2) == 0, the images are exact copies. The difference between two hashes lies in the range 0-1 inclusive.
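
For readers unfamiliar with the technique, here is a minimal sketch of a DCT-based perceptual hash and a 0-1 difference between two hashes. It is not the code from this PR: the names `low_freq_dct`, `phash_bits`, and `hash_difference` are hypothetical, the input is assumed to be a square grayscale grid that an image library has already decoded and resized, and dividing the Hamming distance by the bit count is an assumption that would explain the 0-1 range.

```python
import math
import statistics


def low_freq_dct(pixels, k=8):
    """Return the k*k lowest-frequency terms of an unnormalized 2D DCT-II.

    `pixels` is assumed to be a square grid (list of rows) of grayscale values.
    """
    n = len(pixels)
    # basis[u][x] = cos((2x + 1) * u * pi / (2n)), shared by rows and columns.
    basis = [[math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n)]
             for u in range(k)]
    coeffs = []
    for u in range(k):
        for v in range(k):
            total = 0.0
            for x in range(n):
                row = pixels[x]
                bu = basis[u][x]
                for y in range(n):
                    total += row[y] * bu * basis[v][y]
            coeffs.append(total)
    return coeffs


def phash_bits(pixels, k=8):
    """Turn the low-frequency DCT terms into k*k bits by comparing to the median."""
    coeffs = low_freq_dct(pixels, k)
    # The first term (DC) only reflects overall brightness, so it is left
    # out of the median to keep it from skewing the threshold.
    median = statistics.median(coeffs[1:])
    return [1 if c > median else 0 for c in coeffs]


def hash_difference(bits_a, bits_b):
    """Hamming distance divided by the bit count, so the result is in [0, 1]."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b)) / len(bits_a)


# Synthetic 32x32 test pattern and a uniformly brighter copy of it.
image = [[(7 * x + 13 * y + x * y) % 256 for y in range(32)] for x in range(32)]
brighter = [[p + 10 for p in row] for row in image]

print(hash_difference(phash_bits(image), phash_bits(image)))  # identical input: 0.0
print(hash_difference(phash_bits(image), phash_bits(brighter)))
```

A difference of 0 means every bit matches. Near-duplicates usually produce a small nonzero value, so in practice a threshold is chosen rather than testing for exact equality.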

Please go through the code and let me know if some improvement is required or if I missed something.

Sources :

AkshDesai04 commented 1 month ago

Hey, thank you for the PR and your efforts. I would like to request a small change in the PR.

The project is also supposed to find duplicates when two images are the same but have different resolutions (say, 720p and 4K versions of one picture). The PR compares the images in their original state, so such duplicates won't be found. My approach was to use proxies: converting all images to 100*100 would remove resolution and compression issues. You can propose an alternative method or, in my view, we could create hashes of the proxy objects.
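
The proxy idea can be sketched roughly as follows. This is an illustration only, not the project's implementation: `make_proxy` and `render_scene` are hypothetical names, images are assumed to be grayscale grids held as nested lists, and a simple block average stands in for whatever resize an image library would perform.

```python
import math


def make_proxy(pixels, size=100):
    """Downscale a grayscale grid to size x size by averaging source blocks."""
    height, width = len(pixels), len(pixels[0])
    proxy = []
    for i in range(size):
        r0 = i * height // size
        r1 = max((i + 1) * height // size, r0 + 1)  # at least one source row
        row = []
        for j in range(size):
            c0 = j * width // size
            c1 = max((j + 1) * width // size, c0 + 1)  # at least one source column
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(block) / len(block))
        proxy.append(row)
    return proxy


def render_scene(n):
    """Sample the same synthetic scene at n x n resolution (pixel centers)."""
    return [[128 + 100 * math.sin(6 * (r + 0.5) / n) * math.cos(4 * (c + 0.5) / n)
             for c in range(n)] for r in range(n)]


# The same scene at two resolutions collapses to nearly identical proxies.
high = make_proxy(render_scene(400))
low = make_proxy(render_scene(200))
max_diff = max(abs(a - b) for row_a, row_b in zip(high, low)
               for a, b in zip(row_a, row_b))
print(max_diff)
```

Once both versions are reduced to the same fixed size, any comparison (pixel-wise or hash-based) no longer depends on the original resolution, which is the point of the proxy step.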

Thank you for your contribution. Happy coding :)

GrayHat12 commented 1 month ago

This PR does support finding duplicate images.

For example :

import json
from dataclasses import asdict
from pycompare import get_duplicates

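if __name__ == "__main__":
    # Scan the sample folder and collect groups of duplicate images.
    duplicates = get_duplicates("/home/grayhat/desktop/github/PyCompare/sample")

    # Write each group's metadata as JSON; default=str covers values json cannot serialize.
    with open("duplicated.json", "w") as f:
        json.dump([[asdict(metadata) for metadata in group] for group in duplicates], f, default=str)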

The folder here contains 3 images for this test.

The get_duplicates function is supposed to return a list of duplicate image groups.

I am attaching the output of this test run here:

[
    [
        {
            "width": 4624,
            "height": 3468,
            "channels": 3,
            "size": 7453282,
            "extension": ".jpg",
            "exiftags": {
                "Image ImageWidth": "3468",
                "Image ImageLength": "4624",
                "Image ImageDescription": "",
                "Image Make": "OnePlus",
                "Image Model": "OnePlus Nord CE 2",
                "Image Orientation": "0",
                "Image XResolution": "72",
                "Image YResolution": "72",
                "Image ResolutionUnit": "Pixels/Inch",
                "Image Software": "MediaTek Camera Application",
                "Image DateTime": "2024:02:01 17:49:41",
                "Image YCbCrPositioning": "Co-sited",
                "Image Tag 0x0220": "0",
                "Image Tag 0x0221": "0",
                "Image Tag 0x0222": "0",
                "Image Tag 0x0223": "0",
                "Image Tag 0x0224": "0",
                "Image Tag 0x0225": "",
                "Image ExifOffset": "450",
                "GPS GPSVersionID": "[2, 2, 0, 0]",
                "GPS GPSLatitudeRef": "N",
                "GPS GPSLatitude": "[22, 36, 5033/2500]",
                "GPS GPSLongitudeRef": "E",
                "GPS GPSLongitude": "[72, 49, 37747/5000]",
                "GPS GPSAltitudeRef": "0",
                "GPS GPSAltitude": "0",
                "GPS GPSTimeStamp": "[12, 3, 45]",
                "GPS GPSProcessingMethod": "[65, 83, 67, 73, 73, 0, 0, 0, 67, 69, 76, 76, 73, 68, 0, 0, 0, 0, 0, 0, ... ]",
                "GPS GPSDate": "2024:02:01",
                "Image GPSInfo": "1544",
                "Thumbnail Compression": "JPEG (old-style)",
                "Thumbnail Orientation": "Horizontal (normal)",
                "Thumbnail XResolution": "72",
                "Thumbnail YResolution": "72",
                "Thumbnail ResolutionUnit": "Pixels/Inch",
                "Thumbnail JPEGInterchangeFormat": "2036",
                "Thumbnail JPEGInterchangeFormatLength": "0",
                "Thumbnail YCbCrPositioning": "Co-sited",
                "EXIF ExposureTime": "967/500000",
                "EXIF FNumber": "17/10",
                "EXIF ExposureProgram": "Unidentified",
                "EXIF ISOSpeedRatings": "100",
                "EXIF SensitivityType": "Unknown",
                "EXIF RecommendedExposureIndex": "0",
                "EXIF ExifVersion": "0220",
                "EXIF DateTimeOriginal": "2024:02:01 17:49:41",
                "EXIF DateTimeDigitized": "2024:02:01 17:49:41",
                "EXIF OffsetTimeOriginal": "+05:30",
                "EXIF ComponentsConfiguration": "YCbCr",
                "EXIF ShutterSpeedValue": "4507/500",
                "EXIF ApertureValue": "153/100",
                "EXIF BrightnessValue": "131/10",
                "EXIF ExposureBiasValue": "-10",
                "EXIF MaxApertureValue": "153/100",
                "EXIF MeteringMode": "CenterWeightedAverage",
                "EXIF LightSource": "other light source",
                "EXIF Flash": "Flash did not fire, compulsory flash mode",
                "EXIF FocalLength": "473/100",
                "EXIF MakerNote": "[123, 34, 110, 105, 103, 104, 116, 70, 108, 97, 103, 34, 58, 34, 48, 34, 44, 34, 110, 105, ... ]",
                "EXIF UserComment": "",
                "EXIF SubSecTime": "885",
                "EXIF SubSecTimeOriginal": "885",
                "EXIF SubSecTimeDigitized": "885",
                "EXIF FlashPixVersion": "0100",
                "EXIF ColorSpace": "sRGB",
                "EXIF ExifImageWidth": "3468",
                "EXIF ExifImageLength": "4624",
                "Interoperability InteroperabilityIndex": "R98",
                "Interoperability InteroperabilityVersion": "[48, 49, 48, 48]",
                "EXIF InteroperabilityOffset": "1943",
                "EXIF SensingMethod": "0",
                "EXIF ExposureMode": "Auto Exposure",
                "EXIF WhiteBalance": "Auto",
                "EXIF DigitalZoomRatio": "1",
                "EXIF FocalLengthIn35mmFilm": "0",
                "EXIF SceneCaptureType": "Standard",
                "JPEGThumbnail": "b''"
            },
            "filename": "/home/grayhat/desktop/github/PyCompare/sample/sampleb",
            "filepath": "/home/grayhat/desktop/github/PyCompare/sample/sampleb.jpg",
            "hash": "0x00010x0010a9a46b92aeda264d8970bf41f81e6ba4fc0e3e20f36073c3dcdaeaeb81b4320a"
        },
        {
            "width": 360,
            "height": 360,
            "channels": 3,
            "size": 92549,
            "extension": ".jpg",
            "exiftags": {},
            "filename": "/home/grayhat/desktop/github/PyCompare/sample/samplea",
            "filepath": "/home/grayhat/desktop/github/PyCompare/sample/samplea.jpg",
            "hash": "0x00010x0010a9a56b92ae5a264d8970bf41f81e6ba4fc0e3e30f16063c7d8daeae9a1b4320e"
        }
    ]
]

Here we can see one duplicate group containing the sampleb and samplea images, which are duplicates at different resolutions (4624x3468 and 360x360). This JSON contains the metadata of the images along with the hash, but realistically you only need the hash to check whether images are duplicates.
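
To make that concrete, the two hash strings from the output above can be compared bit by bit. The sketch below rests on an assumption about the string layout that the thread does not confirm: the shared prefix `0x00010x0010` is treated as a header and the remaining 64 hex digits as a 256-bit hash. The names `HEADER`, `payload`, and `hamming` are hypothetical.

```python
HEADER = "0x00010x0010"  # assumed header shared by both hashes in the output


def payload(hash_str):
    """Strip the assumed header and return (integer value, number of bits)."""
    if not hash_str.startswith(HEADER):
        raise ValueError("unexpected hash format")
    digits = hash_str[len(HEADER):]
    return int(digits, 16), len(digits) * 4


def hamming(hash_a, hash_b):
    """Return (differing bits, total bits) for two hash strings."""
    value_a, bits_a = payload(hash_a)
    value_b, bits_b = payload(hash_b)
    if bits_a != bits_b:
        raise ValueError("hashes have different lengths")
    # XOR leaves a 1 wherever the two hashes disagree.
    return bin(value_a ^ value_b).count("1"), bits_a


sampleb = "0x00010x0010a9a46b92aeda264d8970bf41f81e6ba4fc0e3e20f36073c3dcdaeaeb81b4320a"
samplea = "0x00010x0010a9a56b92ae5a264d8970bf41f81e6ba4fc0e3e30f16063c7d8daeae9a1b4320e"

differing, total = hamming(sampleb, samplea)
print(differing, total, differing / total)
```

Under that reading the two hashes differ in 10 of 256 bits, roughly 0.04 on the 0-1 scale. The distance is small but not zero, so grouping presumably relies on a threshold rather than an exact match; the threshold value itself is not shown in this thread.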