KilianB / JImageHash

Perceptual image hashing library used to match similar images
MIT License
397 stars 80 forks

Support for transparent images #38

Closed fbarrella closed 3 years ago

fbarrella commented 5 years ago

I'm having an unusual problem in my project where I'm getting the same hash for two different images. These are the images in question (attached: citroen, mini). The project simply uses the hash(BufferedImage image) method from the API. I'm not sure whether there are different approaches to this, but is there any possible solution to this problem? Thanks in advance!

KilianB commented 5 years ago

Hash collisions happen by design, due to the fact that arbitrarily many images are mapped to a fixed-length hash.

For example, Java's default hashCode implementation converts the string "Hash" to the numeric value 2241838. Looking only at 4-character strings, plenty of other words produce exactly the same hash code.
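Colliding strings are easy to construct by hand: hashCode weights the characters by powers of 31, so incrementing one character while subtracting 31 from the next leaves the sum unchanged. A quick sketch (the colliding words below are illustrative examples derived from that identity, not an official list):

```java
public class HashCollision {
    public static void main(String[] args) {
        // hashCode("Hash") = 'H'*31^3 + 'a'*31^2 + 's'*31 + 'h'
        System.out.println("Hash".hashCode()); // 2241838
        // 's'+1 = 't' and 'h'-31 = 'I' cancel out in the weighted sum
        System.out.println("HatI".hashCode()); // 2241838
        // the same trick applied to the middle character pair
        System.out.println("HbTh".hashCode()); // 2241838
    }
}
```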

This is also true for images. A solution is to use a secondary hash function that looks at different features of the image (e.g. average hash and perceptive hash) to confirm your classification. If both agree, the image is most likely a duplicate.

The chain algorithm example shows how you might approach this:

https://github.com/KilianB/JImageHash/blob/master/src/main/java/com/github/kilianB/examples/ChainAlgorithms.java

fbarrella commented 5 years ago

Thanks!!! But sadly I'm having a bad time trying to find a solution to this. Oddly enough, I've tried multiple of the available hashing methods, but I always ended up with the same digits for both images. Even changing the "bit resolution" didn't work. What intrigues me the most is that, as you can see from the images uploaded in my prior comment, they are indeed different images. I would love to find a way to show that to the hash methods as well, hahaha. Are there any other possibilities in the game? Anyway, thank you very much for the help!

fbarrella commented 5 years ago

I'm also going to leave here the actual piece of code I'm using to test the similarity between the images! Maybe I'm doing something wrong that I can't see right now! I would appreciate some insights!

@PostMapping(value = "/v1.0/hashTest")
public ResponseEntity getImageHashTest(@RequestParam(name="file") MultipartFile file,
                                       @RequestParam(name="file2") MultipartFile file2){
    SingleImageMatcher matcher = new SingleImageMatcher();
    Map<String, String> hashMap = new HashMap<>();

    try {
        BufferedImage image = ImageIO.read(file.getInputStream());
        BufferedImage image2 = ImageIO.read(file2.getInputStream());

        matcher.addHashingAlgorithm(new AverageHash(8), 0.4);
        matcher.addHashingAlgorithm(new AverageHash(32), 0.4);
        matcher.addHashingAlgorithm(new AverageHash(64), 0.4);

        matcher.addHashingAlgorithm(new PerceptiveHash(32), 0.4);
        matcher.addHashingAlgorithm(new PerceptiveHash(64), 0.4);

        matcher.addHashingAlgorithm(new MedianHash(32), 0.4);
        matcher.addHashingAlgorithm(new MedianHash(64), 0.4);

        matcher.addHashingAlgorithm(new DifferenceHash(64, DifferenceHash.Precision.Simple), 0.4);
        matcher.addHashingAlgorithm(new DifferenceHash(32, DifferenceHash.Precision.Triple), 0.4);

        if(matcher.checkSimilarity(image, image2))
            hashMap.put("similarity", "yes");
        else
            hashMap.put("similarity", "no");

        return ResponseEntity.ok(hashMap);
    } catch (IOException e) {
        e.printStackTrace();
    }

    return ResponseEntity.noContent().build();
}
KilianB commented 5 years ago

I did some testing and indeed those images result in the same hash no matter what you try. Upon further investigation, the issue arises from the alpha channel. The black parts of the image are solid black, while the white parts are simply pixels with an opacity of 0. Computing the luminosity values only takes the RGB values into account, and those are identical for every pixel. Are there any guidelines on how transparency should be treated when calculating Y in the YCbCr color model? I assume that for this trivial case an alpha of 0 can be treated as white, but that isn't correct for every single use case. For now, an ugly workaround would be to replace the fully transparent pixels with white until I can figure out how to correctly compute luminosity. (Is there a formula for handling alpha? Always assume white?)
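One reasonable convention (an assumption on my part, not something the library prescribes) is standard alpha compositing: blend each channel over an assumed background value before computing luminance, so a fully transparent pixel takes on the background's value. A minimal sketch using BT.601 luma weights:

```java
public class AlphaLuminance {
    /** Luminance (Y of YCbCr, BT.601 weights) of one ARGB pixel,
     *  composited over an assumed background gray level (0-255). */
    static double luminance(int argb, int bgLevel) {
        int a = (argb >>> 24) & 0xFF;
        int r = (argb >>> 16) & 0xFF;
        int g = (argb >>> 8) & 0xFF;
        int b = argb & 0xFF;
        double alpha = a / 255.0;
        // Composite each channel over the background first
        double rc = alpha * r + (1 - alpha) * bgLevel;
        double gc = alpha * g + (1 - alpha) * bgLevel;
        double bc = alpha * b + (1 - alpha) * bgLevel;
        return 0.299 * rc + 0.587 * gc + 0.114 * bc;
    }

    public static void main(String[] args) {
        int transparentBlack = 0x00000000; // alpha 0, RGB black
        int opaqueBlack      = 0xFF000000; // alpha 255, RGB black
        // Over a white background the invisible pixel reads as white
        System.out.println(Math.round(luminance(transparentBlack, 255))); // 255
        System.out.println(Math.round(luminance(opaqueBlack, 255)));      // 0
    }
}
```

With this rule the two pixel types of the images above finally get different luminance values, so the precalculated grayscale image is no longer uniform.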

KilianB commented 5 years ago

Yes, choosing a different hash method won't make a difference, since the issue resides in the hash precalculation step. I see where you are coming from, and this is indeed an issue. Semantically there isn't a valid solution, I'm afraid: we can never know what color a missing (invisible) pixel would be. More often than not those pixels are displayed as white, as in the images above. I will add an option to let people choose how transparency is handled, which will fix your current issue.

fbarrella commented 5 years ago

That's awesome! I was actually going to come back with this exact answer! After a little research, I noticed how much the transparency affected the hash calculation by the API, and I ended up with the idea of simply drawing the original BufferedImage onto a white background and then generating a hash over that. The only problem I see is when the actual image is a white PNG icon. Maybe we could iterate on this solution to get to a good place.
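The workaround described above, drawing the transparent image onto a white canvas before hashing, can be done with plain Java2D; a sketch (independent of JImageHash's API):

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class FlattenTransparency {
    /** Draws the (possibly transparent) image onto an opaque canvas
     *  filled with the given background color. */
    static BufferedImage flatten(BufferedImage src, Color background) {
        BufferedImage flat = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = flat.createGraphics();
        g.setColor(background);
        g.fillRect(0, 0, src.getWidth(), src.getHeight());
        g.drawImage(src, 0, 0, null); // alpha-composites src over the fill
        g.dispose();
        return flat;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(2, 1, BufferedImage.TYPE_INT_ARGB);
        img.setRGB(0, 0, 0xFF000000); // opaque black
        img.setRGB(1, 0, 0x00000000); // fully transparent
        BufferedImage flat = flatten(img, Color.WHITE);
        // The transparent pixel now reads as opaque white
        System.out.printf("%08X %08X%n", flat.getRGB(0, 0), flat.getRGB(1, 0));
    }
}
```

The flattened image can then be passed to hash(BufferedImage) as usual; as noted, a white icon on a transparent background would still vanish against a white canvas.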

KilianB commented 5 years ago

Everything is pretty much implemented; I just need to run some unit tests to make sure I didn't mess up anything else. The heavy lifting is done in the utility code repository.

From now on you can define:

   HashingAlgorithm aHasher = new AverageHash(64);
   // Define how transparent pixels are handled
   double alphaThreshold = 0;
   aHasher.setOpaqueHandling(Color.white, alphaThreshold);

   // Proceed as normal

Will this suit you, or do you have any other ideas? By default I will retain the old behavior so as not to break backwards compatibility. For strictly black-and-white images with a transparent background, we can simply use an arbitrary color and handle both use cases.

fbarrella commented 5 years ago

OK, if I got it right, the hasher will by default treat the image as having a white background, while also letting me choose another color/threshold if needed, right? If so, that's amazing! It solves the problem, and we can get even more precise hashes! About the black-and-white images with no background: what if you derived the background color from the luminance of the predominant image color? That way we could avoid, as much as possible, making the user set the color manually!
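The idea of deriving the background from the image's own content could be prototyped along these lines: average the luminance of the visible pixels and pick a contrasting background. This is a hypothetical heuristic to illustrate the suggestion, not anything the library implements:

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

public class ContrastBackground {
    /** Picks white or black depending on the mean luminance of the
     *  visible (non-transparent) pixels, so a white icon on a
     *  transparent background would get a black canvas. */
    static Color pickBackground(BufferedImage img) {
        double sum = 0;
        int count = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int argb = img.getRGB(x, y);
                if ((argb >>> 24) == 0) continue; // skip invisible pixels
                int r = (argb >>> 16) & 0xFF, g = (argb >>> 8) & 0xFF, b = argb & 0xFF;
                sum += 0.299 * r + 0.587 * g + 0.114 * b;
                count++;
            }
        }
        // Bright content gets a dark background, and vice versa
        return (count > 0 && sum / count > 127.5) ? Color.BLACK : Color.WHITE;
    }

    public static void main(String[] args) {
        BufferedImage whiteIcon = new BufferedImage(2, 2, BufferedImage.TYPE_INT_ARGB);
        whiteIcon.setRGB(0, 0, 0xFFFFFFFF); // one opaque white pixel
        System.out.println(pickBackground(whiteIcon)); // dark background for a bright icon
    }
}
```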

KilianB commented 5 years ago

While refactoring the utility code I changed a few design decisions, which is taking a while longer than expected. I really wanted to get the new version released tonight, but sadly it will take a tiny bit longer.

fbarrella commented 5 years ago

Cool! Man, I would like to report a new unusual case after applying the solution of adding a white background to transparent images... For some reason my code generates equal hashes (once again) when hashing the two following images using new PerceptiveHash(32) (respectively, one being a transparent PNG and the other a regular JPEG):

(attached images: GLE 350D HIGHWAY - LATERAL 2, audi_vermelho)

Would you please try to hash them so we can check whether the anomaly is only on my side?