KilianB / JImageHash

Perceptual image hashing library used to match similar images
MIT License

Breaking hash values when moving to version 3.0.0 #34

Closed: anatolyra closed this issue 5 years ago

anatolyra commented 5 years ago

Hi,

I have tests that generate hash values and check the distance for several pairs of images. They are there to guard against breaking changes in your lib when I move to a new version. When I try to move from version 2.1.1 to version 3.0.0, these tests break: the images get different hash values than we expect. When you moved from version 1.x.x to version 2.x.x, the changelog noted that the algorithm had changed and that the breaking changes would affect the hash values. This time, I can't find anything like that. Was this intentional, or is it a bug?

Thanks!

KilianB commented 5 years ago

If memory serves me correctly:

Version 3.0.0

https://github.com/KilianB/JImageHash/blob/bbb54245b0dd21df80878930ed40dcde926bddc3/src/main/java/com/github/kilianB/hashAlgorithms/AverageHash.java#L88-L98

Version 2.X.X

https://github.com/KilianB/JImageHash/blob/ec728bf5a388d744cd2772c13c52e5197ad62298/src/main/java/com/github/kilianB/hashAlgorithms/AverageHash.java#L85-L95

The new version uses a hash builder, which is basically a byte array, to create hash values and speed up hash creation. The performance difference is especially noticeable at high bit resolutions, by some order of magnitude. We can simply do some bit shifting instead of creating an entirely new BigInteger object at each shift (BigInteger objects are immutable).
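To illustrate the difference, here is a minimal sketch of the two construction styles. It is not the library's actual code: the class and method names are hypothetical, and the byte layout (bit `i` stored at byte `i / 8`, position `i % 8`) is an assumption chosen for illustration.

```java
import java.math.BigInteger;

// Hypothetical sketch, not JImageHash's real HashBuilder.
final class HashBuildSketch {

    // Version-2 style: BigInteger is immutable, so every shift and add
    // allocates a new object.
    static BigInteger withBigInteger(boolean[] bits) {
        BigInteger hash = BigInteger.ZERO;
        for (boolean bit : bits) {
            hash = hash.shiftLeft(1);
            if (bit) {
                hash = hash.add(BigInteger.ONE);
            }
        }
        return hash;
    }

    // Version-3 style: one preallocated byte array, bits set in place.
    // Assumed layout: bit i lives in byte i / 8 at position i % 8.
    static byte[] withByteArray(boolean[] bits) {
        byte[] bytes = new byte[(bits.length + 7) / 8];
        for (int i = 0; i < bits.length; i++) {
            if (bits[i]) {
                bytes[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        return bytes;
    }
}
```

With this layout, the first bit written ends up as the least significant bit in the byte array, whereas the shift-and-add loop leaves it as the most significant bit.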

Sadly, this resulted in the bit order being reversed:

Version 2: 000111
Version 3: 111000

A bit of thought was put into keeping the order consistent, but there wasn't a straightforward way to do this. (The difference hash concatenates hashes, so its order is even more mixed up.)
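The reversal itself can be sketched as follows. The helper below is hypothetical and only mirrors the bit order of a fixed-length value. It does not turn a stored version 2 hash into a valid version 3 hash, because the pixel values feeding the hash changed as well and the difference hash is not a plain reversal.

```java
import java.math.BigInteger;

// Hypothetical helper: mirrors the lowest bitLength bits of a value.
final class BitOrderSketch {

    static BigInteger reverseBits(BigInteger value, int bitLength) {
        BigInteger reversed = BigInteger.ZERO;
        for (int i = 0; i < bitLength; i++) {
            if (value.testBit(i)) {
                // Bit i moves to the mirrored position.
                reversed = reversed.setBit(bitLength - 1 - i);
            }
        }
        return reversed;
    }
}
```

For the 6-bit example above, `reverseBits(new BigInteger("000111", 2), 6)` yields the value whose binary form is `111000`.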

Additionally, the average hash algorithm underwent a few changes. Instead of the average grayscale value ((r + g + b) / 3), luminosity is now used.
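A small sketch shows why this alone changes the resulting hashes. The luminosity weights below are the common Rec. 601 luma coefficients, which is an assumption; the exact formula used by the library may differ, and the class and method names are hypothetical.

```java
// Hypothetical sketch comparing the two gray conversions for a packed RGB pixel.
final class GraySketch {

    // Plain channel average: (r + g + b) / 3.
    static int averageGray(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3;
    }

    // Weighted luminosity; Rec. 601 weights are assumed here.
    static int luminosity(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
    }
}
```

Under these assumptions, pure green (`0x00FF00`) and pure blue (`0x0000FF`) have the same channel average but very different luminosity, so a pixel can end up on the other side of the image mean and flip its hash bit.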

Filters/kernels were added, and they need to be accounted for in the algorithm id. The algorithm id changed because the Kernel[] array, even if empty, takes part in the hashCode:

https://github.com/KilianB/JImageHash/blob/bbb54245b0dd21df80878930ed40dcde926bddc3/src/main/java/com/github/kilianB/hashAlgorithms/HashingAlgorithm.java#L224-L232
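The effect of an empty array on a hash code can be shown with a small sketch. The field set and combination below are hypothetical and do not reproduce the linked `HashingAlgorithm` code; they only assume the usual multiply-by-31 scheme together with `Arrays.hashCode`.

```java
import java.util.Arrays;

// Hypothetical sketch: an id with and without an array field.
final class AlgorithmIdSketch {

    // Id built from a name and a bit resolution only.
    static int idWithoutKernels(String name, int bitResolution) {
        int result = 1;
        result = 31 * result + name.hashCode();
        result = 31 * result + bitResolution;
        return result;
    }

    // Same id, with an array folded in as one more field.
    static int idWithKernels(String name, int bitResolution, Object[] kernels) {
        int result = idWithoutKernels(name, bitResolution);
        // Arrays.hashCode returns 1 for an empty array, not 0.
        result = 31 * result + Arrays.hashCode(kernels);
        return result;
    }
}
```

Even with `new Object[0]`, the second id equals `31 * id + 1` rather than `id`, so any id persisted by an older version no longer matches.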

@anatolyra since we are already talking, may I ask what you are using this project for? Currently we are considering making this project part of my thesis, and I am looking for use cases and anything else that could help me decide in which direction to take this library. Do we want more clustering, motion tracking, watermarking, data embedding ...? Other options are taking a deeper look at the algorithms (e.g. does the triple precision of the difference hash really increase the detection quality?) and implementing a few new rotation-invariant hashes according to papers. I am also looking at an in-depth evaluation of all proposed algorithms to create a comparison table: which algorithm is suited for which situation?