Unable to generate pHash hashes that match other implementations

micahsnyder commented 3 years ago

I'm hoping to use your crate to generate pHash hashes, but I've been having some trouble. My current test program looks like this.

I have a teammate who generated a large collection of perceptual hashes. I want to reproduce those hashes for the same files using your library. I tried several variations with your API and couldn't get it to work.

My next step was to verify that I could at least reproduce those hashes with a different pHash library.

Initially I thought that my teammate was using the ph_dct_imagehash() API from the original (GPL) pHash project. I customized one of the pHash project example programs to print the DCT pHash for a single file. I was unable to generate hashes with img_hash that match the hashes created by pHash's ph_dct_imagehash().

I spoke with the teammate and learned that he's actually using Johannes Buchner's Python-based (BSD) imagehash library. Specifically, he's using Johannes' DCT phash() function. To test it, I made this prototype to create phash() hashes with imagehash. Unfortunately, I was also unable to get img_hash to produce hashes that match those generated by imagehash's phash() function.

I suppose you're primarily interested in reproducing the algorithms from the original pHash library (without reading their GPL source). But my motivation is to calculate hashes in Rust that match those that my teammate generated using Johannes Buchner's phash() function. I'm wondering if you'd be able to help me use your library to do this.

One thing I'm thinking is that maybe img_hash should provide a way to use Median instead of Mean for the hash algorithm. From your documentation:

Krawetz describes a "pHash" algorithm which is equivalent to Mean + DCT preprocessing here: http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html

But the comments on that post say that pHash actually uses Median. Johannes' library also appears to use Median for phash() (he uses Mean for phash_simple(), but we're not using phash_simple()).
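To make the Mean versus Median difference concrete, here is a minimal sketch of the thresholding step only. The coefficient values are made up for illustration, `threshold_bits` is a hypothetical helper rather than part of img_hash or imagehash, and the strict `>` comparison follows what imagehash's phash() appears to do.

```python
from statistics import mean, median


def threshold_bits(coeffs, use_median):
    """Turn a flat list of DCT coefficients into hash bits."""
    cutoff = median(coeffs) if use_median else mean(coeffs)
    # Strict '>' is an assumption modeled on imagehash's phash().
    return [1 if c > cutoff else 0 for c in coeffs]


# Made-up coefficients: one large low-frequency value drags the mean
# well above the median, so the two thresholds disagree.
coeffs = [120.0, 3.0, -2.0, 5.0, -7.0, 1.0, 4.0, -1.0]

mean_bits = threshold_bits(coeffs, use_median=False)
median_bits = threshold_bits(coeffs, use_median=True)

print("mean   :", mean_bits)
print("median :", median_bits)
```

With a single dominant coefficient the mean threshold sets far fewer bits than the median threshold, so the two variants cannot be expected to agree on real images either.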

To test using Median instead, I tried adding HashAlg::Median to img_hash (https://github.com/micahsnyder/img_hash/commit/6f5f603f7a218239f52d4ebbc806b44cca212bdc). I'm not certain that my Median code is correct, which is why I haven't submitted a PR to you. And my overall test program using img_hash::HashAlg::Median still didn't result in matching hashes.

Would you be willing to look at Johannes' phash() implementation, compare it with yours, and help me figure out what I'm doing wrong?

micahsnyder commented 3 years ago

@abonander Friendly ping

abonander commented 3 years ago

I guess I don't specifically note it anywhere, but this crate was never intended to reproduce the output of other libraries. With the exception of Blockhash, none of the other image hashing algorithms have an exact specification so they may differ in various implementation details.

It looks like imagehash uses a different DCT size, multiplying the hash width and height by a factor of 4 whereas img_hash multiplies by a factor of 2. The blog post also appears to use a factor of 4 so this was probably just an implementation mistake on my part, but changing it now would be a breaking change as it would make existing hash values non-reproducible.
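As a rough illustration of why the DCT size matters, the sketch below hashes the same synthetic pattern sampled at hash_size * 4 and at hash_size * 2 and counts the differing bits. Everything here is an assumption for illustration: `synthetic_image` stands in for resizing a real image, the naive DCT-II uses the unnormalized scaling that scipy.fftpack.dct applies by default, and the median threshold mirrors imagehash's phash() rather than img_hash.

```python
import math
from statistics import median


def dct_1d(values):
    """Unnormalized DCT-II (the default scaling of scipy.fftpack.dct)."""
    n = len(values)
    return [
        2.0 * sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                  for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(pixels):
    """Apply the 1D DCT along every row, then along every column."""
    rows = [dct_1d(row) for row in pixels]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def synthetic_image(side):
    """Hypothetical stand-in for one image resized to side x side grayscale."""
    return [
        [128 + 100 * math.sin(3.0 * x / side) * math.cos(5.0 * y / side)
         for x in range(side)]
        for y in range(side)
    ]


def dct_hash_bits(hash_size, factor):
    side = hash_size * factor  # 32 for factor 4, 16 for factor 2
    coeffs = dct_2d(synthetic_image(side))
    # Keep only the low-frequency top-left hash_size x hash_size block.
    low = [coeffs[y][x] for y in range(hash_size) for x in range(hash_size)]
    cutoff = median(low)
    return [1 if c > cutoff else 0 for c in low]


bits_x4 = dct_hash_bits(8, 4)  # factor reported for imagehash
bits_x2 = dct_hash_bits(8, 2)  # factor reported for img_hash
differing = sum(a != b for a, b in zip(bits_x4, bits_x2))
print(f"{differing} of {len(bits_x4)} bits differ between the two DCT sizes")
```

The low-frequency block has the same shape in both cases, but it is cut from transforms of different sizes, so the coefficients and the resulting bits need not line up.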

If changing that and using Median instead of Mean doesn't give you the values you're looking for, I dunno what else there would be to try. The rest of the imagehash implementation looks identical, even using Lanczos resampling (which imagehash selects through Pillow's ANTIALIAS alias).

With how lossy and nonlinear the whole image hashing process is, debugging why two implementations don't exactly produce the same output seems like a fool's errand to me, honestly.

It could also be that the image crate uses slightly different values for reducing RGB pixels to Luma, although I think they use the sRGB standard. Or that scipy uses different normalization factors for their DCT than rust-dct.
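The grayscale conversion point can be sketched as follows. Pillow documents its "L" conversion as using the ITU-R 601-2 weights. The Rec. 709 weights shown for the image crate are an assumption based on the sRGB remark above, not something confirmed in this thread, and `luma` is a hypothetical helper.

```python
# Pillow's documented convert("L") weights (ITU-R 601-2).
REC601 = (0.299, 0.587, 0.114)
# Rec. 709 weights, commonly paired with sRGB; assumed for the image crate.
REC709 = (0.2126, 0.7152, 0.0722)


def luma(rgb, weights):
    """Weighted sum of the R, G, B channels, rounded to an 8-bit gray value."""
    r, g, b = rgb
    wr, wg, wb = weights
    return round(wr * r + wg * g + wb * b)


pure_green = (0, 255, 0)
print("Rec. 601:", luma(pure_green, REC601))
print("Rec. 709:", luma(pure_green, REC709))
```

If the two libraries really do use different weights, every saturated pixel enters the DCT with a different gray value, which is enough to flip bits that sit close to the threshold.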

micahsnyder commented 3 years ago

Thanks for the analysis @abonander! I didn't notice or try adjusting the DCT size. I'll give it a try as soon as I'm able to switch tasks.

"With how lossy and nonlinear the whole image hashing process is, debugging why two implementations don't exactly produce the same output seems like a fool's errand to me, honestly."

I too am concerned that there may still be minor differences across different image types resulting in different values. I'm working with my teammate to see if it would be feasible for his team to switch to using a CLI tool based on your library, rather than using imagehash. That feels less error prone overall. But I haven't yet received his approval. Maybe I should ask you first: "Is this a project you intend to maintain long term? Would you be comfortable with people depending on this in production environments?"

abonander commented 3 years ago

It's relatively stable but I dunno about depending on it in production. I unfortunately don't have time to maintain it these days.

abonander / img_hash

Unable to generate pHash hashes that match other implementations #45