Open micahsnyder opened 3 years ago
@abonander Friendly ping
I guess I don't specifically note it anywhere, but this crate was never intended to reproduce the output of other libraries. With the exception of Blockhash, none of the other image hashing algorithms have an exact specification so they may differ in various implementation details.
It looks like `imagehash` uses a different DCT size, multiplying the hash width and height by a factor of 4, whereas `img_hash` multiplies by a factor of 2. The blog post also appears to use a factor of 4, so this was probably just an implementation mistake on my part, but changing it now would be a breaking change as it would make existing hash values non-reproducible.
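To make the size-factor point concrete, here is a minimal pure-Python sketch of a DCT-based hash. It is not the code of either library: the helper names (`dct_1d`, `dct_2d`, `dct_hash_bits`, `sample`) are hypothetical, the DCT is a naive unnormalized DCT-II, and the "image" is a synthetic pattern sampled at two resolutions instead of a real image resized with Lanczos. It only shows that the same picture fed in at `hash_size * 2` versus `hash_size * 4` goes through a different transform before the low-frequency block is thresholded.

```python
import math
from statistics import median


def dct_1d(values):
    """Naive unnormalized DCT-II: y[k] = 2 * sum_n x[n] * cos(pi*k*(2n+1)/(2N))."""
    n = len(values)
    return [
        2.0 * sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                  for i, x in enumerate(values))
        for k in range(n)
    ]


def dct_2d(pixels):
    """Apply the 1-D DCT to every column, then to every row."""
    cols = [dct_1d(list(col)) for col in zip(*pixels)]  # transform columns
    rows = [list(row) for row in zip(*cols)]            # back to row-major
    return [dct_1d(row) for row in rows]


def dct_hash_bits(pixels, hash_size, use_median=True):
    """Threshold the top-left hash_size x hash_size block of DCT coefficients."""
    coeffs = dct_2d(pixels)
    flat = [c for row in coeffs[:hash_size] for c in row[:hash_size]]
    threshold = median(flat) if use_median else sum(flat) / len(flat)
    return [c > threshold for c in flat]


def sample(side):
    """Hypothetical stand-in for a grayscale image resized to side x side."""
    return [
        [255.0 * (0.5 + 0.5 * math.sin(7.0 * (x + 0.5) / side
                                       + 3.0 * ((y + 0.5) / side) ** 2))
         for x in range(side)]
        for y in range(side)
    ]


hash_size = 8
bits_x2 = dct_hash_bits(sample(hash_size * 2), hash_size)  # factor of 2
bits_x4 = dct_hash_bits(sample(hash_size * 4), hash_size)  # factor of 4
differing = sum(a != b for a, b in zip(bits_x2, bits_x4))
print(f"{differing} of {len(bits_x2)} bits differ between the two factors")
```

How many bits differ depends on the image, so even a single flipped bit is enough to break exact hash equality between two implementations.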
If changing that and using Median instead of Mean doesn't give you the values you're looking for, I dunno what else there would be to try. The rest of the `imagehash` implementation looks identical, even using Lanczos (which `imagehash` aliases to `ANTI_ALIAS`).
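As a rough illustration of why Median versus Mean matters, the sketch below thresholds the same list of coefficients both ways. The coefficient values are made up for the example (one large term next to several small ones, as tends to happen with a DC component), and `threshold_bits` is a hypothetical helper, not part of `img_hash` or `imagehash`.

```python
from statistics import mean, median


def threshold_bits(coeffs, reducer):
    """Set a bit for every coefficient strictly above reducer(coeffs)."""
    t = reducer(coeffs)
    return [c > t for c in coeffs]


# Made-up coefficients: one dominant value plus small positive/negative terms.
coeffs = [1000.0, 1.0, 2.0, 3.0, -1.0, -2.0, -3.0, 0.0]

mean_bits = threshold_bits(coeffs, mean)      # threshold is 125.0
median_bits = threshold_bits(coeffs, median)  # threshold is 0.5

print(sum(mean_bits))    # 1: only the dominant coefficient clears the mean
print(sum(median_bits))  # 4: half of the coefficients clear the median
```

A single large coefficient drags the mean far above the rest, while the median always splits the values roughly in half, so the two choices can produce very different bit patterns from identical input.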
With how lossy and nonlinear the whole image hashing process is, debugging why two implementations don't exactly produce the same output seems like a fool's errand to me, honestly.
It could also be that the `image` crate uses slightly different values for reducing RGB pixels to Luma, although I think they use the sRGB standard. Or that `scipy` uses different normalization factors for their DCT than `rust-dct`.
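To show how much the Luma reduction alone can move a pixel, this small sketch applies two common weight sets to the same RGB value. Pillow documents the Rec. 601 weights for its `L` conversion; which weights the `image` crate actually applies is not verified here, so the Rec. 709 (sRGB primaries) set is only an assumption for comparison. The `luma` helper and the sample pixel are hypothetical.

```python
def luma(rgb, weights):
    """Weighted sum of R, G, B rounded to the nearest integer gray level."""
    r, g, b = rgb
    wr, wg, wb = weights
    return round(wr * r + wg * g + wb * b)


REC601 = (0.299, 0.587, 0.114)     # weights Pillow documents for mode "L"
REC709 = (0.2126, 0.7152, 0.0722)  # assumed alternative (sRGB / Rec. 709)

pixel = (200, 30, 90)  # arbitrary example colour

print(luma(pixel, REC601))  # 88
print(luma(pixel, REC709))  # 70
```

A gap of this size in the grayscale input carries through the resize and DCT, which is one way two otherwise identical pipelines could end up with different bits.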
Thanks for the analysis @abonander! I didn't notice or try adjusting the DCT size. I'll give it a try as soon as I'm able to switch tasks.
> With how lossy and nonlinear the whole image hashing process is, debugging why two implementations don't exactly produce the same output seems like a fool's errand to me, honestly.
I too am concerned that there may still be minor differences across different image types resulting in different values. I'm working with my teammate to see if it would be feasible for his team to switch to use a CLI tool based on your library, rather than using `imagehash`. That feels less error prone overall. But I haven't yet received his approval. Maybe I should ask you first -- "Is this a project you intend to maintain long term? Would you be comfortable with people depending on this in production environments?"
It's relatively stable but I dunno about depending on it in production. I unfortunately don't have time to maintain it these days.
I'm hoping that I can use your crate to generate phash hashes but I've been having some trouble with this. My current test program looks like this.
I have a teammate that generated a large collection of perceptual hashes. I want to reproduce those hashes for the same files using your library. I tried some different variations with your API and couldn't get it to work.
My next step was to verify that I could at least reproduce those hashes with a different pHash library.
Initially I thought that my teammate was using the `ph_dct_imagehash()` API from the original (GPL) `pHash` project. I customized one of the pHash project example programs to print the DCT pHash for a single file. I was unable to generate hashes with `img_hash` that match the hashes created by `pHash`'s `ph_dct_imagehash()`.

I spoke with the teammate and learned that they're actually using Johannes Buchner's Python-based (BSD) `imagehash` library. Specifically, he's using Johannes' DCT `phash()` function. To test it, I made this prototype to create `phash()` hashes with `imagehash`. Unfortunately, I was also unable to use `img_hash` to produce hashes that match those generated by the `phash()` function in Johannes' `imagehash`.

I suppose you're primarily interested in reproducing the algorithms from the original pHash library (without reading their GPL source). But my motivation is to calculate hashes in Rust that match those that my teammate generated using Johannes Buchner's `phash()` function. I'm wondering if you'd be able to help me use your library to do this.

One thing I'm thinking is that maybe `img_hash` should provide a way to use Median instead of Mean for the hash algorithm. From your documentation:

But if you read the comments it says that pHash actually uses Median. Johannes' library also appears to use Median for `phash()` (though he uses Mean for `phash_simple()`, but we're not using `phash_simple()`).

To try to test using Median instead, I made an effort to add `HashAlg::Median` to `img_hash`: https://github.com/micahsnyder/img_hash/commit/6f5f603f7a218239f52d4ebbc806b44cca212bdc

I'm not certain that my Median code is correct, which is why I haven't submitted a PR to you. And my overall test program using `img_hash::HashAlg::Median` still didn't result in matching hashes.

Would you be willing to look at Johannes' `phash()` implementation, compare, and help me figure out what I'm doing wrong?