JohannesBuchner / imagehash

A Python Perceptual Image Hashing Module
BSD 2-Clause "Simplified" License

Specific video hashing results in imagehash vs goimagehash differ #133

Closed cooperdk closed 3 years ago

cooperdk commented 3 years ago

We are using imagehash and goimagehash to create a phash of video files.

(Please read my comment below as it is important)

We generate 25 screenshots with ffmpeg (using the same command in both scripts) and combine them with Pillow (7.2.2 seems to give the smallest hamming distance).

The frames are combined without any additional modification: they are simply pasted by column and row, left to right, top to bottom. An example is attached.
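A minimal sketch of that kind of sprite assembly with Pillow, for readers who want to reproduce the setup. The 5x5 grid, the frame size, the function name `build_sprite`, and the synthetic solid-colour frames are assumptions made for illustration; the actual script is not shown in this thread.

```python
from PIL import Image


def build_sprite(frames, columns=5):
    """Paste equally sized frames into a grid, left to right, top to bottom."""
    if not frames:
        raise ValueError("need at least one frame")
    width, height = frames[0].size
    rows = -(-len(frames) // columns)  # ceiling division
    sheet = Image.new("RGB", (columns * width, rows * height))
    for index, frame in enumerate(frames):
        row, col = divmod(index, columns)
        # paste() copies the pixels into place; nothing is resized here
        sheet.paste(frame, (col * width, row * height))
    return sheet


# Hypothetical stand-ins for the 25 extracted frames (solid colours, 16x12 px)
frames = [Image.new("RGB", (16, 12), (i * 10, 0, 0)) for i in range(25)]
sprite = build_sprite(frames)
print(sprite.size)
```

In this sketch every frame keeps its original pixel values in the sprite, which can be checked by reading pixels back with `getpixel`.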

We operate with BMP files in order to avoid any compression artifacts.

Our scripts extract video frames at EXACTLY the same time.

We get a hamming distance of 4 for this 72-second video. Even though the images are similar, there are artifacts around high-contrast edges in the Python script's output which we don't see in the Go version, and we're wondering if that is enough to give different hash results (we haven't tried grayscaling the images or anything like that yet).

I am attaching two sprites, one from each script. The video is from the Internet Archive. https://archive.org/download/donaldtrumpxxxparodydogfart/Donald%20Trump%20XXX%20Parody%20-%20Dogfart.mp4

The phash results are 82207fd3a17cbd16 (python imagehash) and 82207ed1a57cbf16 (goimagehash). That's a hamming distance of 4 which is quite a lot.
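For reference, the distance between the two hex digests quoted above can be checked with a few lines of plain Python. This is only a sketch; the helper name `hamming_distance` is made up for this example and is not part of either library.

```python
def hamming_distance(hex_a, hex_b):
    """Count the differing bits between two equal-length hex digests."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


python_hash = "82207fd3a17cbd16"  # imagehash result quoted above
go_hash = "82207ed1a57cbf16"      # goimagehash result quoted above
print(hamming_distance(python_hash, go_hash))  # prints 4
```

imagehash's own hash objects also support subtraction (`hash_a - hash_b`), which returns the same kind of bit count.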

With many videos, we do get the same results. For example: https://archive.org/download/rebeccalordsexunderhotlightsxnxx.com/Rebecca%20Lord%20-%20Sex%20Under%20Hot%20Lights%20-%20XNXX.COM.mp4

Both videos are public domain and they are censored so there is no explicit material.

Do you have any idea whether Pillow or the Go imaging library (which is used in that script) could be at fault? (Not unlikely!)

sprites.zip

cooperdk commented 3 years ago

One more thing: I just computed a phash of both sprites on my computer with imagehash (I don't use Go; that's my buddy).

When he calculates his sprite in goimagehash, he gets the value above. (82207ed1a57cbf16)

When I hash his sprite with imagehash, I get the same hash as with my own sprite!

> >>> import imagehash
> >>> from PIL import Image
> >>> phash = imagehash.phash(Image.open("temp\\phash_file.bmp"))
> >>> print(phash)
> 82207fd3a17cbd16
> >>> phash = imagehash.phash(Image.open("temp\\goimagehash.bmp"))
> >>> print(phash)
> 82207fd3a17cbd16

This can only mean that your libraries don't hash the same values...

Here are the two hashes in binary (imagehash first, then goimagehash):

10000010 00100000 01111111 11010011 10100001 01111100 10111101 00010110

10000010 00100000 01111110 11010001 10100101 01111100 10111111 00010110
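To see where the two digests disagree rather than just how much, a short sketch like the following converts the hex hashes quoted earlier to per-byte binary and lists the bytes that differ. The helper names are hypothetical and exist only for this example.

```python
def hex_to_binary(hex_hash):
    """Render a hex digest as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in bytes.fromhex(hex_hash))


def differing_bytes(hex_a, hex_b):
    """Return (byte index, XOR mask) for every byte that differs."""
    pairs = zip(bytes.fromhex(hex_a), bytes.fromhex(hex_b))
    return [(i, f"{x ^ y:08b}") for i, (x, y) in enumerate(pairs) if x != y]


python_hash = "82207fd3a17cbd16"  # imagehash
go_hash = "82207ed1a57cbf16"      # goimagehash
print(hex_to_binary(python_hash))
print(hex_to_binary(go_hash))
for index, mask in differing_bytes(python_hash, go_hash):
    print(f"byte {index}: differing bits {mask}")
```

Each differing byte here has a single set bit in its XOR mask, so the mismatches are isolated bits spread over four bytes rather than one cluster.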

JohannesBuchner commented 3 years ago

phash isn't a uniquely defined procedure. There may also be anti-aliasing differences from the underlying libraries (as when swapping out Pillow with opencv in #130).
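To illustrate how two implementations of "the same" hash can disagree on a few bits, here is a pure-Python sketch of a DCT-based hash in the spirit of pHash. It is not the code of imagehash or goimagehash. The function names, the naive DCT, the synthetic input, and the `use_median` switch are all assumptions made for illustration. The resize-to-32x32 step, where resampling and anti-aliasing filters can differ between libraries, is deliberately left out and replaced by a ready-made 32x32 grayscale matrix.

```python
import math
from statistics import median


def dct_1d(values):
    """Naive, unnormalised DCT-II of a sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(matrix):
    """Apply the 1-D DCT to every row, then to every column."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major


def dct_hash_bits(gray, hash_size=8, use_median=True):
    """Threshold the low-frequency hash_size x hash_size DCT block into bits."""
    coeffs = dct_2d(gray)
    low = [coeffs[y][x] for y in range(hash_size) for x in range(hash_size)]
    # Hypothetical degree of freedom: median vs. mean as the threshold
    threshold = median(low) if use_median else sum(low) / len(low)
    return [1 if c > threshold else 0 for c in low]


# Synthetic 32x32 grayscale "image" standing in for a downscaled sprite
gray = [[(x * 7 + y * 13 + x * y) % 256 for x in range(32)] for y in range(32)]
bits_median = dct_hash_bits(gray, use_median=True)
bits_mean = dct_hash_bits(gray, use_median=False)
print(sum(a != b for a, b in zip(bits_median, bits_mean)), "bits differ")
```

The threshold choice is just one example of a detail that a description of the algorithm may leave open; the resampling filter and grayscale conversion are others.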

cooperdk commented 3 years ago

That's true. But imagehash's phash and goimagehash are both based on the same article, as far as I understand. We need modules that produce matches that are as close as possible, using goimagehash (for one application) and imagehash (for my project).

I'll need to fork this to make that work perfectly (in almost every case, there are still minor discrepancies).

BTW, IMO, a hash specification actually should produce the same results no matter what tool you use. Still, a hamming distance of 2 bits in the cases where goimagehash and imagehash don't match ... is very little.

JohannesBuchner commented 3 years ago

If you are hashing video files, you can also ask ffmpeg to produce down-scaled images for you directly. This will be much faster and more space-efficient, and it avoids any down-sampling in subsequent tools.
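A rough sketch of what such an ffmpeg call could look like when built from Python. The helper names, the 160-pixel width, the evenly spaced timestamps, and the file names are assumptions for illustration only; the commands actually used by the two scripts are not shown in this thread. The sketch only builds the argument list and does not run ffmpeg.

```python
def frame_timestamps(duration, count=25):
    """Evenly spaced timestamps that skip the very start and end (an assumption)."""
    step = duration / (count + 1)
    return [step * (i + 1) for i in range(count)]


def ffmpeg_frame_command(video, timestamp, out_path, width=160):
    """Argument list that grabs one frame and scales it inside ffmpeg."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{timestamp:.3f}",     # seek to the frame position (seconds)
        "-i", video,
        "-frames:v", "1",              # write a single video frame
        "-vf", f"scale={width}:-1",    # -1 keeps the aspect ratio
        out_path,
    ]


# Hypothetical 72-second input, BMP output as in the original report
commands = [
    ffmpeg_frame_command("input.mp4", t, f"frame_{i:02d}.bmp")
    for i, t in enumerate(frame_timestamps(72.0))
]
print(" ".join(commands[0]))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.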

cooperdk commented 3 years ago

We actually do resize the screencaps with ffmpeg. It seems that Pillow/PIL antialiases files on write, even though it does nothing but paste other images onto a canvas. The hamming distance is small: AFAIK it could in general be as much as 6 to 8 for a perceptually identical video, but we haven't seen a distance of more than 2 (with a modified imagehasher) or 4 (with the original module).