Add support for ISCC content hash for images (ie. phash variation)

zache-fi commented 2 months ago

This feature request is to be able to calculate the ISCC content hash for images using the Imagehash library so that they can be compared easily.

International Standard Content Code (ISCC) is a proposed standard content identifier for text, images, audio, and video. Its codes contain three independent 64-bit similarity hash blocks (Metadata, Content, Binary) and a file checksum. All four can be compared independently. An ISO 24138:2024 proposal was published in 2024-05. The reference code and SDK are under the Apache licence.

Home pages

Imagehashing part

ISCC content hashing for images is similar to the Imagehash library's phash, but adds preprocessing steps to ensure that the data is in a uniform format. It also doesn't use NumPy in its core, but uses Pillow for preprocessing the image (scaling, cropping, grayscaling, etc.), and ISCC uses BICUBIC for scaling, whereas the Imagehash library uses LANCZOS.

I made a quick test: when steps 1-3 were disabled and resizing was done using Pillow's LANCZOS instead of BICUBIC, ISCC returned values identical to the Imagehash library's phash.

Steps for calculating ISCC content hash for images

1. Transpose image according to EXIF Orientation
2. Add white background to image if it has alpha transparency
3. Crop empty borders of image
4. Convert image to grayscale
5. Resize image to 32x32 using BICUBIC
6. Flatten 32x32 matrix to an array of 1024 grayscale (uint8) pixel values
7. Compute the 32x32 DCT
8. Keep the top-left 8x8 of DCT (lowest frequencies)
9. Compute the median DCT value
10. Set the 64 hash bits to 0 or 1 depending on whether each of the 64 DCT values is above or below the median value
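
The last five steps (flatten, DCT, keep the 8x8 corner, median, threshold) can be sketched in pure Python without NumPy. This is only an illustration, not the ISCC reference code: the function names `dct_lowfreq` and `dct_median_hash` are hypothetical, the DCT is an unnormalised DCT-II, and the bit order (row-major, most significant bit first) and the strict "greater than the median" comparison are assumptions, so the output should not be expected to match the reference implementation bit for bit.

```python
import math
from statistics import median

N = 32      # side length of the resized grayscale image
KEEP = 8    # side length of the low-frequency block that is kept


def dct_lowfreq(pixels):
    """Return the top-left KEEP x KEEP block of the 2-D DCT-II of an N x N matrix."""
    # Unnormalised DCT-II basis for the first KEEP frequencies only.
    basis = [[math.cos(math.pi * k * (2 * n + 1) / (2 * N)) for n in range(N)]
             for k in range(KEEP)]
    # Transform along each row, keeping the KEEP lowest horizontal frequencies.
    rows = [[sum(row[x] * basis[u][x] for x in range(N)) for u in range(KEEP)]
            for row in pixels]
    # Transform along each column, keeping the KEEP lowest vertical frequencies.
    return [[sum(rows[y][u] * basis[v][y] for y in range(N)) for u in range(KEEP)]
            for v in range(KEEP)]


def dct_median_hash(flat_pixels):
    """Turn 1024 grayscale values into a 64-bit integer hash (assumed bit order)."""
    if len(flat_pixels) != N * N:
        raise ValueError("expected 1024 pixel values")
    # Rebuild the 32x32 matrix from the flat array.
    matrix = [flat_pixels[i:i + N] for i in range(0, N * N, N)]
    coeffs = [c for row in dct_lowfreq(matrix) for c in row]
    med = median(coeffs)
    value = 0
    for c in coeffs:
        # Bit is 1 when the coefficient is above the median, otherwise 0.
        value = (value << 1) | (1 if c > med else 0)
    return value
```

Because only positive scaling separates an unnormalised DCT from a normalised one, the comparison against the median gives the same bits either way.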

Source code

Related investigation ticket in Wikimedia's phabricator

https://phabricator.wikimedia.org/T373285

JohannesBuchner commented 2 months ago

I think something similar was discussed at https://github.com/JohannesBuchner/imagehash/pull/110

I would be happy to include this, but I would suggest making it a preprocessor helper function, and add to the README how to use it, and recommend it there.

Regarding the steps:

"Transpose image according to EXIF Orientation": good.
"Add white background to image if it has alpha transparency": I am not sure every user will want that; maybe the background color can be an argument.
"Crop empty borders of image": what is "empty"? Using the background color? What happens if the image is all white? I am not sure every user will want that.
"Convert image to grayscale", "Resize image to 32x32 using BICUBIC": I think these are implemented in each hash function (maybe not necessarily with bicubic), and can stay there. That does not prevent users from first preprocessing themselves with these steps, then passing the result to the hash functions, in which they probably would not do any further preprocessing.

Then I think we could add an iscc_content_hash function which combines the blocks above.

zache-fi commented 2 months ago

When default settings are used, the target should be to follow the ISCC so that the result is identical to the reference implementation. Beyond that, it is a good idea to allow the user to use one's own settings via parameters.

"Crop empty borders of image": what is "empty"? using the background color? what happens if the image is all white? I am not sure every user will want that.

It uses pixel 0,0 as the background color, uses it as a mask, and then uses Pillow's getbbox() to get the bounding box inside the mask. getbbox() returns None if it fails (the image is completely empty) and then ... I also think that it is hard to implement well, but if it is in the spec, then follow the reference implementation and make it optional/adjustable via parameters.
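
To make the discussion concrete, here is a minimal sketch of what such an optional preprocessing helper could look like with Pillow. It is not the ISCC reference implementation, and the names `iscc_style_preprocess`, `grayscale_32x32_pixels`, `background`, and `crop_borders` are hypothetical. It assumes that "empty" means "equal to the color of pixel (0, 0)" and that an image with nothing to crop is returned unchanged.

```python
from PIL import Image, ImageChops, ImageOps


def iscc_style_preprocess(image, background=(255, 255, 255), crop_borders=True):
    """Hypothetical helper: EXIF transpose, flatten alpha, optionally trim borders."""
    # Apply the EXIF Orientation tag, if there is one.
    image = ImageOps.exif_transpose(image)

    # Composite onto a solid background when the image carries transparency.
    if image.mode in ("RGBA", "LA") or "transparency" in image.info:
        rgba = image.convert("RGBA")
        canvas = Image.new("RGBA", rgba.size, tuple(background) + (255,))
        image = Image.alpha_composite(canvas, rgba).convert("RGB")
    elif image.mode not in ("L", "RGB"):
        # Normalise other modes so the difference operation below is well defined.
        image = image.convert("RGB")

    if crop_borders:
        # Treat the color of pixel (0, 0) as "empty" and crop to what differs.
        border = Image.new(image.mode, image.size, image.getpixel((0, 0)))
        bbox = ImageChops.difference(image, border).getbbox()
        # getbbox() returns None when every pixel equals the border color
        # (for example an all-white image); the image is then left unchanged.
        if bbox is not None:
            image = image.crop(bbox)
    return image


def grayscale_32x32_pixels(image):
    """Convert to grayscale, resize to 32x32 with BICUBIC, return 1024 ints."""
    small = image.convert("L").resize((32, 32), Image.BICUBIC)
    return list(small.tobytes())
```

With the defaults, an all-white image passes through uncropped, and passing `crop_borders=False` or a different `background` lets users opt out of the parts they do not want.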

About resampling methods: If the ISCC spec/reference implementation uses Pillow's BICUBIC, then the resampling function should be the same. Using a different one will add extra "noise" to the results, which should be avoided if possible.

In some use cases, getting exact hashes matters. For example, I am currently using MariaDB as a database with combined dhash+phash hashes, where one of the two needs to be an exact match. With this, I can query them against DB indexes very fast, and using dual hashes significantly reduces false negatives. Because of that, I am investigating how to implement Pillow's LANCZOS (or BICUBIC in this case) in Java to get rid of the extra noise when phash is calculated using Java code.

SELECT i1.page_id, i2.page_id
FROM imagehash AS i1, imagehash AS i2
WHERE i1.page_id = PAGE_ID
AND i1.page_id != i2.page_id
AND
(
   ( i1.phash = i2.phash AND BIT_COUNT(i1.dhash ^ i2.dhash) < 4 )
   OR
   ( i1.dhash = i2.dhash AND BIT_COUNT(i1.phash ^ i2.phash) < 4 )
)
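
For readers who prefer to see the matching rule outside SQL, the same condition can be sketched in Python. The function names `hamming_distance` and `is_dual_hash_match` and the `max_distance` parameter are hypothetical; the hashes are assumed to be stored as plain integers.

```python
def hamming_distance(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def is_dual_hash_match(phash_a, dhash_a, phash_b, dhash_b, max_distance=4):
    """One hash must match exactly; the other must differ in fewer than max_distance bits."""
    return (
        (phash_a == phash_b and hamming_distance(dhash_a, dhash_b) < max_distance)
        or (dhash_a == dhash_b and hamming_distance(phash_a, phash_b) < max_distance)
    )
```

The exact-match half of each branch is what lets the database use an index, which is why resampling noise that flips even a single bit in both hashes makes a pair invisible to this query.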

JohannesBuchner commented 2 months ago

OK cool, yes, a pull request is welcome.

JohannesBuchner / imagehash

Add support for ISCC content hash for images (ie. phash variation) #212