commonsmachinery / blockhash

blockhash.io
MIT License

Algorithm improvement #2

Closed jonasob closed 9 years ago

jonasob commented 9 years ago

We're coming across a few images, like the two below, whose hashes end up close enough that the Hamming distance between them falls at or below 10 bits:

http://commons.wikimedia.org/wiki/File:Field,_tracks_and_The_Wrekin_-_geograph.org.uk_-_780016.jpg http://commons.wikimedia.org/wiki/File:N734511468_542359_2084.jpg

It's the age-old problem of images with a lighter shade at the top and a darker shade at the bottom. Not surprisingly, many landscape images are exactly like that :-)

It strikes me that perhaps there's a way to improve the algorithm by normalising images first. This could probably be very elaborate, but a simple mechanism would be to take every second column of the image and flip it 180 degrees. It should likely not have any effect on normal images, but on images like the above, it would result in hashes that are a bit more divergent since the differences between the lighter and darker parts would even out more.
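
To make the idea concrete, here's a minimal sketch of that normalisation step, assuming Pillow is available; flip_alternate_columns is just an illustrative name, not code from this repository:

```python
from PIL import Image

def flip_alternate_columns(im):
    """Return a copy of im with every second column of pixels flipped vertically."""
    out = im.copy()
    width, height = im.size
    src = im.load()
    dst = out.load()
    for x in range(1, width, 2):                  # every second column
        for y in range(height):
            dst[x, y] = src[x, height - 1 - y]    # mirror that column top-to-bottom
    return out

# Usage: normalise first, then feed the result to the unmodified blockhash algorithm.
# im = Image.open("landscape.jpg")
# hash_input = flip_alternate_columns(im)
```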

@artfwo and @petli, what do you think about that?

jonasob commented 9 years ago

I made a test with the two files indicated above. This is what the blockhash algorithm produces today:

ffffffffffffffffffffffff07ff07ff01ff0000000000000000000000000000
ffffffffffffffffffffffff079fc7de63fe0000000000000000000000000000

If I flip every second column, I get these results for the same two files:

01ff03ff07ff0fff07ff001c001c001f001f001c001c07ff0fff0fff03fe01fe
fffffffcffffff9ffc000000000000000000000000007c007fbefffefffc7ffe
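
For reference, here's a minimal sketch of how the bit distance between two such hex hashes can be measured (an illustration, not necessarily the comparison code we actually use):

```python
def hamming_distance(hex_a, hex_b):
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

before = ("ffffffffffffffffffffffff07ff07ff01ff0000000000000000000000000000",
          "ffffffffffffffffffffffff079fc7de63fe0000000000000000000000000000")
after = ("01ff03ff07ff0fff07ff001c001c001f001f001c001c07ff0fff0fff03fe01fe",
         "fffffffcffffff9ffc000000000000000000000000007c007fbefffefffc7ffe")

print(hamming_distance(*before))  # at or below the 10-bit threshold, so the originals collide
print(hamming_distance(*after))   # clearly larger after flipping every second column
```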

jonasob commented 9 years ago

I tried this manually with 1000 files from Commons, and it improved the quality of the hashes by roughly a factor of two. Counting a Hamming distance of <=10 as a match, the baseline comparison gave 3.33% mismatches. With every other column flipped, this is down to 1.52%.

The only remaining images it fails on more drastically are maps, of which there are plenty in Commons; specifically, maps that differ only in that some smaller regions are colored in one but not the other.

This is the Perl (yes, yes) script I used for the rotation: http://pastebin.com/jnFWSFVk

petli commented 9 years ago

Interesting! How did this affect false negatives, i.e. somewhat cropped or similarly modified images? Intuitively it feels like that could get worse if the hash starts mixing up the image. For our use case avoiding false positives should be more important than false negatives, though.

The rotation could easily be done when building up the bitstring from the calculated block values, to avoid image manipulation in the code.

jonasob commented 9 years ago

I never thought of doing this in the algorithm instead of actually rotating the image. I would like to see @artfwo implement it in the C version and then run his previous test suite so we can compare numbers.


petli commented 9 years ago

I misunderstood: I thought you rotated every other column of blocks, not every other column of pixels. However, the latter can also be done in the algorithm; it's just a matter of indexing the pixels correctly.

But flipping at the pixel level means that cropping the left side by an odd number of pixels would result in a wildly different hash, and resizes might also end up with larger distances than on unmodified images. Not sure that price is worth paying.

A different approach could be to do two-pass hashing to increase the contrast, e.g.:

Perhaps 2x2 groups give even better contrast.

petli commented 9 years ago

On further thought, on typical images the details are spread out over enough columns that it doesn't matter much which exact columns are flipped.

Flipping the columns results in partially matching patterns in the hash: compare the first four hex digits to the last four, then the next four to the second-to-last four, and so on. It still brings out a bit more detail than the unflipped version, though.

artfwo commented 9 years ago

Comparison of Hamming distances for the original algorithm vs. column flipping, on the same transformations:

https://docs.google.com/a/commonsmachinery.se/spreadsheets/d/1C7oA12khUgCtZzUgulAPsjc75edMtcoiSutMyOxfll4/edit?usp=sharing

jonasob commented 9 years ago

In the same document there is also a third variation, in which we split the image into four separate blocks, calculate a 64-bit hash on each block, and then assemble them into a complete 256-bit hash. I was hypothesizing that, since the main problems we've seen are in images with high contrast between top and bottom, having a separate median value for each height/4 block ought to bring out more detail from each block.

The original idea was to do blocks like this:

Block 1 Block 2
Block 3 Block 4

But the test which is in that document is a simpler version which simply divides into four equal horizontal blocks:

Block 1
Block 2
Block 3
Block 4

It doesn't seem like column flipping or block division has any positive effect on the matching. So the best option seems to be to stay with the original algorithm, but be more conservative about the maximum Hamming distance allowed to signal a true match.

artfwo commented 9 years ago

I'll try implementing the method suggested by @petli and add results to the table too.

jonasob commented 9 years ago

Once we have the results of that, I have another general idea. Since matching works well in general, and the problem is only with the 5-6% of images that generate hashes that are far from unique, we could potentially generate a secondary hash that is more distinctive for that set of images.

So the logic would be something like: if blockhash(image) returns a value that matches some blacklist (00/ff for instance, within a 10-bit variation), then calculate othermorecumbersomealgorithm(image).

Of course, that hinges on us finding othermorecumbersomealgorithm, and that othermorecumbersomealgorithm isn't actually as fast and accurate as blockhash :-)
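
A rough sketch of what that fallback logic could look like (blockhash, othermorecumbersomealgorithm and the blacklist are placeholders here, just to illustrate the idea):

```python
def hamming_distance(a, b):
    return bin(int(a, 16) ^ int(b, 16)).count("1")

def robust_hash(image, blacklist_hashes, max_distance=10):
    primary = blockhash(image)                     # the normal, fast hash
    # If the primary hash is within 10 bits of a known degenerate hash
    # (e.g. the nearly all-00/ff hashes that landscape images tend to produce),
    # fall back to the slower but more distinctive secondary algorithm.
    for bad in blacklist_hashes:
        if hamming_distance(primary, bad) <= max_distance:
            return othermorecumbersomealgorithm(image)
    return primary
```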

jonasob commented 9 years ago

There's now a new, improved version on https://github.com/commonsmachinery/blockhash/tree/m4-improv (the m4-improv branch) to check out. The only change is in the way it calculates the median. Instead of calculating a single median for the entire image, it calculates four different medians for four separate horizontal blocks, like so:

Block 1
Block 2
Block 3
Block 4

In my tests, this creates an algorithm that matches pretty closely to the original. It loses some matches at max_error 10, but nothing extreme. On the other hand, it significantly improves collision rates, from 35 collisions per 1000 with our original algorithm to 12 collisions per 1000 with this one (random sample from Wikimedia Commons).
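
For concreteness, a minimal sketch of the per-band median idea, assuming a 16x16 grid of block values stored row-major; the actual code in the m4-improv branch may differ in the details:

```python
import statistics

def bits_from_blocks_m4(block_values, blocks=16, bands=4):
    """Threshold each horizontal band of block values against that band's own median."""
    rows_per_band = blocks // bands                # 4 rows of blocks per band
    bits = []
    for band in range(bands):
        start = band * rows_per_band * blocks
        band_values = block_values[start:start + rows_per_band * blocks]
        m = statistics.median(band_values)         # per-band median instead of a global one
        bits.extend(1 if v > m else 0 for v in band_values)
    return bits
```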

petli commented 9 years ago

The horizontal blocks seem like a very good idea for handling typical image structures, and the code looks correct.

I'm not sure if anything would be improved by also doing the XOR with the full-image hash, but if it can be tested easily it's worth checking. An alternative might be to do four vertical blocks, and XOR the bits from them with the bits from the horizontal blocks.

But if horizontal blocks alone get us down to a 1.2% collision rate, that sounds plenty good enough, and it has the virtue of being the easiest of the alternatives to specify and implement.

jonasob commented 9 years ago

I've tried XOR previously and the results are horrible. I haven't thought about why, but any attempt at XOR'ing values together gives a very good and uniform distribution in terms of avoiding collisions, yet very bad results when matching images against each other.

petli commented 9 years ago

m4 it is, then.

The test data need to be updated when the code is merged, btw.

petli commented 9 years ago

Do we have (or could produce) some updated metrics for the m4 algorithm for the RFC draft appendix? https://github.com/commonsmachinery/blockhash-rfc/blob/m4/appendices.md

I guess some scripts were used to create this earlier, but I haven't seen them anywhere.

jonasob commented 9 years ago

On a sample of 100,000 random images from Wikimedia Commons, the algorithm generated colliding matches for 1,036 images, meaning it has a collision rate of roughly 1%. The vast majority of such collisions are cases where two to four images generate the same hash; only rarely does a hash match more than four images.

Number of images sharing the same hash   Occurrences
2           247
3           51
4           19
5           8
6           9
7           3
8 or more   16

Taking 4,000 unique sample images (not counting collisions) and doing a pairwise comparison of them, calculating the Hamming distance between each hash and every other hash, we get the following distribution of Hamming distances (up to 10 bits); a sketch of such a comparison follows the table:

Hamming distance   Occurrences   Occurrences (%)
1                  0             0%
2                  2             0.05%
3                  2             0.05%
4                  7             0.175%
5                  7             0.175%
6                  25            0.625%
7                  30            0.75%
8                  47            1.175%
9                  55            1.375%
10                 73            1.825%
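
For reference, here's a sketch of the kind of pairwise comparison described above, counting how many hash pairs fall at each Hamming distance up to 10 bits (illustrative only, not the original script):

```python
from collections import Counter
from itertools import combinations

def distance_distribution(hashes, max_distance=10):
    """Count hash pairs at each Hamming distance up to max_distance bits."""
    dist = Counter()
    for a, b in combinations(hashes, 2):
        d = bin(int(a, 16) ^ int(b, 16)).count("1")
        if d <= max_distance:
            dist[d] += 1
    return dict(sorted(dist.items()))
```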