JohannesBuchner / imagehash

A Python Perceptual Image Hashing Module
BSD 2-Clause "Simplified" License
3.28k stars 331 forks source link

Efficient crop resistance #120

Closed SpangleLabs closed 4 years ago

SpangleLabs commented 4 years ago

I think this is an implementation of the Efficient Crop-Resistant Image Hashing paper, mentioned here: https://github.com/JohannesBuchner/imagehash/issues/117

There's some ambiguities in the paper, and I've emailed the first author, but I've not (yet) heard back. So I've had to make some assumptions, as well as some optimisations. Honestly, this proved a bit trickier than I expected, as a result.

I'm not fully sold on the name segmented_hash, maybe crop_resistant_hash might be better, but it's rather long. I do need to add a test for the actual.. crop resistance! But I wanted to get a travis run, so I can see how it does in python2 (I can almost guarantee I'll need to make changes.)

I somewhat regret planning this as my first Hacktoberfest PR this year, but I really hope it is helpful. (And, as the rules of Hacktoberfest have changed a little this year, I would really appreciate if you tagged the PR with hacktoberfest-accepted once you think it looks good enough)

I'm gonna need to head to sleep now though, tomorrow I'll have to add a couple images and update this with a crop-resistance test. I'll probably have to fix python2 errors too. I'm marking it as draft in the meanwhile.

coveralls commented 4 years ago

Coverage Status

Coverage increased (+4.1%) to 88.971% when pulling 12cb134cfa14ffed14eccde9ed17c35cfe2e8da7 on joshcoales:efficient-crop-resistance into 4de9becdb13ecad67b7393cc17b5a44ea1c61b6b on JohannesBuchner:master.

coveralls commented 4 years ago

Coverage Status

Coverage increased (+4.4%) to 89.219% when pulling d180db03f143a892da2bdafc9401b6ea7ddaef45 on joshcoales:efficient-crop-resistance into 4de9becdb13ecad67b7393cc17b5a44ea1c61b6b on JohannesBuchner:master.

JohannesBuchner commented 4 years ago

Great! I am looking forward to some demos and tests. No problem with the tagging.

Yes, the naming should be consistent. If the comment of the function starts with "this implements crop resistant hashing", that should probably be the name of the function. I guess it could be cropres_hash or crhash if you want it short, although it doesn't need to be short.

SpangleLabs commented 4 years ago

Implemented most those suggestions, though I still have a question as to whether the guassian and median filters should be tunable, and provided as arguments, or if they should just use the default values for the filters.

I've added a test for crop resistance too, still surprised I missed that one out! I've used the peppers image from SIPI which is also the image used in the paper. I could have saved a lot of tuning by trying it on that image earlier, but I was happy to see my segmentation looks very similar to the segmentation in the paper.

Also, it seems travis isn't showing up on this PR? But I can see it is running: https://travis-ci.org/github/JohannesBuchner/imagehash/pull_requests And I'm super surprised to see that it works in python2 after only a little change. (Though, it does pour out a lot of warnings about python2 being deprecated. But it looks like you still have 1000-2000 downloads per day by python2 users, so I'm not sure how much you value supporting that)

SpangleLabs commented 4 years ago

I wasn't sure also whether to implement a ImageMultiHash.__sub__ method, but it would need to return some sort of Difference object, or something. To hold info on the number of segment matches, and the hamming distance between each. It seemed like more hassle than it is necessarily worth?

JohannesBuchner commented 4 years ago

Please check the license on any image you include in the test (it can be more than one). We need to be allowed to redistribute.

JohannesBuchner commented 4 years ago

I wasn't sure also whether to implement a ImageMultiHash.__sub__ method, but it would need to return some sort of Difference object, or something. To hold info on the number of segment matches, and the hamming distance between each. It seemed like more hassle than it is necessarily worth?

For __sub__ to behave like ImageHash, presumably it should expose matches somehow. Perhaps return 1/matches or -matches, so that it increases with few matches?

JohannesBuchner commented 4 years ago

To test the implementation, could you try cropping the border away in increasing fractions (automatic cropping), and perhaps also trying asymmetrically. Also, try varying the hash function and test a few images.

SpangleLabs commented 4 years ago

Please check the license on any image you include in the test (it can be more than one). We need to be allowed to redistribute.

Oops, it seems the Peppers image is unknown copyright: http://sipi.usc.edu/database/copyright.php That seems rather silly. I'll swap with this public domain image: https://www.publicdomainpictures.net/en/view-image.php?image=365025&picture=colorful-bell-peppers

For sub to behave like ImageHash, presumably it should expose matches somehow. Perhaps return 1/matches or -matches, so that it increases with few matches?

The issue is ensuring that if the number of regions matched is a tie, it goes by decreasing sum of hamming distance. That could be implemented by just returning a tuple of (-matches, sum(hamming_distances))?

To test the implementation, could you try cropping the border away in increasing fractions (automatic cropping), and perhaps also trying asymmetrically.

Okay, I've removed the fixed crop image, and I'm doing the cropping in the test

Also, try varying the hash function and test a few images.

I have a test that swaps out the hash function, in test_segmented_hash__hash_func(), Do you want one that checks that with a different hash function, the image still matches?

SpangleLabs commented 4 years ago

Have you had any more chance to look at this? Doesn't need merging urgently, but I would really appreciate if you could give it the hacktoberfest-accepted label to mark it as not-spam for the event

JohannesBuchner commented 4 years ago

For sub to behave like ImageHash, presumably it should expose matches somehow. Perhaps return 1/matches or -matches, so that it increases with few matches?

The issue is ensuring that if the number of regions matched is a tie, it goes by decreasing sum of hamming distance. That could be implemented by just returning a tuple of (-matches, sum(hamming_distances))?

Can you make sum(hamming_distances) so that it is between 0 and 1 (exclusive), so that it can be added to matches to break ties?

To test the implementation, could you try cropping the border away in increasing fractions (automatic cropping), and perhaps also trying asymmetrically.

Okay, I've removed the fixed crop image, and I'm doing the cropping in the test

Also, try varying the hash function and test a few images.

I have a test that swaps out the hash function, in test_segmented_hash__hash_func(), Do you want one that checks that with a different hash function, the image still matches?

Sorry, I was being unclear. I think having a static test image with a crop is good to see directly what the test is comparing.

What I meant was, whether you can write a short script on your computer which takes an image, and with a loop crops more and more away, and reports how the hash distance changes. Just for reporting here how well the implementation works. That code does not need to go into the repository, although there could be space for it in examples/ or in tests/ with a non-'test_' prefix. I did not mean to add it as a unit test.

In the main README, there is a "example results" section which illustrates/reports a bit how well various hashes work. I imagined a similar demo for this hash.

SpangleLabs commented 4 years ago

For sub to behave like ImageHash, presumably it should expose matches somehow. Perhaps return 1/matches or -matches, so that it increases with few matches?

The issue is ensuring that if the number of regions matched is a tie, it goes by decreasing sum of hamming distance. That could be implemented by just returning a tuple of (-matches, sum(hamming_distances))?

Can you make sum(hamming_distances) so that it is between 0 and 1 (exclusive), so that it can be added to matches to break ties?

I can scale sum(hamming_distances) by the number of matches and the max distance per match (the bit length of the hash), but the issue then is that they are backwards? Higher matches is good, higher sum distance is bad. I could do 1-{scaled hamming distance sum}? I suppose that works, seems a bit ugly. Done this in ba7d547843cd06649ebb7481e4a4e437fb0adb12 and 6a5bbc7b73d5de9a73af30e780ab4fdca860e7f6

To test the implementation, could you try cropping the border away in increasing fractions (automatic cropping), and perhaps also trying asymmetrically.

Okay, I've removed the fixed crop image, and I'm doing the cropping in the test

Also, try varying the hash function and test a few images.

I have a test that swaps out the hash function, in test_segmented_hash__hash_func(), Do you want one that checks that with a different hash function, the image still matches?

Sorry, I was being unclear. I think having a static test image with a crop is good to see directly what the test is comparing.

What I meant was, whether you can write a short script on your computer which takes an image, and with a loop crops more and more away, and reports how the hash distance changes. Just for reporting here how well the implementation works. That code does not need to go into the repository, although there could be space for it in examples/ or in tests/ with a non-'test_' prefix. I did not mean to add it as a unit test.

In the main README, there is a "example results" section which illustrates/reports a bit how well various hashes work. I imagined a similar demo for this hash.

Still working on that, head's a bit of a mess atm due to moving house this week, sorry

SpangleLabs commented 4 years ago

Okay, I've added two example scripts now.

The first one is examples/crop_resistant_segmentation.py, and I reformatted the method in imagehash a bit for it. But it just draws the segments. I found that a very pretty thing to see, to understand what's going on.

The second is examples/crop_resistance.py, which crops more and more from an image and checks the match results. It gives some odd results for peppers.png, but gave some better results for the illegal peppers image from the research paper. It seems rather variable when cropping evenly from every side, which is expected. The algorithm assumes that some of the "segments" from the original image will remain in the cropped version, i.e. that features will be cropped towards, rather than just the image center. This is not always true.

peppers.png cropping:

Cropped 5% from each side. Hash has 2 matching segments with 21 total hamming distance
Cropped 10% from each side. Hash has 1 matching segments with 16 total hamming distance
Cropped 15% from each side. Hash has 0 matching segments with 0 total hamming distance
Cropped 20% from each side. Hash has 1 matching segments with 16 total hamming distance
Cropped 25% from each side. Hash has 0 matching segments with 0 total hamming distance
Cropped 30% from each side. Hash has 0 matching segments with 0 total hamming distance
Cropped 35% from each side. Hash has 1 matching segments with 14 total hamming distance
Cropped 40% from each side. Hash has 0 matching segments with 0 total hamming distance
Cropped 45% from each side. Hash has 0 matching segments with 0 total hamming distance

I'm not totally certain why it unmatches, then matches again. I'm guessing that is just hash collisions?

Illegal peppers cropping:

Cropped 5% from each side. Hash has 3 matching segments with 16 total hamming distance
Cropped 10% from each side. Hash has 1 matching segments with 2 total hamming distance
Cropped 15% from each side. Hash has 2 matching segments with 22 total hamming distance
Cropped 20% from each side. Hash has 1 matching segments with 16 total hamming distance
Cropped 25% from each side. Hash has 1 matching segments with 11 total hamming distance
Cropped 30% from each side. Hash has 0 matching segments with 0 total hamming distance
Cropped 35% from each side. Hash has 0 matching segments with 0 total hamming distance
Cropped 40% from each side. Hash has 0 matching segments with 0 total hamming distance
Cropped 45% from each side. Hash has 0 matching segments with 0 total hamming distance
JohannesBuchner commented 4 years ago

There are some travis errors https://travis-ci.org/github/JohannesBuchner/imagehash/jobs/739137620:

======================================================================

ERROR: test_segmented_hash (tests.test_crop_resistant_hash.Test)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/build/JohannesBuchner/imagehash/tests/test_crop_resistant_hash.py", line 30, in test_segmented_hash

    original_hash.best_match(other_hashes),

  File "/home/travis/build/JohannesBuchner/imagehash/imagehash.py", line 447, in best_match

    key=lambda other_hash: self.__sub__(other_hash, hamming_cutoff, bit_error_rate)

  File "/home/travis/build/JohannesBuchner/imagehash/imagehash.py", line 447, in <lambda>

    key=lambda other_hash: self.__sub__(other_hash, hamming_cutoff, bit_error_rate)

  File "/home/travis/build/JohannesBuchner/imagehash/imagehash.py", line 396, in __sub__

    return matches + (1 - (sum_distance / max_distance))

ZeroDivisionError: division by zero

======================================================================

FAIL: test_crop_resistance (tests.test_crop_resistant_hash.Test)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/build/JohannesBuchner/imagehash/tests/test_crop_resistant_hash.py", line 102, in test_crop_resistance

    self.assertEqual(crop_hash_40, full_hash, "Heavily cropped image hash should match full image hash")

AssertionError: <imagehash.ImageMultiHash object at 0x7f36fee65b00> != <imagehash.ImageMultiHash object at 0x7f370387d710> : Heavily cropped image hash should match full image hash

and also:

Traceback (most recent call last):

  File "find_similar_images.py", line 77, in <module>

    find_similar_images(userpaths=userpaths, hashfunc=hashfunc)

  File "find_similar_images.py", line 28, in find_similar_images

    if hash in images:

TypeError: unhashable type: 'ImageMultiHash'

I think you need to implement __hash__, which should return an int (e.g., for dicts to be able to hash it).

SpangleLabs commented 4 years ago

Ah, damn. Also, I'm realising that sub() is just backward, right? Zero matches => sub() returns zero. That's totally now what we want... Urgh, sorry! I'm not really sure the best way to implement that, do you have any ideas?

SpangleLabs commented 4 years ago

Ah, I can just take the segment count and subtract the match count. Doing that in #121 and fixing the divide by zero issue