bpepple / metron

Django website for a comic book database
https://metron.cloud/
GNU General Public License v3.0

Add Image Hash for Cover #170

Closed: bpepple closed this issue 1 year ago

bpepple commented 1 year ago

Anyway, I think this is a pretty good idea, and it's probably worth opening a Feature Request issue for it.

Originally posted by @bpepple in https://github.com/bpepple/metron/discussions/164#discussioncomment-5916373

mizaki commented 1 year ago

i. I've checked the difference between IH and CT average hashing. The final result looks different, but bit-wise they are the same; it's only the conversion of the binary result that differs (see the sketch after point ii). Not an issue anyway, tbh. Whether it's worth storing both the average hash and the perceptual hash is a question.

ii. It makes sense to my mind to hash the variant covers. CT will check those for cover image matches. I can imagine someone using their favourite variant cover as the first image.
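
For illustration, a minimal sketch (using the imagehash library and a hypothetical cover file) of how the same underlying bits can be rendered as different strings, which is the kind of encoding difference described in point i:

from PIL import Image
import imagehash

img = Image.open("cover.jpg")        # hypothetical cover file
ahash = imagehash.average_hash(img)  # an 8x8 matrix of booleans under the hood

print(str(ahash))                    # imagehash's hex rendering of those bits
bits = ahash.hash.flatten()          # the raw bits both implementations agree on
as_int = sum(int(b) << i for i, b in enumerate(bits))
print(hex(as_int))                   # a different-looking string built from the same bits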

bpepple commented 1 year ago

ii. It makes sense to my mind to hash the variant covers. CT will check those for cover image matches. I can imagine someone using their favourite variant cover as the first image.

I find it extremely unlikely it's worth the effort to add hashes for variant covers, due to:

bpepple commented 1 year ago

Did some quick testing this morning, and adding a cover hash to existing issues in the database would just need a management command like the following:

from typing import Any

import imagehash
from PIL import Image
from django.core.management.base import BaseCommand

from comicsdb.models.issue import Issue


class Command(BaseCommand):
    help = "Add missing cover hashes"

    def handle(self, *args: Any, **options: Any) -> None:
        missing_count = Issue.objects.exclude(image="").filter(cover_hash="").count()
        while missing_count > 0:
            # Evaluate the slice up front so the objects modified below are the
            # same ones handed to bulk_update().
            issues = list(Issue.objects.exclude(image="").filter(cover_hash="")[:250])
            if not issues:
                break
            hashed = []
            for i in issues:
                try:
                    cover_hash = imagehash.phash(Image.open(i.image))
                except OSError as e:
                    print(f"Skipping {i}. Error: {e}")
                    continue
                i.cover_hash = str(cover_hash)
                hashed.append(i)
                print(f"Set cover hash of '{cover_hash}' for '{i}'")

            # Only write back the issues that actually received a hash.
            update_count = Issue.objects.bulk_update(hashed, ["cover_hash"])
            print(f"Updated {update_count} issues in the database")
            # Count the whole batch as processed so unreadable images don't
            # keep the loop spinning on the same records.
            missing_count -= len(issues)
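
Assuming this is saved as something like comicsdb/management/commands/add_cover_hashes.py (the command name here is hypothetical), it would be run with python manage.py add_cover_hashes.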

I still need to add functionality to generate the cover hash when creating an issue, and to add the information to the REST API, but overall it's a fairly easy task.
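
For generating the hash at creation time, one possible hook point is the model's save(); this is only a rough sketch with a trimmed-down Issue model (field definitions here are assumptions), and the actual implementation may compute the hash elsewhere, e.g. in the create view or a signal:

import imagehash
from PIL import Image
from django.db import models


class Issue(models.Model):
    # Trimmed-down illustration; the real model has many more fields.
    image = models.ImageField(upload_to="issue/", blank=True)
    cover_hash = models.CharField(max_length=16, blank=True)

    def save(self, *args, **kwargs):
        # Compute the hash only when an image exists and no hash is stored yet.
        if self.image and not self.cover_hash:
            try:
                self.cover_hash = str(imagehash.phash(Image.open(self.image)))
            except OSError:
                pass  # leave the hash empty if the image can't be read
        super().save(*args, **kwargs)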

mizaki commented 1 year ago

Interested to know if there are any/many duplicates after it's done.

bpepple commented 1 year ago

Interested to know if there are any/many duplicates after it's done.

Running a query on my test database I get the following:

>>> Issue.objects.exclude(cover_hash="").count()
64773
>>> dupes = Issue.objects.exclude(cover_hash="").values("cover_hash").annotate(Count('id')).order_by().filter(id__count__gt=1)
>>> len(dupes)
5
>>> dupes
<QuerySet [{'cover_hash': 'b6d2482d78e267a6', 'id__count': 2}, {'cover_hash': 'c5b4384f7438cd78', 'id__count': 2}, {'cover_hash': 'd217f91f60255a33', 'id__count': 2}, {'cover_hash': 'c1b03e4f727acc25', 'id__count': 2}, {'cover_hash': 'c2ed6a06b93b52ca', 'id__count': 2}]>
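
(For completeness: the shell sessions here assume Count has already been imported with from django.db.models import Count.)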

Out of 64,773 issues, there seem to be 5 instances where the cover hash is the same. Not sure whether increasing the hash size when generating the value would prevent dupes or not. 🤷

mizaki commented 1 year ago

Out of 64,773 issues, there seem to be 5 instances where the cover hash is the same. Not sure whether increasing the hash size when generating the value would prevent dupes or not. 🤷

I'm wondering what those covers are now :)

It at least looks possible that most issues might be identifiable by hash alone, falling back to the series issue list and Hamming distance, and then to plain series and issue number (and any other fields like year, etc.).

bpepple commented 1 year ago

I'm wondering what those covers are now :)

I'm going to look at those 5 dupes tomorrow; it's very possible they have a legitimate reason for having similar covers (like a reprint, TPB, etc.).

bpepple commented 1 year ago

I'm wondering what those covers are now :)

Ok, running the following on the test database:

>>> qs = Issue.objects.exclude(cover_hash="").values("cover_hash").annotate(Count('id')).order_by().filter(id__count__gt=1)
>>> for ch in qs:
...     result = Issue.objects.filter(cover_hash=ch["cover_hash"])
...     for i in result:
...             print(f"{i}")
...     print("----------")
... 
House of Slaughter (2021) #1
House of Slaughter TPB (2022) #1
----------
Crossover (2020) #12
Crossover TPB (2021) #2
----------
Manhunter (1988) #16
Manhunter (1988) #17
----------
Crossover (2020) #1
Crossover TPB (2021) #1
----------
Apache Delivery Service (2022) #1
Apache Delivery Service TPB (2022) #1
----------

This produces what I was sort of expecting. All those TPBs have the same image as the cover of the corresponding issue.

The only result that was out of whack was the Manhunter match. Looking into it further, the hashing actually caught an error in the database where the two issues had the same image (in this case issue 16 of the 1988 Manhunter series, which I'll correct on the production DB later today). So, based on this dataset, it seems the only time it will produce a duplicate value is when a corresponding issue (like a TPB) has the same cover.

With that being said, it might still be a good idea to investigate if increasing the hash_size from the default of 8 to something like 10 or 12 would be beneficial.
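
As a rough way to check, imagehash's phash() takes a hash_size argument, so comparing one of the colliding pairs at a few sizes could look like this (the file names below are hypothetical local copies of the covers):

from PIL import Image
import imagehash

issue_cover = Image.open("crossover_1.jpg")
tpb_cover = Image.open("crossover_tpb_1.jpg")

for size in (8, 10, 12):
    h1 = imagehash.phash(issue_cover, hash_size=size)
    h2 = imagehash.phash(tpb_cover, hash_size=size)
    # Report whether the string form still collides, plus the bit-level distance.
    print(size, str(h1) == str(h2), h1 - h2)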

mizaki commented 1 year ago

With that being said, it might still be a good idea to investigate if increasing the hash_size from the default of 8 to something like 10 or 12 would be beneficial.

It might with House of Slaughter, but I think Crossover would be more difficult. Also, it appears that 8 has become the de facto standard hash size; is going against that worth it for the (seemingly) minimal benefit?

bpepple commented 1 year ago

It at least looks possible that most issues might be identifiable by hash alone, falling back to the series issue list and Hamming distance, and then to plain series and issue number (and any other fields like year, etc.).

My guess is it will still be better to search by series & issue number, since the API filter on the cover hash will need an exact match. For newer issues, filtering on the cover hash would most likely work, but for older issues where the cover images aren't that great, I'd have my doubts about how well that would work. Anyway, that's something that could be tested.
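
For reference, if the REST API uses django-filter (an assumption here), an exact-match filter on the cover hash might look something like this, which is why any fuzzy matching would have to happen on the client side:

import django_filters

from comicsdb.models.issue import Issue


class IssueFilter(django_filters.FilterSet):
    # Exact lookup only: the client must already know the precise hash string.
    cover_hash = django_filters.CharFilter(lookup_expr="exact")

    class Meta:
        model = Issue
        fields = ["cover_hash"]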

I'm pretty much done writing this feature, but I still need to make a final decision on the hashing method to use (pHash or aHash). Once that is made, I should be able to push this to production.

mizaki commented 1 year ago

My guess is it will still be better to search by series & issue number, since the API filter on the cover hash will need an exact match. For newer issues, filtering on the cover hash would most likely work, but for older issues where the cover images aren't that great, I'd have my doubts about how well that would work. Anyway, that's something that could be tested.

My thinking is for cases where there is no issue number (failed filename parse, etc.). The issue list for the series could be grabbed, including the hashes, and then Hamming distance could be used, etc. That is possible now, but it's obviously wasteful (and rude) to the API/website because all the issue covers for that series would need to be downloaded.
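
A client-side sketch of that fallback, with assumed field names for the issue-list payload, could look like:

import imagehash
from PIL import Image


def closest_issue(local_cover_path, issues):
    """Pick the issue whose stored cover hash is closest to the local cover.

    `issues` is assumed to be a list of dicts from the series' issue-list
    endpoint, each carrying a "cover_hash" hex string.
    """
    local_hash = imagehash.phash(Image.open(local_cover_path))
    best, best_dist = None, None
    for entry in issues:
        if not entry.get("cover_hash"):
            continue
        # Subtracting two ImageHash objects gives the Hamming distance.
        dist = local_hash - imagehash.hex_to_hash(entry["cover_hash"])
        if best_dist is None or dist < best_dist:
            best, best_dist = entry, dist
    return best, best_dist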

bpepple commented 1 year ago

My thinking is for cases where there is no issue number (failed filename parse, etc.). The issue list for the series could be grabbed, including the hashes, and then Hamming distance could be used, etc. That is possible now, but it's obviously wasteful (and rude) to the API/website because all the issue covers for that series would need to be downloaded.

Doesn't that primarily happen for things like a TPB or GN that don't have an issue number? Regardless, using an issue list for the series seems to be a good solution.

I'm most likely going to merge this tomorrow, since I haven't gotten much feedback regarding the hashing method. Currently, I'm planning to use pHash, unless I hear a compelling argument before then.