bpepple / metron

Django website for a comic book database
https://metron.cloud/
GNU General Public License v3.0

Add Image Hash for Cover #170

Closed: bpepple closed this issue 1 year ago

bpepple commented 1 year ago

Anyway, I think this is a pretty good idea, and it's probably worth opening a Feature Request issue for it.

Originally posted by @bpepple in https://github.com/bpepple/metron/discussions/164#discussioncomment-5916373

mizaki commented 1 year ago

i. I've checked the difference between IH and CT average hashing. The final result looks different, but bit-wise they are the same; it's only the conversion of the binary result that differs (see the sketch after point ii). Not an issue anyway, tbh. Whether it's worth storing both the average hash and the perceptual hash is a question.

ii. It makes sense to my mind to hash the variant covers. CT will check those for cover image matches. I can imagine someone using their favourite variant cover as the first image.
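
For illustration, a minimal sketch (using the imagehash library and a hypothetical cover file) of how the same underlying bits can be rendered as different strings, which is the kind of encoding difference described in point i:

from PIL import Image
import imagehash

img = Image.open("cover.jpg")        # hypothetical cover file
ahash = imagehash.average_hash(img)  # an 8x8 matrix of booleans under the hood

print(str(ahash))                    # imagehash's hex rendering of those bits
bits = ahash.hash.flatten()          # the raw bits both implementations agree on
as_int = sum(int(b) << i for i, b in enumerate(bits))
print(hex(as_int))                   # a different-looking string built from the same bits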

bpepple commented 1 year ago

ii. It makes sense to my mind to hash the variant covers. CT will check those for cover image matches. I can imagine someone using their favourite variant cover as the first image.

I find it extremely unlikely it's worth the effort to add hashes for variant covers, due to:

bpepple commented 1 year ago

Did some quick testing this morning, and adding a cover hash to existing issues in the database would just need a management command like the following:

from typing import Any

import imagehash
from PIL import Image
from django.core.management.base import BaseCommand

from comicsdb.models.issue import Issue


class Command(BaseCommand):
    help = "Add missing cover hashes"

    def handle(self, *args: Any, **options: Any) -> None:
        missing_count = Issue.objects.exclude(image="").filter(cover_hash="").count()
        while missing_count > 0:
            # Evaluate the slice up front so the objects modified below are the
            # same ones handed to bulk_update().
            issues = list(Issue.objects.exclude(image="").filter(cover_hash="")[:250])
            if not issues:
                break
            hashed = []
            for i in issues:
                try:
                    cover_hash = imagehash.phash(Image.open(i.image))
                except OSError as e:
                    print(f"Skipping {i}. Error: {e}")
                    continue
                i.cover_hash = str(cover_hash)
                hashed.append(i)
                print(f"Set cover hash of '{cover_hash}' for '{i}'")

            # Only write back the issues that actually received a hash.
            update_count = Issue.objects.bulk_update(hashed, ["cover_hash"])
            print(f"Updated {update_count} issues in the database")
            # Count the whole batch as processed so unreadable images don't
            # keep the loop spinning on the same records.
            missing_count -= len(issues)
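
Assuming this is saved as something like comicsdb/management/commands/add_cover_hashes.py (the command name here is hypothetical), it would be run with python manage.py add_cover_hashes.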

I still need to add functionality to generate the cover hash when creating an issue, and to add the information to the REST API, but overall it's a fairly easy task.
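
For generating the hash at creation time, one possible hook point is the model's save(); this is only a rough sketch with a trimmed-down Issue model (field definitions here are assumptions), and the actual implementation may compute the hash elsewhere, e.g. in the create view or a signal:

import imagehash
from PIL import Image
from django.db import models


class Issue(models.Model):
    # Trimmed-down illustration; the real model has many more fields.
    image = models.ImageField(upload_to="issue/", blank=True)
    cover_hash = models.CharField(max_length=16, blank=True)

    def save(self, *args, **kwargs):
        # Compute the hash only when an image exists and no hash is stored yet.
        if self.image and not self.cover_hash:
            try:
                self.cover_hash = str(imagehash.phash(Image.open(self.image)))
            except OSError:
                pass  # leave the hash empty if the image can't be read
        super().save(*args, **kwargs)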

mizaki commented 1 year ago

Interested to know if there are any/many duplicates after it's done.

bpepple commented 1 year ago

Interested to know if there are any/many duplicates after it's done.

Running a query on my test database I get the following:

>>> Issue.objects.exclude(cover_hash="").count()
64773
>>> dupes = Issue.objects.exclude(cover_hash="").values("cover_hash").annotate(Count('id')).order_by().filter(id__count__gt=1)
>>> len(dupes)
5
>>> dupes
<QuerySet [{'cover_hash': 'b6d2482d78e267a6', 'id__count': 2}, {'cover_hash': 'c5b4384f7438cd78', 'id__count': 2}, {'cover_hash': 'd217f91f60255a33', 'id__count': 2}, {'cover_hash': 'c1b03e4f727acc25', 'id__count': 2}, {'cover_hash': 'c2ed6a06b93b52ca', 'id__count': 2}]>
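
(For completeness: the shell sessions here assume Count has already been imported with from django.db.models import Count.)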

Out of 64,773 issues, there seem to be 5 instances where the cover hash is the same. Not sure whether increasing the hash size when generating the value would prevent dupes or not. 🤷

mizaki commented 1 year ago

Out of 64,773 issues, there seem to be 5 instances where the cover hash is the same. Not sure whether increasing the hash size when generating the value would prevent dupes or not. 🤷

I'm wondering what those covers are now :)

It at least looks possible that most issues might be identifiable by hash alone, falling back to the series issue list and Hamming distance, and then to plain series and issue number (and any other fields like year, etc.).

bpepple commented 1 year ago

I'm wondering what those covers are now :)

I'm going to look at those 5 dupes tomorrow; it's very possible they have a legitimate reason for having similar covers (like a reprint, TPB, etc.).

bpepple commented 1 year ago

I'm wondering what those covers are now :)

Ok, running the following on the test database:

>>> qs = Issue.objects.exclude(cover_hash="").values("cover_hash").annotate(Count('id')).order_by().filter(id__count__gt=1)
>>> for ch in qs:
...     result = Issue.objects.filter(cover_hash=ch["cover_hash"])
...     for i in result:
...             print(f"{i}")
...     print("----------")
... 
House of Slaughter (2021) #1
House of Slaughter TPB (2022) #1
----------
Crossover (2020) #12
Crossover TPB (2021) #2
----------
Manhunter (1988) #16
Manhunter (1988) #17
----------
Crossover (2020) #1
Crossover TPB (2021) #1
----------
Apache Delivery Service (2022) #1
Apache Delivery Service TPB (2022) #1
----------

This produces what I was sort of expecting. All those TPBs have the same image as the cover of the corresponding issue.

The only result that was out of whack was the Manhunter match. Looking into it further, the hashing actually caught an error in the database where the two issues had the same image (in this case issue 16 of the 1988 Manhunter series, which I'll correct on the production DB later today). So, based on this dataset, it seems the only time it will produce a duplicate value is when a corresponding issue (like a TPB) has the same cover.

With that being said, it might still be a good idea to investigate if increasing the hash_size from the default of 8 to something like 10 or 12 would be beneficial.
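
As a rough way to check, imagehash's phash() takes a hash_size argument, so comparing one of the colliding pairs at a few sizes could look like this (the file names below are hypothetical local copies of the covers):

from PIL import Image
import imagehash

issue_cover = Image.open("crossover_1.jpg")
tpb_cover = Image.open("crossover_tpb_1.jpg")

for size in (8, 10, 12):
    h1 = imagehash.phash(issue_cover, hash_size=size)
    h2 = imagehash.phash(tpb_cover, hash_size=size)
    # Report whether the string form still collides, plus the bit-level distance.
    print(size, str(h1) == str(h2), h1 - h2)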

mizaki commented 1 year ago

With that being said, it might still be a good idea to investigate if increasing the hash_size from the default of 8 to something like 10 or 12 would be beneficial.

It might with House of Slaughter, but I think Crossover would be more difficult. Also, it appears that 8 has become the de facto standard hash size; is going against that worth it for the (seemingly) minimal benefit?

bpepple commented 1 year ago

It at least looks possible that most issues might be identifiable by hash alone, falling back to the series issue list and Hamming distance, and then to plain series and issue number (and any other fields like year, etc.).

My guess is it will still be better to search by series & issue number, since the API filter on the cover hash will need an exact match. For newer issues, filtering on the cover hash would most likely work, but for older issues where the cover images aren't that great, I'd have my doubts about how well that would work. Anyway, that's something that could be tested.
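
For reference, if the REST API uses django-filter (an assumption here), an exact-match filter on the cover hash might look something like this, which is why any fuzzy matching would have to happen on the client side:

import django_filters

from comicsdb.models.issue import Issue


class IssueFilter(django_filters.FilterSet):
    # Exact lookup only: the client must already know the precise hash string.
    cover_hash = django_filters.CharFilter(lookup_expr="exact")

    class Meta:
        model = Issue
        fields = ["cover_hash"]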

I'm pretty much done writing this feature, but I still need to make a final decision on the hashing method to use (pHash or aHash). Once that is made, I should be able to push this to production.

mizaki commented 1 year ago

My guess is it will still be better to search by series & issue number, since the API filter on the cover hash will need an exact match. For newer issues, filtering on the cover hash would most likely work, but for older issues where the cover images aren't that great, I'd have my doubts about how well that would work. Anyway, that's something that could be tested.

My thinking is for cases where there is no issue number (failed filename parse, etc.). The issue list for the series could be grabbed, including the hashes, and then Hamming distance could be used, etc. That is possible now, but it's obviously wasteful (and rude) to the API/website because all the issue covers for that series would need to be downloaded.
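
A client-side sketch of that fallback, with assumed field names for the issue-list payload, could look like:

import imagehash
from PIL import Image


def closest_issue(local_cover_path, issues):
    """Pick the issue whose stored cover hash is closest to the local cover.

    `issues` is assumed to be a list of dicts from the series' issue-list
    endpoint, each carrying a "cover_hash" hex string.
    """
    local_hash = imagehash.phash(Image.open(local_cover_path))
    best, best_dist = None, None
    for entry in issues:
        if not entry.get("cover_hash"):
            continue
        # Subtracting two ImageHash objects gives the Hamming distance.
        dist = local_hash - imagehash.hex_to_hash(entry["cover_hash"])
        if best_dist is None or dist < best_dist:
            best, best_dist = entry, dist
    return best, best_dist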

bpepple commented 1 year ago

My thinking is for cases where there is no issue number (failed filename parse, etc.). The issue list for the series could be grabbed, including the hashes, and then Hamming distance could be used, etc. That is possible now, but it's obviously wasteful (and rude) to the API/website because all the issue covers for that series would need to be downloaded.

Doesn't that primarily happen for things like a TPB or GN that don't have an issue number? Regardless, using an issue list for the series seems to be a good solution.

I'm most likely going to merge this tomorrow, since I haven't gotten much feedback regarding the hashing method. Currently, I'm planning to use pHash, unless I hear a compelling argument before then.