i. I've checked the difference between IH and CT average hashing. The final result is different, but bit-wise they are the same; it's only the conversion of the binary result to a string that differs (a quick sketch of what I mean is below). Not an issue anyway, tbh. Whether it's worth storing both the average hash and the perceptual hash is a question.
ii. It makes sense to my mind to hash the variant covers. CT will check those for cover image matches. I can imagine someone using their favourite variant cover as the first image.
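As an illustration of (i), here is a minimal sketch using the imagehash library: the same 64-bit average hash can be rendered as different strings depending on how the bits are packed. This is not ComicTagger's actual conversion, just a demonstration; "cover.jpg" is a placeholder path.

import imagehash
from PIL import Image

# Compute an average hash; h.hash is an 8x8 boolean array (64 bits).
h = imagehash.average_hash(Image.open("cover.jpg"))
bits = h.hash.flatten()

# imagehash's own hex rendering of those bits.
print(str(h))

# The same 64 bits packed into an integer in the opposite order produce a
# different hex string, even though the underlying hash is identical.
value = 0
for bit in bits[::-1]:
    value = (value << 1) | int(bit)
print(f"{value:016x}")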
I find it extremely unlikely it's worth the effort to add hashes for variant covers, due to:
Did some quick testing this morning, and adding a cover hash to existing issues in the database would just need a management command like the following:
from typing import Any

from django.core.management.base import BaseCommand
from comicsdb.models.issue import Issue
from PIL import Image
import imagehash


class Command(BaseCommand):
    help = "Add missing cover hashes"

    def handle(self, *args: Any, **options: Any) -> str | None:
        missing_count = Issue.objects.all().exclude(image="").filter(cover_hash="").count()
        while missing_count > 0:
            # Process in batches of 250 so we don't load every issue at once.
            issues = Issue.objects.exclude(image="").filter(cover_hash="")[:250]
            for i in issues:
                try:
                    cover_hash = imagehash.phash(Image.open(i.image))
                except OSError as e:
                    # Unreadable or missing image file; skip it.
                    print(f"Skipping {i}. Error: {e}")
                    continue
                i.cover_hash = str(cover_hash)
                print(f"Set cover hash of '{cover_hash}' for '{i}'")
            # bulk_update() returns the number of rows updated (Django 4.0+).
            update_count = Issue.objects.bulk_update(issues, ["cover_hash"])
            print(f"Updated {update_count} issues in the database")
            missing_count -= update_count
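For reference, assuming the file is saved under comicsdb/management/commands/ as something like add_cover_hashes.py (a hypothetical name), it would be run with python manage.py add_cover_hashes.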
Still need to add functionality to generate the cover hash when creating an issue, and to add the information to the REST API, but overall it's a fairly easy task.
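A minimal sketch of the creation side, assuming the hash is computed in the model's save() method; field names mirror the management command above, and this isn't necessarily how Metron ends up wiring it:

import imagehash
from django.db import models
from PIL import Image


class Issue(models.Model):
    image = models.ImageField(upload_to="issue/", blank=True)
    cover_hash = models.CharField(max_length=32, blank=True)

    def save(self, *args, **kwargs):
        # Only hash when there's an image and no hash has been stored yet.
        if self.image and not self.cover_hash:
            try:
                self.cover_hash = str(imagehash.phash(Image.open(self.image)))
            except OSError:
                # Unreadable image; leave the hash empty rather than failing the save.
                self.cover_hash = ""
        super().save(*args, **kwargs)

Exposing it through the REST API is then mostly a matter of adding cover_hash to the issue serializer's fields.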
Interested to know if there are any/many duplicates after it's done.
Running a query on my test database, I get the following:
>>> from django.db.models import Count
>>> Issue.objects.exclude(cover_hash="").count()
64773
>>> dupes = Issue.objects.exclude(cover_hash="").values("cover_hash").annotate(Count('id')).order_by().filter(id__count__gt=1)
>>> len(dupes)
5
>>> dupes
<QuerySet [{'cover_hash': 'b6d2482d78e267a6', 'id__count': 2}, {'cover_hash': 'c5b4384f7438cd78', 'id__count': 2}, {'cover_hash': 'd217f91f60255a33', 'id__count': 2}, {'cover_hash': 'c1b03e4f727acc25', 'id__count': 2}, {'cover_hash': 'c2ed6a06b93b52ca', 'id__count': 2}]>
Out of 64,773 issues, there seem to be 5 instances where the cover hash is the same. Not sure if increasing the hash size when generating the value would prevent dupes or not. 🤷
I'm wondering what those covers are now :)
It at least looks possible that most issues might be identifiable with the hash alone, falling back to the series issue list and Hamming distance, and then just going by series and issue number (and any other fields like year, etc.).
I'm going to look at those 5 dupes tomorrow; it's very possible they could have a legitimate reason for having similar covers (like a reprint, TPB, etc.).
Ok, running the following on the test database:
>>> qs = Issue.objects.exclude(cover_hash="").values("cover_hash").annotate(Count('id')).order_by().filter(id__count__gt=1)
>>> for ch in qs:
...     result = Issue.objects.filter(cover_hash=ch["cover_hash"])
...     for i in result:
...         print(f"{i}")
...     print("----------")
...
House of Slaughter (2021) #1
House of Slaughter TPB (2022) #1
----------
Crossover (2020) #12
Crossover TPB (2021) #2
----------
Manhunter (1988) #16
Manhunter (1988) #17
----------
Crossover (2020) #1
Crossover TPB (2021) #1
----------
Apache Delivery Service (2022) #1
Apache Delivery Service TPB (2022) #1
----------
This produces what I was sort of expecting: all those TPBs have the same image as the cover of the corresponding issues.
The only result that was out-of-whack was the Manhunter match. Looking into it further, the hashing actually caught an error in the database where the two issues had the same image (in this case issue 16 of the 1988 Manhunter series), which I'll correct on the production DB later today. So, based on this dataset, it seems the only time it will produce a duplicate value is when a corresponding issue (like a TPB) has the same cover.
With that being said, it might still be a good idea to investigate if increasing the hash_size from the default of 8 to something like 10 or 12 would be beneficial.
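For reference, the size bump is just a parameter on the imagehash call; the main cost is a longer stored string (and a wider DB column). A quick sketch, with 10 being only the value floated above and "cover.jpg" a placeholder:

import imagehash
from PIL import Image

img = Image.open("cover.jpg")
print(imagehash.phash(img))                # default hash_size=8 -> 64 bits, 16 hex chars
print(imagehash.phash(img, hash_size=10))  # 100 bits, 25 hex chars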
It might with House of Slaughter, but I think Crossover would be more difficult. Also, it appears that 8 has become the de facto standard; is going against that worth it for the (seemingly) minimal benefit?
My guess is it will still be better to search by series & issue number, since the API filter on the cover hash will need an exact match. For newer issues, filtering on the cover hash would most likely work, but for older issues where the cover images aren't that great, I'd have my doubts about how well that would work. Anyway, that's something that could be tested.
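For what an exact-match filter might look like, here's a hedged sketch assuming django-filter is in use on the issue endpoint (the real Metron filter class and field names may differ):

from django_filters import rest_framework as filters

from comicsdb.models.issue import Issue


class IssueFilter(filters.FilterSet):
    # Exact match only; a near-miss hash won't return anything.
    cover_hash = filters.CharFilter(field_name="cover_hash", lookup_expr="exact")

    class Meta:
        model = Issue
        fields = ["cover_hash"]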
I'm pretty much done writing this feature, but still need to make a final decision on the hashing method to use (pHash or aHash). Once that is made, I should be able to push this to production.
My thinking is for when there is no issue number (failed filename parse, etc.). The issue list for the series could be grabbed, including the hashes, and then Hamming distance could be used. That is possible now, but it's obviously wasteful (and rude) to the API/website because all the issue covers for that series would need to be downloaded.
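A rough sketch of that client-side fallback, assuming the series issue-list endpoint returns each issue's stored hash (the cover_hash key and the distance cutoff below are assumptions, not the actual API):

import imagehash
from PIL import Image


def closest_issue(local_cover_path, issues, max_distance=8):
    """Return the issue whose stored pHash is closest to the local cover.

    `issues` is assumed to be a list of dicts like {"number": "12", "cover_hash": "b6d2..."}
    as they might come back from a series issue-list endpoint.
    """
    local_hash = imagehash.phash(Image.open(local_cover_path))
    best, best_dist = None, max_distance + 1
    for issue in issues:
        if not issue.get("cover_hash"):
            continue
        # Subtracting two ImageHash objects gives the Hamming distance.
        dist = local_hash - imagehash.hex_to_hash(issue["cover_hash"])
        if dist < best_dist:
            best, best_dist = issue, dist
    return best if best_dist <= max_distance else None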
Doesn't that primarily happen for things like a TPB or GN that don't have a number one? Regardless, using an issue list for the series seems to be a good solution.
I'm most likely going to merge this tomorrow, since I haven't gotten much feedback regarding the hashing method. Currently, I'm planning to use a pHash, unless I hear a compelling argument before then.
Now your pie-in-the-sky idea makes much more sense to me as an efficient solution. Having a hash created when an image is uploaded shouldn't be terribly difficult to implement (and hashing existing ones would be fairly trivial). A couple of things would need to be decided before implementing, tho: a cover_hash field would need to be added to the Issue model in Django, along with updates to the save() and delete() methods; otherwise we would probably need to create another DB table to hold this data. My initial thought is to hash just the Primary Cover, since it's easier to implement and I wonder just how many digital comics have a Variant as their first image. Anyway, I think this is a pretty good idea and it's probably worth opening a Feature Request Bug on it.
Originally posted by @bpepple in https://github.com/bpepple/metron/discussions/164#discussioncomment-5916373