anansi-project / comicinfo

ComicInfo.xml's new home
https://anansi-project.github.io/docs/category/comicinfo
MIT License

New element: Metadata_ID #50

Open killo3967 opened 1 year ago

killo3967 commented 1 year ago

Where does this come from?

When scraping metadata, we can find the data on several different websites. Currently ComicRack puts this information in "tags" and in custom fields that are stored in its database rather than in the XML. Calibre has a field named "ID" for this purpose, and it can hold several values from different sources.

What is the rationale for adding support for this element?

If I scrape a comic, I could have metadata from ComicVine, Bedetheque, Amazon, etc., and the same goes for manga. I would like to record the source ID for each site.

Is the element already handled by any application or tool?

Like I said before, Calibre has it.

gotson commented 1 year ago

Isn't that the purpose of the Web element to store the url of where the metadata came from?

lordwelch commented 1 year ago

At least for ComicTagger, the Web element refers to the user-facing webpage and may not contain the ID. ComicVine happens to also store the ID in the URL, but that is purely an accident. Mylar, for example, typically parses the ID from the notes field.

killo3967 commented 1 year ago

I forgot to mention one of the main IDs, the ISBN. This and the others could be useful for finding duplicates easily and with high reliability.

majora2007 commented 1 year ago

I'm in favor of an addition like this. A simple way to achieve this could be:

<MetadataId>anilist:32346,cv:dko35235,mal:45345</MetadataId>

This stays within the scheme used by the existing fields and allows free-form input, so the schema doesn't need to add something new every time someone wants source X. It could even change to shorthand_source(url):id, but I think Web handles that already.
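A minimal sketch of how a reader might split such a value, assuming the comma-separated source:id form from the example above. The function name parse_metadata_ids is hypothetical and MetadataId is only a proposal here, not part of the spec.

```python
def parse_metadata_ids(text):
    """Split 'source:id,source:id' into a list of (source, id) pairs."""
    pairs = []
    for entry in text.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and blank entries
        # Split on the first colon only, so an ID may itself contain colons.
        source, sep, ident = entry.partition(":")
        if not sep or not source or not ident:
            raise ValueError(f"malformed entry: {entry!r}")
        pairs.append((source.strip().lower(), ident.strip()))
    return pairs


ids = parse_metadata_ids("anilist:32346,cv:dko35235,mal:45345")
print(ids)
```

One limitation of this form: an ID that contains a comma cannot be represented without some escaping rule.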

gotson commented 1 year ago

I forgot to mention one of the main IDs, the ISBN. This and the others could be useful for finding duplicates easily and with high reliability.

This is already handled by the new GTIN element.

killo3967 commented 1 year ago

I have read the discussion about GTIN and I have seen some problems:

There are old comics that don't even have an ISBN.

That is why I propose having one or several fields, depending on the site that catalogs the comic, containing the ID or the URL of the page that has information about it.

And as I made clear with the field name "Metadata ID", this is where the information about where I obtained the metadata would be stored.

ajslater commented 1 year ago

There are a few concepts here which I think may be useful for this discussion:

Online Metadata Database Identification Numbers:

Trade Identification Numbers: 

The ComicInfo.xml Web Field

This field is meant for URLs, but the most common URLs are ComicVine URLs, which are derivable from the CVDB number, e.g.: https://comicvine.gamespot.com/arbitrary-slug/4000-1234567 The "Web" field is from ComicRack and according to the compatibility guidelines there should be only one entry.
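As a rough illustration of going the other way (URL to ID), the sketch below pulls the trailing number out of a URL shaped like the example. It assumes the last path segment is always of the form prefix-id, which is inferred from that single example and, as noted earlier in the thread, is not something the Web element guarantees. The function name is hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: the last path segment is "<prefix>-<id>", as in the example URL.
_TRAILING_ID = re.compile(r"^(\d+)-(\d+)$")


def comicvine_id_from_url(url):
    """Return (prefix, id) from the last path segment, or None if it is absent."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return None
    match = _TRAILING_ID.match(segments[-1])
    return match.groups() if match else None


result = comicvine_id_from_url(
    "https://comicvine.gamespot.com/arbitrary-slug/4000-1234567"
)
print(result)
```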

The ComicInfo.xml GTIN Field

This field is new from the Anansi Project. I think it is currently limited to one entry by the spec.

Proposed Resolutions

Ideal

In an ideal world there might be an <Identifier type="CVDB">1234567890</Identifier> field that could have any number of entries and be used for ComicVine numbers, GTIN, ISBN, ASIN, etc. As many as you like. And your client software could easily derive web links for each of them as the respective url formats are all simple.
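To make the idea concrete, here is a small sketch of reading repeated Identifier elements with Python's standard xml.etree.ElementTree. The Identifier element and its type attribute are only the proposal from this comment, not part of the current ComicInfo schema, and the sample document and function name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical ComicInfo.xml fragment using the proposed element.
SAMPLE = """<ComicInfo>
  <Title>Example</Title>
  <Identifier type="CVDB">1234567890</Identifier>
  <Identifier type="ISBN">9783161484100</Identifier>
  <Identifier type="ASIN">BC09876</Identifier>
</ComicInfo>"""


def read_identifiers(xml_text):
    """Group the text of every <Identifier> child by its 'type' attribute."""
    root = ET.fromstring(xml_text)
    found = defaultdict(list)
    for element in root.findall("Identifier"):
        id_type = (element.get("type") or "").strip().upper()
        value = (element.text or "").strip()
        if id_type and value:  # skip entries missing a type or a value
            found[id_type].append(value)
    return dict(found)


identifiers = read_identifiers(SAMPLE)
print(identifiers)
```

Because each type maps to a list, the same reader would cope with several identifiers of one type without a schema change.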

Simple

We already have the GTIN field. I'd suggest altering the ComicInfo.xml spec to formally allow more than one GTIN entry and overloading it to handle the general concept of an identifier. If you wanted to get fancy you could add a type attribute, but people are already encoding type with alphanumeric prefixes, so <GTIN>CVDB1234567890</GTIN> is fine and easy to decode by both humans and software.

gotson commented 1 year ago

<GTIN>CVDB1234567890</GTIN> this is not a valid GTIN though.

In Readium WebPub Manifest, identifiers are URIs and use the URN format, like urn:isbn:9783161484100.
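For illustration, a sketch of splitting such a URN into its namespace and value, plus the standard ISBN-13 check-digit test. The function names are hypothetical, and this is deliberately simpler than a full URN parser.

```python
def parse_urn(urn):
    """Split 'urn:<namespace>:<value>' into (namespace, value)."""
    # Pad the split result so short inputs fail the check below cleanly.
    scheme, namespace, value = (urn.split(":", 2) + ["", ""])[:3]
    if scheme.lower() != "urn" or not namespace or not value:
        raise ValueError(f"not a URN: {urn!r}")
    return namespace.lower(), value


def is_valid_isbn13(value):
    """Check the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    digits = value.replace("-", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


namespace, value = parse_urn("urn:isbn:9783161484100")
print(namespace, value, is_valid_isbn13(value))
```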

ajslater commented 1 year ago

That's true. I was suggesting abusing the GTIN field for other means, which is not ideal.

I'm glad you mentioned Readium using the URN format. I was unaware of it. I'll use it for a multi-format metadata reader/writer I'm working on.

ajslater commented 1 year ago

Because I see Notes fields that look like:

Tagged by Comictagger 1.3.1 on 1970-01-01T12:12:00 [Issue ID 1234567890]  [CMXDB45678] [CVDB1234567890] [ASINBC09876]

I'm fairly convinced there should be:

  1. An Identifier field of some sort that supports multiple possible metadata IDs. It feels to me like GTIN is a subset of all possible identifiers, and also a useful superset of most trade identifiers. But having GTIN be for trade identifiers and another field for metadata IDs would also work.
  2. A <Tagger /> field that tells you what program and version wrote the metadata. Analogous to ComicBookInfo "appID" JSON. In PDF this field is called <pdf:Producer/>.
  3. An <UpdatedAt /> or "lastModified" field like ComicBookInfo JSON has. Not entirely necessary because filesystems also have timestamps in inodes, but this is specific to the tagging action.

Only (1.) is relevant to this discussion. But since people are forcing it into Notes already it seems like it would be used. For Codex's own internal metadata database I'm going to be parsing the Notes field myself for this information.
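A sketch of the kind of Notes parsing described above, run against the sample string from this comment. The prefix list contains only the prefixes visible in that sample. A fixed list is needed because a token such as ASINBC09876 cannot be split into prefix and ID by pattern alone. The function name is hypothetical, and this is not Codex's actual parser.

```python
import re

# Prefixes seen in the sample Notes string; a real tagger may emit others.
KNOWN_PREFIXES = ("CVDB", "CMXDB", "ASIN")

# Any run of non-bracket characters enclosed in square brackets.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def identifiers_from_notes(notes):
    """Return (kind, id) pairs for every recognised bracketed token."""
    found = []
    for token in BRACKETED.findall(notes):
        token = token.strip()
        if token.startswith("Issue ID "):
            found.append(("Issue ID", token[len("Issue ID "):].strip()))
            continue
        for prefix in KNOWN_PREFIXES:
            if token.startswith(prefix) and len(token) > len(prefix):
                found.append((prefix, token[len(prefix):]))
                break  # unknown tokens are silently ignored
    return found


NOTES = (
    "Tagged by Comictagger 1.3.1 on 1970-01-01T12:12:00 "
    "[Issue ID 1234567890]  [CMXDB45678] [CVDB1234567890] [ASINBC09876]"
)
parsed = identifiers_from_notes(NOTES)
print(parsed)
```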

killo3967 commented 9 months ago

I propose that GTIN be a calculated identifier. For example, you can calculate the pHash (perceptual hash) of each image in the comic, add up the values for all the images, and use the result as the GTIN. Since computing the hash converts each image to grayscale and reduces it to an 8x8 binary matrix, the calculation is quite fast. Info about pHash: https://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html Python library: https://pypi.org/project/ImageHash/

I am using this to compare the images of the comics and it gives a reliability of 99.9%.

lordwelch commented 9 months ago

@killo3967 GTIN is a specific type of identifier with its own spec (https://www.gs1.org/standards/id-keys/gtin); we are not going to abuse it for that purpose.

As for using a perceptual hash as a Metadata ID, that's not a great idea: the hash will change by a non-trivial amount if you simply choose a different language to calculate it in, and even with the same library you can get different results.

For WebP, Pillow (which ImageHash uses) relies on libwebp, which has at least three different ways to load an image into RGB (Pillow always loads into RGB). The primary way it uses ends up being the "fancy upsampling" method, which is the most complicated and produces decidedly different hashes from any of the other methods in libwebp, let alone another library. To top it all off, WebP stores its color in YUV format, where the 'Y' channel alone, without 'UV', is already the grayscale image needed for the hash. So if we are only calculating hashes we don't even need to do a "conversion", and this hash also differs from the ones obtained by loading the image into RGB. Guess what? Resizing the image before the grayscale conversion versus after it also results in a different hash. Most of these hashes are within whatever arbitrary Hamming distance we might pick, but they are not the same hash, and for an ID it is unacceptable for them to differ when dealing with the exact same set of bytes.

As a comparison of cover images it works well; as a method to search for a comic it also works OK; as an identifier for a comic, not great.
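The difference between "close" and "equal" can be shown with a toy example. The sketch below uses a plain average hash (a simpler relative of pHash, not what ImageHash's phash computes) on two hard-coded 8x8 grayscale grids that stand in for the same image decoded by two different loaders. The grids and function names are made up for illustration.

```python
def average_hash(pixels):
    """Return a 64-bit average hash for an 8x8 grid of grayscale values."""
    flat = [value for row in pixels for value in row]
    if len(flat) != 64:
        raise ValueError("expected an 8x8 grid")
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        # One bit per pixel: 1 if brighter than the mean, else 0.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# "Decoder A": a smooth gradient from 0 to 252.
decoder_a = [[(row * 8 + col) * 4 for col in range(8)] for row in range(8)]
# "Decoder B": the same image with one pixel near the mean off by 8 levels.
decoder_b = [row[:] for row in decoder_a]
decoder_b[3][7] += 8  # 124 -> 132, which crosses the mean

hash_a = average_hash(decoder_a)
hash_b = average_hash(decoder_b)
print(f"{hash_a:016x} {hash_b:016x} distance={hamming_distance(hash_a, hash_b)}")
```

The two hashes differ in a single bit, which is fine for similarity search with a distance threshold but fails an exact-match lookup, which is what an identifier needs.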

hammerandtongs commented 7 months ago

Proposed Resolutions

Ideal

In an ideal world there might be an <Identifier type="CVDB">1234567890</Identifier> field that could have any number of entries and be used for ComicVine numbers, GTIN, ISBN, ASIN, etc. As many as you like. And your client software could easily derive web links for each of them as the respective url formats are all simple.

This seems like the best choice overall.

I'd really like to see this added to anansi as I'd like to use the spec as a lightweight metadata added to cbz with a primary goal being matching to multiple online (and local in my case) sources for more elaborate metadata (ie stay lightweight locally).

majora2007 commented 7 months ago

I'm in support of something like <Identifier type="CVDB">1234567890</Identifier> as @ajslater brought up.

killo3967 commented 3 months ago

Proposed Resolutions

Ideal

In an ideal world there might be an <Identifier type="CVDB">1234567890</Identifier> field that could have any number of entries and be used for ComicVine numbers, GTIN, ISBN, ASIN, etc. As many as you like. And your client software could easily derive web links for each of them as the respective url formats are all simple.

This seems like the best choice overall.

I'd really like to see this added to anansi as I'd like to use the spec as a lightweight metadata added to cbz with a primary goal being matching to multiple online (and local in my case) sources for more elaborate metadata (ie stay lightweight locally).

This sounds like the best solution to me. I agree with your idea.

killo3967 commented 1 week ago

Proposed Resolutions

Ideal

In an ideal world there might be an <Identifier type="CVDB">1234567890</Identifier> field that could have any number of entries and be used for ComicVine numbers, GTIN, ISBN, ASIN, etc. As many as you like. And your client software could easily derive web links for each of them as the respective url formats are all simple.

This seems like the best choice overall. I'd really like to see this added to anansi as I'd like to use the spec as a lightweight metadata added to cbz with a primary goal being matching to multiple online (and local in my case) sources for more elaborate metadata (ie stay lightweight locally).

This sounds like the best solution to me. I agree with your idea.

I have thought about the solution again, and I think that GTIN could be a multi-value field holding data from different sources.
For example:

url -> From ComicVine
url -> From Bedetheque
url -> From isbnsearch.org or other
url -> From Tebeosfera
url -> From Amazon/Comixology
etc...

Why this:

  1. Because no single site has all the information about all comics.
  2. Because not everyone is an English speaker. There are a lot of comics in other languages.
  3. Because not everyone uses CVDB as their main scraping source. There are a lot of other websites with different IDs.
  4. Because Calibre has done this for years and it works fine for all users.

For example, one book could have these IDs: isbn:9788490623527 barnesnoble:w/2010-arthur-c-clarke/1111814622 google:jtQeBAAAQBAJ
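A sketch of how the Calibre-style identifier string in the example could be read and used to spot likely duplicates, assuming whitespace-separated scheme:value tokens. The function names and the second record are hypothetical, and this does not use Calibre's own API.

```python
def parse_identifiers(text):
    """Turn 'scheme:value scheme:value' into a {scheme: value} dict."""
    identifiers = {}
    for token in text.split():
        # Split on the first colon only; values may contain '/' or ':'.
        scheme, sep, value = token.partition(":")
        if sep and scheme and value:
            identifiers[scheme.lower()] = value
    return identifiers


def shared_identifiers(a, b):
    """Return the identifiers that two records have in common."""
    return {scheme: value for scheme, value in a.items() if b.get(scheme) == value}


book = parse_identifiers(
    "isbn:9788490623527 barnesnoble:w/2010-arthur-c-clarke/1111814622 google:jtQeBAAAQBAJ"
)
# Hypothetical second record scraped from another source.
other = parse_identifiers("isbn:9788490623527 google:differentId")

print(book)
print(shared_identifiers(book, other))
```

Two records sharing any identifier value, here the ISBN, are candidates for being the same book even when their other IDs differ.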