anansi-project / comicinfo

ComicInfo.xml's new home
https://anansi-project.github.io/docs/category/comicinfo
MIT License

New Element: GTIN #12

Closed shimizurei closed 1 year ago

shimizurei commented 2 years ago

GTIN (Global Trade Item Number) encompasses UPC, EAN, and JAN codes. It is useful for gleaning more metadata from stores or other sources that use it as a unique identifier.

ISBN is under the EAN namespace (same with ISSN if magazines (like Shonen Jump) are to be considered).

gotson commented 2 years ago

I was under the impression that all books have an ISBN, so i would understand the need for an ISBN field, but can you clarify or give context about the other ones, and how they relate to books ?

shimizurei commented 2 years ago

The Jujutsu Kaisen Key Animation Vol. 1 book has a code 4580575583611. It's not a standard ISBN. (According to this site, it's a GTIN-14 code.) This seems to be a common thing with all the books I've bought straight from the animation studio that are never set up for formal print runs outside the limited studio run.

The code for the book IDOLiSH7 KEY ANIMATIONS Vol. 1 is 185159584001. Per the website I used to check if something is an ISBN, it's not possible for this code to be an ISBN. Yet, it's the only unique identifier on the book. It's another limited edition book only published by the anime studio and nowhere else.

Also, I can't say how other countries do it, as I am an American, and I have a limited knowledge of non-American systems, but EANs originate from Europe and JANs are Japanese. I feel like GTIN is all-inclusive.

If you use this site, you can find who owns a certain GTIN if it's not in ISBN format: https://gepir.gs1.org/index.php/search-by-gtin

Has info on ISBNs and ISSNs: https://www.gs1.org/sites/default/files/docs/epc/GS1_EPC_TDS_i1_12.pdf

gotson commented 2 years ago

Thanks for the context and explanations.

What kind of use cases would this field enable exactly?

shimizurei commented 2 years ago

It makes it easier for any consuming application that reads the ComicInfo.xml file to enhance its metadata programmatically, should it choose to do so.

Basically, it's a simple way to glean more info/metadata, with one simple, unique identifier, thereby futureproofing and promoting extensibility to other consuming applications.

gotson commented 2 years ago

That covers quite well the consuming side of things, but who/what would fill that data in ComicInfo.xml in the first place?

shimizurei commented 2 years ago

How is the data usually filled out, other than with ComicTagger or some other application/script?

I've always edited my own files with my own scripts since, like I mentioned, Comic Vine has nothing for me (same with Manga-Tagger). Manga-Tagger relies on AniList and MAL, but those take so long for edited entries to be approved. I actively edit for AL, while I left MAL when it died for a few months about 5-7 years ago, so I know about the editing process. Sometimes my entries will languish for months, since the stuff I'm into isn't popular, until I prod a data mod. Mangaupdates has a very fast approval system since I've been editing there for years now. I haven't created a comicinfo file in a while, but I used to just use the Mangaupdates ID to glean the initial stuff (summary, title, author, artist), then use the ISBN I input to get high quality covers and other metadata.

Also, other less-than-legal sources put GTIN/ISBNs in their file names, so they can read from there. Also, any metadata app that uses barcode scanners.

In your comment here, you stated

The problem i see here is that you are manually editing files, without knowing what those fields mean. Ideally the files should be filled by a program (be it a manga tagger or whatever really) that would do this job for you to figure out what to fill, for example in the case of manga to fill both penciller and inker.

Right now, there is no such option (for my needs at least), so I have to manually create my own comicinfo.xml files. That's why I feel like something like the GTIN/ISBN could be used to provide data from other data sources that Comic Vine or MAL could not via scripting and scraping. That way any consuming application can ingest the properly prepped data file consistently.

gotson commented 2 years ago

I didn't quite get it, but you would be adding the GTIN field manually in the files, is that correct?

shimizurei commented 2 years ago

Yes.

gotson commented 2 years ago

Thanks, that clarifies the workflow part, which seems pretty valid to me.

I didn't go into details into the different fields, but do we need to know in advance whether the content of the field would be a particular code, like EAN or JAN or ISBN ?

Or is it something that is embedded in the GTIN, and given any GTIN, it should be fairly doable (probably via a library of some kind) to determine what kind of code that is ?

shimizurei commented 2 years ago

GTINs are validated using a check digit. I figured whatever consuming app that used them would just use any old database capable of looking up UPCs and ISBNs. (I think Calibre can use barcodes?) There should be an established method of using them in any language since barcodes are so widely used.


Background Info on the GTIN

A GTIN (Global Trade Item Number, pronounced Gee-Tin) is assigned to a product by GS1. This is a code identifying any business unit (consumer unit or unit standard grouping) in an international and unique way and is usually below a UPC barcode symbol.

The GTIN can contain 8, 12, 13 or 14 digits, and can be constructed using four structures, depending on the application:

(Codification GTIN-12 is included in coding GTIN-13 by the addition of a 0 in the first position.)

The last digit of a barcode number is a computed check digit which makes sure the barcode is correctly composed. Here is a check digit calculator. This page shows how to calculate a check digit manually (which would be helpful to a consuming application if it wanted to calculate it vs using the GTIN in a search engine.)
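As a rough illustration of the manual calculation, here is a minimal Python sketch of the standard GS1 check digit scheme (weights of 3 and 1 alternating from the rightmost data digit). The function names `gtin_check_digit` and `is_valid_gtin` are made up for this example and do not come from any existing library.

```python
def gtin_check_digit(payload: str) -> int:
    """Return the GS1 check digit for the data digits (check digit excluded)."""
    if not (payload.isascii() and payload.isdigit()):
        raise ValueError("payload must contain ASCII digits only")
    total = 0
    # Weights alternate 3, 1, 3, 1, ... starting from the rightmost data digit.
    for position, char in enumerate(reversed(payload)):
        weight = 3 if position % 2 == 0 else 1
        total += int(char) * weight
    # The check digit brings the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10


def is_valid_gtin(code: str) -> bool:
    """True if the code has a GTIN length and its last digit is the right check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])
```

With the two codes mentioned earlier in this thread, `is_valid_gtin("4580575583611")` passes, while `is_valid_gtin("185159584001")` fails because the computed check digit is 4 rather than 1.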

From this SO comment, "ISBN-10 can be converted to ISBN-13 which is equivalent to EAN / GTIN-13. Why: ISBN-10 is modulo 11 and as such uses the letter 'X' as a possible check digit to represent the number 10." More info on ISBN-10 to ISBN-13 in this comment.

I hope this clarifies things a bit.


BlobCodes commented 2 years ago

@shimizurei

The Jujutsu Kaisen Key Animation Vol. 1 book has a code 4580575583611. It's not a standard ISBN. (According to this site, it's a GTIN-14 code.)

No, the last digit is a check digit. If you remove it and paste the code into the check digit calculator again, it returns the correct check digit and shows that it is a GTIN-13. The code you have given is not an ISBN but a JAN (reference: in jancode database), so it's an EAN with a Japanese prefix (45 or 49).

The code for the book IDOLiSH7 KEY ANIMATIONS Vol. 1 is 185159584001. Per the website I used to check if something is an ISBN, it's not possible for this code to be an ISBN. Yet, it's the only unique identifier on the book. It's another limited edition book only published by the anime studio and nowhere else.

Inserting this code into the check digit calculator while removing the last digit actually does not return the right digit. This code is not even a GTIN. Searching for this code on https://google.co.jp brought up only two results:

As you can see in the URL, the seller on Rakuten is actually also just suruga-ya.jp. One very noticeable thing is that the first 9 digits of your given code are inside the surugaya URL. Further inspection shows that these codes on surugaya follow a very specific scheme:

I'd say this is probably just a code used internally in the database of surugaya, just like amazon has their ASIN codes and ebay has their ePID codes. It begins with 185 because of the category, has the 6-digit item id 159584 and ends with 001 because that's always appended to create 12-digit codes. This book probably does not have any numerical identifier (which is probably the reason why all other shops selling this product do not have this code on their product pages).

jiquera commented 1 year ago

I'd be in favor of having an ISBN/EAN/GTIN tag. At the moment the way to link a comic book with an actual release is using the web tag. However, this depends on the availability of the item in that particular shop. It would be nice to be able to uniquely identify which version and edition it is.

On the workflow topic: yes, currently it is manual, but the ease of use of a unique identifier is very interesting from a tool perspective, as described above. So I can imagine other workflows surrounding this tag being adopted easily (not by the dead tools, obviously).

From what I understand, EAN technically refers to the barcode format, GTIN refers to a super generic code applicable to any product, and ISBN to a subset of GTIN specifically meant for book-like items. So I'm guessing a GTIN tag would be most appropriate, although ISBN would feel more intuitive to me but might exclude certain items. Note that if you google ISBN 4580575583611, it finds plenty of shops that treat it as an ISBN.

gotson commented 1 year ago

I think we all agree that it would be a nice addition.

I'm wondering what a good implementation would look like. Here are a couple ideas:

  1. <GTIN>...</GTIN> element. Since GTIN is a superset of all other identifiers, the GTIN name would be fairly explicit (instead of something like Identifier). We would also not need any kind of attribute to tell what kind of identifier it is. The consuming applications would need some kind of library to validate GTIN or find out what kind of identifier it is (ISBN for example).
  2. <Identifier type="ISBN">somenumber</Identifier>, as proposed here. The problem i see with this is the type attribute. To be useful, it should be an enum, but listing all possible values would make the schema heavier, and would need adjustments if a new identifier type ever needed to be added. There is also a risk that people would want to stuff other kinds of identifiers in that field (because Identifier is quite generic), for example ComicVine ID, or Mangadex ID.

As such, i would be in favor of Option 1.

@ajslater @majora2007 @lordwelch would you have some comment on this, so hopefully we could close this soon? Thanks

lordwelch commented 1 year ago

I think we should go with option 1. Option 2 needs some more thought on how to implement it and I think it can wait until the target model gets some more traction for that.

There seem to be plenty of libraries available for gtin validation.

For ComicInfo I would recommend only soft validation on the gtin tag in consumers, like a warning on a details page, or on an edit details page, that they are not using it correctly if it is not a valid gtin.
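One possible shape for such soft validation, sketched in Python: read the `GTIN` element proposed in option 1 from a ComicInfo.xml document and return a warning message instead of rejecting the file. The function name `gtin_warning`, the message wording, and the choice to tolerate hyphens are assumptions made for this example, not part of the schema or of any consumer.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def _has_valid_check_digit(code: str) -> bool:
    """GS1 check: weights 3, 1, 3, 1, ... from the rightmost data digit."""
    payload, check = code[:-1], int(code[-1])
    total = sum(
        int(char) * (3 if index % 2 == 0 else 1)
        for index, char in enumerate(reversed(payload))
    )
    return (10 - total % 10) % 10 == check


def gtin_warning(comicinfo_xml: str) -> Optional[str]:
    """Return a warning for a suspicious GTIN value, or None if there is nothing to report."""
    root = ET.fromstring(comicinfo_xml)
    raw = root.findtext("GTIN")
    if raw is None or not raw.strip():
        return None  # No GTIN present: nothing to validate.
    shown = raw.strip()
    code = shown.replace("-", "")  # Assumption: hyphenated ISBNs are tolerated.
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return f"GTIN value {shown!r} is not 8, 12, 13 or 14 digits"
    if not _has_valid_check_digit(code):
        return f"GTIN value {shown!r} has a wrong check digit"
    return None
```

A consumer could show the returned string on a details or edit page and still keep the value as entered, which is what makes the validation soft.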

jiquera commented 1 year ago

Ack, maybe also spend some time writing a decent paragraph of documentation, since the differences between all the identifiers are definitely not known by everyone.

This incidentally goes for several other fields as well: it would be nice to have some intended use examples for every field. Just to help newbies.

majora2007 commented 1 year ago

Is my understanding correct that a GTIN is a separate number from an ISBN, or is it that an ISBN fits inside the GTIN schema?

L-A-Sutherland commented 1 year ago

A 13 digit ISBN is a GTIN that is using the country code 978 or 979. It is possible to convert a 10 digit ISBN into a 13 digit ISBN. You should also be able to convert a 13 digit ISBN into a 10 digit ISBN if it begins with 978.
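The two conversions can be sketched in Python as follows. This is only an illustration: the function names are invented for the example, hyphens are simply stripped, and the check digit of the input is not verified before converting.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Prefix the first nine digits with 978 and recompute the check digit."""
    digits = isbn10.replace("-", "").upper()
    if len(digits) != 10:
        raise ValueError("expected a 10-character ISBN")
    body = "978" + digits[:9]  # The old ISBN-10 check character is dropped.
    if not (body.isascii() and body.isdigit()):
        raise ValueError("the first nine characters must be digits")
    # GTIN-13 weights over the 12 data digits, left to right: 1, 3, 1, 3, ...
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(body))
    return body + str((10 - total % 10) % 10)


def isbn13_to_isbn10(isbn13: str) -> str:
    """Drop the 978 prefix and recompute the modulo-11 check character."""
    digits = isbn13.replace("-", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a 13-digit ISBN")
    if not digits.startswith("978"):
        raise ValueError("only 978-prefixed ISBN-13 values have an ISBN-10 form")
    body = digits[3:12]
    # ISBN-10 weights over the nine data digits, left to right: 10, 9, ..., 2.
    total = sum(int(c) * (10 - i) for i, c in enumerate(body))
    check = (11 - total % 11) % 11
    return body + ("X" if check == 10 else str(check))
```

For example, `isbn10_to_isbn13("0306406152")` gives `"9780306406157"`, and converting that value back returns the original ISBN-10.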

jiquera commented 1 year ago

From the GTIN wiki:

`The GTIN standard has incorporated the International Standard Book Number (ISBN), International Standard Serial Number (ISSN), International Standard Music Number (ISMN), International Article Number (which includes the European Article Number and Japanese Article Number) and some Universal Product Codes (UPCs), into a universal number space.

...

All books and serial publications sold internationally (including those in U.S. stores) have GTIN (GTIN-13) codes. The book codes are either constructed by prefixing the old 10-digit ISBN with 978, and recalculating the trailing check digit, or from 1 January 2007 issued as thirteen digits starting with 978 (eventually 979 as the 978 ranges are used up).`

So an ISBN-13 = GTIN-13 as far as I understand it

majora2007 commented 1 year ago

So then a GTIN can be converted by the consuming application into ISBN or EAN? I'm just wondering if there is any need to specify that it is a EAN or ISBN in the XML or if a GTIN format can be enforced so that consumers can translate them to ISBN or EAN.

gotson commented 1 year ago

An ISBN is a GTIN.

A GTIN is not necessarily an ISBN.

There's no translation needed, but consuming applications would need to check what kind of identifier the GTIN is (if processing is required, for display only it shouldn't be necessary). There are libraries that validate an ISBN for example (using the check digit and the length), which one could use with the GTIN value.

L-A-Sutherland commented 1 year ago

So then a GTIN can be converted by the consuming application into ISBN or EAN? I'm just wondering if there is any need to specify that it is a EAN or ISBN in the XML or if a GTIN format can be enforced so that consumers can translate them to ISBN or EAN.

The consuming application can identify if a GTIN is an ISBN by checking whether the first 3 digits of the GTIN (the country code) are equal to "978" or "979". It would be redundant to specify that information in the XML.
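A prefix check along those lines might look like the Python sketch below. The function name `classify_gtin` and the returned labels are invented for this example. The 978/979 and 45/49 prefixes come from this thread; the 977 (ISSN) and 979-0 (ISMN) cases are additions based on the GS1 prefix allocations rather than on anything stated here, so treat them as assumptions. The sketch only looks at length and prefix and does not verify the check digit.

```python
def classify_gtin(code: str) -> str:
    """Guess the kind of identifier from its length and leading digits."""
    if not (code.isascii() and code.isdigit()):
        return "unknown"
    if len(code) in (8, 12, 14):
        return f"GTIN-{len(code)}"
    if len(code) != 13:
        return "unknown"
    # 13-digit codes: inspect the prefix, most specific first.
    if code.startswith("9790"):
        return "ISMN"  # Assumption: 979-0 is reserved for printed music.
    if code.startswith(("978", "979")):
        return "ISBN"
    if code.startswith("977"):
        return "ISSN"  # Assumption: 977 is the serial publication prefix.
    if code.startswith(("45", "49")):
        return "JAN"
    return "EAN"
```

Since the kind can be derived from the value this cheaply, a consumer that needs it can compute it on demand instead of relying on an attribute in the XML.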

gotson commented 1 year ago

Now that I think of it, should the schema allow for multiple GTINs?

I would think yes, but happy to hear about what you people think.

majora2007 commented 1 year ago

This is what I was getting at. Why not GTIN type="ISBN" and the others. This makes it super easy on consumers and super flexible for taggers.

L-A-Sutherland commented 1 year ago

I don't think the schema needs to support a comic book having multiple GTINs. When would a single book have more than one? Are there any known examples?


I can see how having a type attribute would save consumers the complexity of including code to evaluate the GTIN type. I think taggers should calculate the type in that case. What would knowing the exact type of a GTIN be used for in a consumer application?

majora2007 commented 1 year ago

My only assumption for why GTINs are used in consuming application is for metadata fetching. I can't see why a reading server like Kavita or Komga would care if it's ISBN, GTIN-13, or GTIN-14. So by defining what exactly the GTIN is, the metadata program has much less overhead.

L-A-Sutherland commented 1 year ago

In that case I am leaning towards not storing the type as an attribute. If it's only being used as part of metadata fetching, then I don't think the overhead would be significant.

gotson commented 1 year ago

IMO there's no point storing the type, because from a GTIN you can easily tell what subtype it is. Having an attribute would bring confusion, as you could have incorrect data like type="ISBN" while the value is actually a UPC code. The schema itself cannot enforce that kind of check.

If you want to do metadata fetching, you would use the value in the GTIN field, simple as that. If a metadata source has different fields for different identifiers, you could either try them all, or find out what kind of GTIN you have and try only the correct fields.

gotson commented 1 year ago

I don't think the schema needs to support a comic book having multiple GTINs. When would a single book have more than one? Are there any known examples?

I can't think of any known example either, so we could go for a max occurrence of 1 in the schema.

gotson commented 1 year ago

This is what I was getting at. Why not GTIN type="ISBN" and the others. This makes it super easy on consumers and super flexible for taggers.

For taggers it's more complex, because they need to know what kind of type to put. And having an enumeration for the type attribute will only add overhead.

It's much simpler for taggers to put the value in the GTIN field, and be done with it.

For consumers, as you mentioned, most would only display it without particular treatment, in which case you don't care about the type.

For Komga i plan to do the same thing i do for the ISBN field in the epub metadata, which is to validate the value through an ISBN validation library, and if it's a valid ISBN, store it in Komga's metadata model. If invalid, ignore it. With a GTIN i can use the same logic, and that would only keep valid ISBNs, without having to care about what GTIN type it is.

gotson commented 1 year ago

I have created #39, will wait a few days for any other review/comment on the PR itself before merging