dino- / epub-tools

Command line utilities for working with epub files
ISC License

epubmeta doesn't export some metadata #19

Open kingjon3377 opened 3 years ago

kingjon3377 commented 3 years ago

Even with -v, the output of epubmeta path/to/ebook.epub does not include some pieces of metadata.

For my particular use case (identifying which ebooks downloaded from AO3 are in a series), I find that the information I want is in <meta> tags whose name attribute is in the calibre namespace.

For example, the OPF for the EPUB version of this story includes this element:

<meta name="calibre:series" content="New Hope"/>

So when I run epubmeta on it without -e (and ideally without -v), I would like to see a line like series: New Hope
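To make the request concrete, here is a minimal sketch in Python (not the Haskell that epub-tools is written in, and not how epubmeta works internally) that reads name/content pairs from <meta> elements under <metadata>. The trimmed OPF string and the function name name_content_metas are made up for illustration.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"

# Trimmed, hypothetical OPF document containing the element quoted above.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Story</dc:title>
    <meta name="calibre:series" content="New Hope"/>
  </metadata>
</package>"""


def name_content_metas(opf_xml):
    """Return (name, content) pairs for EPUB2-style <meta> elements in <metadata>."""
    root = ET.fromstring(opf_xml)
    metadata = root.find(f"{{{OPF_NS}}}metadata")
    if metadata is None:
        return []
    pairs = []
    for meta in metadata.findall(f"{{{OPF_NS}}}meta"):
        name = meta.get("name")
        content = meta.get("content")
        # Skip <meta> elements that do not use the name/content form.
        if name is not None and content is not None:
            pairs.append((name, content))
    return pairs


for name, content in name_content_metas(SAMPLE_OPF):
    # Drop a vendor prefix such as "calibre:" for display.
    print(f"{name.split(':', 1)[-1]}: {content}")  # prints "series: New Hope"
```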

dino- commented 3 years ago

These epub tools are designed for working with EPUB metadata that sits inside the <metadata> section of an EPUB's OPF package document.

EPUB metadata is defined by these schemas:

EPUB2: http://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm
EPUB3: http://www.idpf.org/epub/30/spec/epub30-publications.html

The tag you're describing, <meta name="calibre:series">, is not part of the specification of these book file formats. It's something added by the third-party book management tool Calibre and isn't recognized by IDPF as part of the EPUB standard. I don't know for sure, but I'm guessing that Calibre has done this to an EPUB2 book, which has no way to store this type of series information.

The right thing to do here (although I find it unlikely it will be done en masse to existing EPUB2s) is for books to be converted to the EPUB3 format. This format has a very different and flexible system that allows several title tags to be defined and marked up with refinements. For example, one refinement is called title-type and could be set to collection, which is like the case you're describing (a series of books that belong together).

If you're curious, here's the relevant specification for the title tagging for EPUB3: http://idpf.org/epub/30/spec/epub30-publications.html#sec-opf-dctitle

To be honest, my software isn't even parsing this level of complexity out of EPUB3 because hardly anyone was using it back when I was putting EPUB3 support in. It would be great if all this and more were added to epub-tools some day.

There are tools out there for converting from EPUB2 to EPUB3. I wonder if, when Calibre is given an EPUB3 file, it will fill in the correct tags for books in a series.

In the meantime, I'd say this issue should stay open because this all definitely needs better EPUB3 support.

kingjon3377 commented 3 years ago

That's a reasonable perspective to take. However, a quick scan through the EPUB2 spec finds this passage:

One or more optional instances of a meta element, analogous to the XHTML 1.1 meta element but applicable to the publication as a whole, may be placed within the metadata element or within the deprecated x-metadata element. This allows content providers to express arbitrary metadata beyond the data described by the Dublin Core specification.

The <meta name="calibre:series"> tag is within the <metadata> tag.

(It looks like the Archive Of Our Own is using Calibre tooling to generate downloadable ebooks on demand, as the one I pointed to has this in its NCX: <meta content="calibre (3.39.1)" name="dtb:generator"/>.)

(And FWIW, when Calibre converts this EPUB to an EPUB3, the series information is still represented using <meta tags within the <metadata> element, but now using a form that's not calibre: namespaced and that more closely resembles what's shown in the EPUB3 spec:

    <meta property="belongs-to-collection" id="id-3">New Hope</meta>
    <meta refines="#id-3" property="collection-type">series</meta>
    <meta refines="#id-3" property="group-position">2</meta>

But still not included in the default output of epubmeta.)
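As a rough sketch of how those three EPUB3 elements relate, the Python below groups each belongs-to-collection meta with the metas that refine it through the refines="#id" attribute. It is only an illustration: the function name collections and the trimmed package wrapper are my own, and this is not how epubmeta or the epub-metadata library handles them.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"

# Trimmed, hypothetical EPUB3 package wrapping the three elements shown above.
SAMPLE_OPF3 = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata>
    <meta property="belongs-to-collection" id="id-3">New Hope</meta>
    <meta refines="#id-3" property="collection-type">series</meta>
    <meta refines="#id-3" property="group-position">2</meta>
  </metadata>
</package>"""


def collections(opf_xml):
    """Return one dict per belongs-to-collection meta, with its refinements merged in."""
    root = ET.fromstring(opf_xml)
    metas = root.findall(f"{{{OPF_NS}}}metadata/{{{OPF_NS}}}meta")
    result = []
    for meta in metas:
        if meta.get("property") != "belongs-to-collection":
            continue
        entry = {"name": (meta.text or "").strip()}
        meta_id = meta.get("id")
        if meta_id:
            # A refinement points back at its subject with refines="#<id>".
            for other in metas:
                prop = other.get("property")
                if prop and other.get("refines") == f"#{meta_id}":
                    entry[prop] = (other.text or "").strip()
        result.append(entry)
    return result


print(collections(SAMPLE_OPF3))
```

With the sample above, the single collection comes back with its name plus the collection-type and group-position values, which is enough to print a series line.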

Basically, what I expected was for the default output of epubmeta to include some representation of everything under the <metadata> element that I would see if I used epubmeta -e. Special handling of common-extension cases like series name and position (e.g. turning calibre:series Series Name and calibre:series_index 1 into series: Series Name #1) might be nice to have, but certainly isn't necessary so long as information from <meta> tags isn't simply ignored.
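The special handling I have in mind could look something like this Python sketch. The function name series_line and the dict-of-metas input are hypothetical, and the index normalization assumes the index may be written either as a whole number or with a trailing ".0".

```python
def series_line(metas):
    """Format calibre series metadata as 'series: NAME #INDEX', or None if no series."""
    name = metas.get("calibre:series")
    if not name:
        return None
    index = metas.get("calibre:series_index")
    if index is None:
        return f"series: {name}"
    # Assumption: the index may be written as "1" or "1.0"; show whole numbers plainly.
    try:
        number = float(index)
        index = str(int(number)) if number.is_integer() else str(number)
    except ValueError:
        pass  # keep unparseable values verbatim
    return f"series: {name} #{index}"


print(series_line({"calibre:series": "Series Name", "calibre:series_index": "1"}))
# prints "series: Series Name #1"
```

Any <meta> entries without special handling would still be printed as plain name: content lines, so nothing is dropped.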

dino- commented 3 years ago

OK, you've convinced me. I think this should be done at some point. I should also fill out more of the EPUB3 refinements that are in the spec. This requires support in the epub-metadata library, so it now depends on https://github.com/dino-/epub-metadata/issues/12