
Improve EPUB 3 Content Documents chunking #123

Closed · bertfrees closed this issue 5 years ago

bertfrees commented 6 years ago

@rdeltour said (March 13, 2018):

Currently the *-to-epub3 scripts chunk content pretty naïvely (based on top-level sections). We should try to improve that:

bertfrees commented 6 years ago

This is how the chunking is currently implemented in:

This is how html-chunker.xsl currently works:

bertfrees commented 6 years ago

We can probably just enhance the html-chunker step and use it in some more places (e.g. html-to-epub3 and epub3-to-epub3). We should start by wrapping the XSLT in an XProc step. If it's too complicated to implement the improved chunking in XSLT only, we can consider adding some Java.

bertfrees commented 6 years ago

I think it depends on the exact requirements whether we want to keep the main part of the implementation in XSLT, and use Java only to do some size calculations, or whether we want to implement the splitting algorithm in Java and possibly use XSLT to do the actual chunking.

Volume breaking in braille needed to be very advanced and configurable and therefore had to be implemented in Java. HTML chunking probably doesn't need to be that configurable (only a maximum size in kB), but the chunking algorithm might still not be trivial if we need to weigh several variables against each other: preferred break points, maximum size, evenness of chunk sizes, etc.
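
To make the size aspect concrete, here is a minimal sketch (the function and parameter names are invented, and string length is only a crude stand-in for the serialized size in bytes):

<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch: estimate chunk sizes in XSLT; a Java extension
     function could replace the estimate with a real byte count. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="http://example.net/functions"
                version="2.0">

    <!-- Maximum chunk size in kB (illustrative default). -->
    <xsl:param name="max-chunk-size-kb" select="300" as="xs:integer"/>

    <!-- Crude size estimate: character count of the text content,
         expressed in kB. -->
    <xsl:function name="f:approx-size-kb" as="xs:integer">
        <xsl:param name="e" as="element()"/>
        <xsl:sequence select="string-length(string($e)) idiv 1024"/>
    </xsl:function>

    <!-- A section that exceeds the limit would have to be broken up
         further, at the best available break point inside it. -->
    <xsl:function name="f:needs-splitting" as="xs:boolean">
        <xsl:param name="e" as="element()"/>
        <xsl:sequence select="f:approx-size-kb($e) gt $max-chunk-size-kb"/>
    </xsl:function>

</xsl:stylesheet>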

An important question is how aware users will be of the chunking. Will users even notice if a section is split at a random place?

bertfrees commented 6 years ago

I suggest we create a generic px:chunk step that takes a document, a stylesheet URL and some other options and returns a sequence of documents. The stylesheet is specific to the input format and should contain a set of matchers that somehow specify the break point opportunities. The benefit of this added complexity is that we can start out with a simple XSLT implementation of the step and easily move to Java if needed, while still keeping some of the "flexibility" of XSLT through the stylesheet. In theory we could even support CSS, similar to how we do it for volume breaking in braille.

px:chunk-html could then be implemented as a px:chunk call with an HTML stylesheet, followed by a cleanup step that does some wrapping and unwrapping of elements.
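
To give an idea, a break-point stylesheet for HTML could look roughly like the following (illustration only: the element and attribute names are invented, and the actual contract between px:chunk and the stylesheet would still have to be defined):

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustration only: a break-point stylesheet for HTML input that marks
     break point opportunities with an attribute; px:chunk would then decide
     which of the marked points to actually split on. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:html="http://www.w3.org/1999/xhtml"
                version="2.0">

    <!-- Identity transform: copy everything by default. -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Matchers: sectioning elements are allowed break points. -->
    <xsl:template match="html:section|html:article">
        <xsl:copy>
            <xsl:attribute name="chunk-break-before" select="'yes'"/>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>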

@rdeltour @josteinaj thoughts?

bertfrees commented 6 years ago

@josteinaj Any specific requirements from NLB?

josteinaj commented 6 years ago

@bertfrees I'll discuss it a bit internally.

We're working with single-HTML documents as our master format now, and they have (or at least will have) this structure:

<html>
    <head>...</head>
    <body>
        <section><!-- chapter 1 --></section>
        <section><!-- chapter 2 --></section>
        <section><!-- chapter 3 --></section>
    </body>
</html>

My initial thoughts are that each of these top-level sections should become a separate HTML file. At least, that was our intention when using this structure. We might also be interested in splitting on the number of bytes or similar, to prevent overly large files, but it should be possible to disable such behavior as well and only split on top-level section elements, so that we get a predictable output.

It would probably make sense to split the files in a way that preserves the HTML5 structural outline. Not sure what implications that would have, if any.

Our structure (with only section elements allowed as top-level elements) is of course not a generic structure, so for a generic splitter, you'd need additional logic (probably by performing the HTML5 outlining algorithm?). In the future we'll try to force EPUBs "from the wild" to conform to our grammar, and maybe this generic chunking mechanism could help us with wrapping generic content into a series of section elements; we'll see.

id attributes should be preserved when splitting.

We might need to split a SMIL file alongside the HTML file in the future, and preferably also the MP3 files. Splitting the SMIL and MP3 files would probably be separate steps though, and should be relatively straightforward as long as the IDs are preserved.

bertfrees commented 6 years ago

Thanks. I've pushed a first version so you get an idea of how it works.

josteinaj commented 6 years ago

Neat. So we would just provide our own html-chunker-break-points.xsl with our own f:is-chunk?

bertfrees commented 6 years ago

That's right.
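
For illustration only, such an override could look roughly like this (the f:is-chunk name is taken from this thread, but the namespace and exact signature below are assumptions, so check the actual html-chunker-break-points.xsl):

<?xml version="1.0" encoding="UTF-8"?>
<!-- Rough idea only: the namespace and exact signature of f:is-chunk
     are assumptions; check the real html-chunker-break-points.xsl. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="http://example.net/functions"
                xmlns:html="http://www.w3.org/1999/xhtml"
                version="2.0">

    <!-- Only top-level sections become chunks (one file per chapter),
         which keeps the output predictable for the NLB master format. -->
    <xsl:function name="f:is-chunk" as="xs:boolean">
        <xsl:param name="e" as="element()"/>
        <xsl:sequence select="exists($e/self::html:section[parent::html:body])"/>
    </xsl:function>

</xsl:stylesheet>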

bertfrees commented 5 years ago

See PR: https://github.com/daisy/pipeline-scripts/pull/149