daisy / pipeline-scripts

!! NOTE: This project is now part of the pipeline-modules project !! | Script modules for the default DAISY Pipeline 2 distribution.
GNU Lesser General Public License v3.0

"Direct" DTBook to EPUB 3 #101

Open bertfrees opened 7 years ago

bertfrees commented 7 years ago

I.e. without the intermediary ZedAI step.

bertfrees commented 7 years ago

A possibility is to start from Jostein's nlbdev/nordic-epub3-dtbook-migrator project. Basically it consists of a dtbook-to-epub3 part and an epub3-to-dtbook part. We would only pick the dtbook-to-epub3 part. The DTBooks that this script supports currently have to follow a strict schema ("Nordic DTBook"). The EPUBs that this script produces follow a strict schema as well ("Nordic EPUB3"). The idea is of course to make it support "any" DTBook.

The Nordic version of the script could then possibly build on the generic version of the script. One way to approach this would be to split up the Nordic dtbook-to-epub3 into a "nordic-dtbook-to-dtbook" step followed by a more generic "dtbook-to-epub3" step (and maybe an "epub3-to-nordic-epub3"?). Another approach would be to have a "nordic-dtbook-to-html.xsl" extend a more generic "dtbook-to-html.xsl". In either case, it's not clear yet whether the "Nordic" part will need to add much to the generic part.

Some more ideas/remarks/questions:

bertfrees commented 7 years ago

@josteinaj commented:

Several DTBook elements are disallowed due to MTM-specific DTBook rules.

Classes are used when no official epub:type (now role) was available at the time the converter was written. We could've created namespaced epub:types, but there didn't seem to be any real benefit. Reading systems would be very unlikely to adopt NLB's namespaced types, and CSS rules are somewhat easier to write using classes.

The script does use dtbook-load; it's part of the validation step: https://github.com/nlbdev/nordic-epub3-dtbook-migrator/blob/master/src/main/resources/xml/xproc/step/dtbook-validate.step.xpl#L226

bertfrees commented 7 years ago

The Nordic version of the script could then possibly build on the generic version of the script.

In order to illustrate the two approaches I have this simple example:

In the Nordic dtbook-to-epub3, the DTBook element

<span class="answer">...</span>

is converted to the HTML element

<span epub:type="answer">...</span>

Now let's assume that in the generic version we only want to give special meaning to DTBook classes if they have a special form. The DTBook input would then for example have to be:

<span class="epub-answer">...</span>

The generic step would have this XSLT template:

<xsl:template match="dtb:span/@class">
    <xsl:choose>
        <xsl:when test="matches(.,'^epub-')">
            <xsl:attribute name="epub:type" select="replace(.,'^epub-','')"/>
        </xsl:when>
        <xsl:otherwise>
            <xsl:sequence select="."/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

The Nordic version could then either add a preprocessing step that converts class="answer" to class="epub-answer" (approach 1):

<xsl:variable name="special-classes" select="('answer')"/>

<xsl:template match="dtb:span/@class">
    <xsl:choose>
        <xsl:when test=".=$special-classes">
            <xsl:attribute name="class" select="concat('epub-',.)"/>
        </xsl:when>
        <xsl:otherwise>
            <xsl:sequence select="."/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

or it could "xsl:import" the generic XSLT and override the template (approach 2):

<xsl:variable name="special-classes" select="('answer')"/>

<xsl:template match="dtb:span/@class">
    <xsl:choose>
        <xsl:when test=".=$special-classes">
            <xsl:attribute name="epub:type" select="."/>
        </xsl:when>
        <xsl:otherwise>
            <xsl:sequence select="."/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>
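
To compare the two approaches side by side, here is a minimal Python sketch that models the three templates above as plain functions on an element. It is only a model of the XSLT logic, not part of any Pipeline module: the function names are made up for illustration, and the DTBook namespace is left out to keep it short.

```python
import xml.etree.ElementTree as ET

EPUB_NS = "http://www.idpf.org/2007/ops"
EPUB_TYPE = f"{{{EPUB_NS}}}type"   # ElementTree's notation for epub:type
SPECIAL_CLASSES = {"answer"}       # mirrors $special-classes above


def generic_step(span):
    """Generic rule: only classes prefixed with 'epub-' become epub:type."""
    cls = span.get("class")
    if cls is not None and cls.startswith("epub-"):
        del span.attrib["class"]
        span.set(EPUB_TYPE, cls[len("epub-"):])
    return span


def nordic_preprocess(span):
    """Approach 1: rename special classes so the generic rule picks them up."""
    cls = span.get("class")
    if cls in SPECIAL_CLASSES:
        span.set("class", "epub-" + cls)
    return span


def nordic_override(span):
    """Approach 2: a Nordic rule that replaces the generic one."""
    cls = span.get("class")
    if cls in SPECIAL_CLASSES:
        del span.attrib["class"]
        span.set(EPUB_TYPE, cls)
    return span


def approach_1(xml):
    # Preprocessing step followed by the unchanged generic step.
    return generic_step(nordic_preprocess(ET.fromstring(xml)))


def approach_2(xml):
    # The overriding rule is applied instead of the generic one.
    return nordic_override(ET.fromstring(xml))
```

For a Nordic input such as class="answer", both functions give the same result. They differ, with the templates as written above, for a class that already starts with "epub-": approach 1 still passes it through the generic rule, while the override in approach 2 replaces that rule, so the class is copied unchanged unless the fallback branch delegates to the imported template.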

bertfrees commented 7 years ago

Reading systems would be very unlikely to adopt NLB's namespaced types, and CSS rules are somewhat easier to write using classes.

Right, I remember now. Well, reading systems won't do anything useful (and correct!) with random classes either I think. The CSS argument: OK. But if we encounter situations like this in the generic script in the Pipeline we should also think about adoption of standards.

The script does use dtbook-load.

OK, my bad!

bertfrees commented 7 years ago

Update:

The current status is that we have two options: either we adapt the Nordic migrator script, which ingests a strict subset of DTBook at the moment, or we adapt the DTBook to XHTML script from Pipeline 1, which currently outputs XHTML 1.0 (HTML 4, but we need HTML 5 for EPUB 3).

In order to estimate how much adapting work is needed it is important to understand the input restrictions that the Nordic migrator has. This mapping document explains how the conversion works (might not be 100% accurate anymore). It also mentions how the conversion should work, had the script been generic. The Nordic DTBook profile is documented in the markup guidelines (linked from the bottom of this page). There are also some undocumented restrictions on the input format in RelaxNG and Schematron. Some of these may be simply discarded, some may require some work before they can be removed.

Jostein is currently working on porting the Nordic migrator from 1.9 to 1.10.

How much adapting work the Pipeline 1 path would require is not clear yet either. It seems there isn't a mapping available like the Nordic migrator has. This is all the documentation I could find: http://www.daisy.org/projects/pipeline/doc/scripts/DTBookToXhtml.html. Marisa hasn't dug deep yet.

From a quick look at the Pipeline 1 XSLT, Jostein thinks adapting the XSLT from the Nordic migrator would be easier. Marisa is also leaning that way.

Marisa should be able to have another look at this in October after Ace beta comes out.

bertfrees commented 7 years ago

Some more thoughts...

Does it make sense to look at the DAISY 2.02 to EPUB 3 script as a starting point?

After we're done with this new DTBook to EPUB 3 script, should we also consider reimplementing DTBook to ZedAI as DTBook to EPUB 3 + EPUB 3 to ZedAI? Because the idea of piping scripts together to create various other ones is still good, right? And if EPUB 3 is going to be our new "central" format of choice, then having an EPUB 3 to ZedAI script enables us to create various x to ZedAI scripts via EPUB 3. Or is this a stupid question?
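
To make the "central format" idea concrete, here is a small Python sketch, assuming a registry of single-step scripts and a search for the shortest chain between two formats. The registry and the function name are hypothetical, and "epub3-to-zedai" is the script proposed above, not one that is known to exist.

```python
from collections import deque

# Hypothetical registry of single-step scripts: name -> (input, output).
STEPS = {
    "dtbook-to-epub3": ("dtbook", "epub3"),
    "daisy202-to-epub3": ("daisy202", "epub3"),
    "html-to-epub3": ("html", "epub3"),
    "epub3-to-zedai": ("epub3", "zedai"),  # proposed, does not exist yet
}


def find_chain(source, target):
    """Breadth-first search for the shortest chain of scripts from source to target.

    Returns a list of script names, or None if no chain exists.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == target:
            return chain
        for name, (src, dst) in STEPS.items():
            if src == fmt and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [name]))
    return None
```

With this registry, find_chain("dtbook", "zedai") goes through EPUB 3, and adding the single EPUB 3 to ZedAI entry is what makes every "x to EPUB 3" script usable as the first half of an "x to ZedAI" chain.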

josteinaj commented 7 years ago

Note that DAISY 2.02 to EPUB 3 is one of the first scripts that was implemented and I don't think it follows all the conventions of newer scripts, so I don't know how well suited it is for chaining/reuse.

Going from EPUB to ZedAI (or DTBook for that matter) would mean that we need to "normalize" the EPUB in some way according to a stricter schema before converting to ZedAI/DTBook. That would be a useful step for other scripts as well, I've been thinking about implementing something like that for NLB so that we can use commercial EPUBs directly in our production lines (commercial EPUB -> normalized EPUB -> nordic EPUB).

bertfrees commented 7 years ago

Yes. I was thinking about some global normalized format that can be converted to any other format. Not sure if that's what you had in mind?

Here is a drawing:

Not sure it makes sense.

Could you maybe add a drawing next to it if you had something else in mind?

josteinaj commented 6 years ago

Yes that's what I was thinking as well. I don't think we can guarantee a lossless conversion, so we should make sure to give good warnings about things that might cause issues in the reports.
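
As a rough illustration of a lossy normalization step that reports what it changed, here is a Python sketch. The allowed element set and the fallback to span are made-up assumptions, not the Nordic schema or an existing Pipeline step. The point is only that every lossy change produces a warning that could end up in a report.

```python
import xml.etree.ElementTree as ET

# Hypothetical "strict schema": the only element names the normalized format allows.
ALLOWED = {"body", "section", "h1", "p", "span"}


def normalize(root):
    """Rename every disallowed element to <span> and collect one warning per change.

    Returns the (modified) root element and the list of warning strings.
    """
    warnings = []
    for elem in root.iter():
        if elem.tag not in ALLOWED:
            warnings.append(f"<{elem.tag}> is not allowed; replaced with <span>")
            elem.tag = "span"  # lossy: the original element name is gone
    return root, warnings
```

A caller would run normalize on the parsed document and pass the returned warnings on to whatever report the script produces.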

bertfrees commented 5 years ago

Hi @josteinaj

It took a bit longer than planned, but I'm finally going to start with the new dtbook-to-epub3 (which will be based on the nordic-epub3-dtbook-migrator).

This is what I have in mind:

Does that make sense?

josteinaj commented 5 years ago

Sounds good. Note that nordic-dtbook-to-epub3 is just a wrapper around nordic-dtbook-to-html and nordic-html-to-epub3. So "html-to-epub3" could be invoked as part of the conversion. Otherwise, there will essentially be two separate html-to-epub3 conversions: the existing official one exposed as a script, and then the component inside dtbook-to-epub3 which would be based on the nordic migrator.

bertfrees commented 5 years ago

OK thanks. I forgot we already have an html-to-epub3. So I will create a replacement for three modules then: dtbook-to-html, html-to-epub3 and dtbook-to-epub3. I have updated the comment above. Is your html-to-epub3 in any way based on the old html-to-epub3? Do you also perform the HTML upgrade and cleanup steps?

Regarding the two approaches for the extension mechanism (extendable XSLT stylesheets vs. pre- and post-conversions from/to Nordic formats to generic formats), which of the two is most appropriate according to you?

josteinaj commented 5 years ago

The nordic-html-to-epub3 is not (as far as I remember) based on the old html-to-epub3. It mainly performs splitting into separate HTML files based on the navigation document, and bundling it into an EPUB container. There's no HTML upgrade/cleanup (the dtbook-to-epub3 script has some cleanup to handle agency-specific legacy markup though). The single-HTML representation contains epub:type attributes and is as such not really a "pure" HTML representation (in case that matters).

I think I'd prefer pre-/post-conversions over extending the XSLT stylesheets.

bertfrees commented 10 months ago

I looked into the status of this issue.

In 2020, work was started to refactor the DTBook to EPUB 3 part of nordic-epub3-dtbook-migrator. At around the same time, or a bit later, the EPUB 3 to DAISY 3 script was developed, also based on nordic-epub3-dtbook-migrator. This resulted in this huge refactoring PR and this "EPUB to DAISY improvements" PR in pipeline-modules. I also did a refactoring PR in nordic-epub3-dtbook-migrator. The same kind of refactoring that was done in the EPUB 3 to DTBook part of nordic-epub3-dtbook-migrator was planned for the DTBook to EPUB 3 part. I have a lot of local branches and stashes containing work in progress, but after some digging it seems they don't contain any significant work that was not yet merged in one of the above-mentioned PRs.

In 2021, "priority was given to more critical work items", and we never continued to work on the issue.