bertfrees opened 7 years ago
A possibility is to start from Jostein's nlbdev/nordic-epub3-dtbook-migrator project. Basically it consists of a dtbook-to-epub3 part and an epub3-to-dtbook part. We would only pick the dtbook-to-epub3 part. The DTBooks that this script currently supports have to follow a strict schema ("Nordic DTBook"). The EPUBs that the script produces follow a strict schema as well ("Nordic EPUB3"). The idea is of course to make it support "any" DTBook.
The Nordic version of the script could then possibly build on the generic version of the script. One way to approach this would be to split up the Nordic dtbook-to-epub3 into a "nordic-dtbook-to-dtbook" step followed by a more generic "dtbook-to-epub3" step (and maybe an "epub3-to-nordic-epub3"?). Another approach would be to have a "nordic-dtbook-to-html.xsl" extend a more generic "dtbook-to-html.xsl". In either case, it's not clear yet whether the "Nordic" part will need to add much to the generic part.
Some more ideas/remarks/questions:
The current (Nordic) version of the dtbook-to-epub3 script not only does the conversion, but also validates the input (DTBook), the output (EPUB) and the intermediary format (HTML). I'm not sure whether the generic script needs this as well.
The generic script can be made more reusable by moving all non-trivial logic (everything except connecting ports, passing options, etc.) from the top-level step into the substeps.
We could split up dtbook-to-epub3 into a ".load", ".convert" and ".store" part to allow chaining.
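A minimal sketch of what that split could look like, following the Pipeline's usual convention of a thin top-level step delegating to substeps. The step type and port/option names below are invented for illustration only; they are not the actual Pipeline API:

```xml
<!-- Hypothetical sketch only: step and option names are invented to
     illustrate the ".load"/".convert"/".store" split. -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:px="http://www.daisy.org/ns/pipeline/xproc"
                type="px:dtbook-to-epub3" version="1.0">
  <p:option name="source" required="true"/>  <!-- URI of the input DTBook -->
  <p:option name="result" required="true"/>  <!-- URI of the output EPUB -->
  <px:dtbook-to-epub3.load>                  <!-- ".load": resolve the DTBook and its resources into a fileset -->
    <p:with-option name="source" select="$source"/>
  </px:dtbook-to-epub3.load>
  <px:dtbook-to-epub3.convert/>              <!-- ".convert": fileset in, fileset out; the chainable part -->
  <px:dtbook-to-epub3.store>                 <!-- ".store": serialize the EPUB container to disk -->
    <p:with-option name="result" select="$result"/>
  </px:dtbook-to-epub3.store>
</p:declare-step>
```

Another script (e.g. dtbook-to-pef) could then reuse only the ".convert" part and plug in its own load/store steps.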
Why does the script currently not use dtbook-load?
DTBook can have some redundancies, how to handle these? Some examples:

- `imgref` vs. `longdesc`: two ways of linking a `caption` or `prodnote` with an `img`. What to do when they are both used?

  ```xml
  <caption imgref="x">...</caption>
  <img id="x" .../>
  ```

  vs.

  ```xml
  <caption id="x">...</caption>
  <img longdesc="#x" .../>
  ```

- The `depth` attribute on `list`: why does it exist? Just ignore it?
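For the `imgref`/`longdesc` redundancy, one conceivable policy (just a sketch of one option, not what the script does) would be to treat `@imgref` as authoritative and drop a `@longdesc` that duplicates it:

```xml
<!-- Sketch: assumes XSLT 2.0, dtb: bound to the DTBook namespace, and an
     identity transform copying everything else. When a caption already
     points at this img via @imgref, the img's @longdesc is dropped. -->
<xsl:template match="dtb:img[@id = //dtb:caption/@imgref]/@longdesc"/>
```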
NLB makes a lot of use of classes while in some cases it would be better (more semantically correct) to use `epub:type="nordic:..."`. The generic script should encourage the use of `epub:type`.
NLB doesn't use some DTBook elements such as `dfn` but replaces them with `<span class='definition'>`. I'm not sure why they do this (I think Jostein has explained it before but I forgot), but it goes without saying that the generic dtbook-to-epub3 script should assume the input is full regular DTBook. Also, if I understand correctly this is NLB-specific and not Nordic-specific, so in fact there ought to be at least 3 different dtbook-to-pef scripts.
I should also note that there is an SBS branch of nordic-epub3-dtbook-migrator. Because one of the goals of that branch is to make it less "Nordic"-specific, some commits might be useful for making the generic script.
@josteinaj commented:
- Several DTBook elements are disallowed due to MTM-specific DTBook rules.
- Classes are used when no official `epub:type` (now `role`) was available at the time the converter was written. We could've created namespaced `epub:type`s, but there didn't seem to be any real benefit. Reading systems would be very unlikely to adopt NLB's namespaced types, and CSS rules are somewhat easier to write using classes.
- The script does use dtbook-load; it's part of the validation step: https://github.com/nlbdev/nordic-epub3-dtbook-migrator/blob/master/src/main/resources/xml/xproc/step/dtbook-validate.step.xpl#L226
> The Nordic version of the script could then possibly build on the generic version of the script.
In order to illustrate the two approaches I have this simple example:
In the Nordic dtbook-to-epub3, the DTBook element `<span class="answer">...</span>` is converted to the HTML element `<span epub:type="answer">...</span>`. Now let's assume that in the generic version we only want to give special meaning to DTBook classes if they have a special form. The DTBook input would then for example have to be:

```xml
<span class="epub-answer">...</span>
```
The generic step would have this XSLT template:
```xml
<xsl:template match="dtb:span/@class">
  <xsl:choose>
    <xsl:when test="matches(.,'^epub-')">
      <xsl:attribute name="epub:type" select="replace(.,'^epub-','')"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:sequence select="."/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```
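Applied to the input above (and assuming the rest of the `span` conversion is handled by other templates), this would turn `class="epub-answer"` into:

```xml
<span epub:type="answer">...</span>
```

while any class without the `epub-` prefix would be copied through unchanged.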
The Nordic version could then either add a preprocessing step that converts `class="answer"` to `class="epub-answer"` (approach 1):
```xml
<xsl:variable name="special-classes" select="('answer')"/>

<xsl:template match="dtb:span/@class">
  <xsl:choose>
    <xsl:when test=".=$special-classes">
      <xsl:attribute name="class" select="concat('epub-',.)"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:sequence select="."/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```
or it could `xsl:import` the generic XSLT and override the template (approach 2):

```xml
<xsl:variable name="special-classes" select="('answer')"/>

<xsl:template match="dtb:span/@class">
  <xsl:choose>
    <xsl:when test=".=$special-classes">
      <xsl:attribute name="epub:type" select="."/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:sequence select="."/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```
> Reading systems would be very unlikely to adopt NLB's namespaced types, and CSS rules are somewhat easier to write using classes.
Right, I remember now. Well, reading systems won't do anything useful (and correct!) with random classes either, I think. The CSS argument: OK. But if we encounter situations like this in the generic script in the Pipeline, we should also think about adoption of standards.
> The script does use dtbook-load.
OK, my bad!
Update:
The current status is that we have two options: either we adapt the Nordic migrator script, which at the moment ingests a strict subset of DTBook, or we adapt the DTBook to XHTML script from Pipeline 1, which currently outputs XHTML 1.0 (based on HTML 4, while we need HTML 5 for EPUB 3).
In order to estimate how much adaptation work is needed, it is important to understand the input restrictions that the Nordic migrator has. This mapping document explains how the conversion works (it might not be 100% accurate anymore). It also mentions how the conversion should work, had the script been generic. The Nordic DTBook profile is documented in the markup guidelines (linked from the bottom of this page). There are also some undocumented restrictions on the input format in RelaxNG and Schematron. Some of these may simply be discarded; some may require some work before they can be removed.
Jostein is currently working on porting the Nordic migrator from 1.9 to 1.10.
How much adaptation work the Pipeline 1 path would require is not clear yet either. It seems there isn't a mapping document available like the one the Nordic migrator has. This is all the documentation I could find: http://www.daisy.org/projects/pipeline/doc/scripts/DTBookToXhtml.html. Marisa hasn't dug deep yet.
From a quick look at the Pipeline 1 XSLT, Jostein thinks adapting the XSLT from the Nordic migrator would be easier. Marisa is also leaning that way.
Marisa should be able to have another look at this in October after Ace beta comes out.
Some more thoughts...
Does it make sense to look at the DAISY 2.02 to EPUB 3 script as a starting point?
After we're done with this new DTBook to EPUB 3 script, should we also consider reimplementing DTBook to ZedAI as DTBook to EPUB 3 + EPUB 3 to ZedAI? Because the idea of piping scripts together to create various other ones is still good, right? And if EPUB 3 is going to be our new "central" format of choice, then having an EPUB 3 to ZedAI script enables us to create various x-to-ZedAI scripts via EPUB 3. Or is this a stupid question?
Note that DAISY 2.02 to EPUB 3 is one of the first scripts that was implemented and I don't think it follows all the conventions of newer scripts, so I don't know how well suited it is for chaining/reuse.
Going from EPUB to ZedAI (or DTBook for that matter) would mean that we need to "normalize" the EPUB in some way according to a stricter schema before converting to ZedAI/DTBook. That would be a useful step for other scripts as well, I've been thinking about implementing something like that for NLB so that we can use commercial EPUBs directly in our production lines (commercial EPUB -> normalized EPUB -> nordic EPUB).
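Conceptually, that production line would be a chain of two conversions. The step names below are purely illustrative (neither exists under these names):

```xml
<!-- Purely illustrative step names, not actual Pipeline steps. -->
<px:epub3-normalize/>           <!-- commercial EPUB -> EPUB conforming to a stricter schema,
                                     ideally with warnings for constructs it cannot normalize -->
<px:epub3-to-nordic-epub3/>     <!-- normalized EPUB -> Nordic EPUB -->
```

The normalization step is the reusable part: the same step could feed an EPUB-to-ZedAI or EPUB-to-DTBook conversion.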
Yes that's what I was thinking as well. I don't think we can guarantee a lossless conversion, so we should make sure to give good warnings about things that might cause issues in the reports.
Hi @josteinaj
It took a bit longer than planned, but I'm finally going to start with the new dtbook-to-epub3 (which will be based on the nordic-epub3-dtbook-migrator).
This is what I have in mind:
Does that make sense?
Sounds good. Note that nordic-dtbook-to-epub3 is just a wrapper around nordic-dtbook-to-html and nordic-html-to-epub3. So "html-to-epub3" could be invoked as part of the conversion. Otherwise, there will essentially be two separate html-to-epub3 conversions: the existing official one exposed as a script, and then the component inside dtbook-to-epub3 which would be based on the nordic migrator.
OK thanks. I forgot we already have a html-to-epub3. So I will create a replacement for three modules then: dtbook-to-html, html-to-epub3 and dtbook-to-epub3. I have updated the comment above. Is your html-to-epub3 in any way based on the old html-to-epub3? Do you also perform the HTML upgrade and cleanup steps?
Regarding the two approaches for the extension mechanism (extendable XSLT stylesheets vs. pre- and post-conversions from/to Nordic formats to generic formats), which of the two is most appropriate according to you?
The nordic-html-to-epub3 is not (as far as I remember) based on the old html-to-epub3. It mainly performs splitting into separate HTML files based on the navigation document, and bundling it into an EPUB container. There's no HTML upgrade/cleanup (the dtbook-to-epub3 script has some cleanup to handle agency specific legacy markup though). The single-HTML representation contains epub:type attributes and is as such not really a "pure" HTML representation (in case that matters).
I think I'd prefer pre-/post-conversions over extending the XSLT stylesheets.
I looked into the status of this issue.
In 2020, work was started to refactor the DTBook to EPUB 3 part of nordic-epub3-dtbook-migrator. At around the same time, or a bit later, the EPUB 3 to DAISY 3 script was developed, also based on nordic-epub3-dtbook-migrator. This resulted in this huge refactoring PR and this "EPUB to DAISY improvements" PR in pipeline-modules. I also did a refactoring PR in nordic-epub3-dtbook-migrator. The same kind of refactoring that was done in the EPUB 3 to DTBook part of nordic-epub3-dtbook-migrator was planned for the DTBook to EPUB 3 part. I have a lot of local branches and stashes containing work in progress, but after some digging it seems they don't contain any significant work that was not yet merged in one of the above-mentioned PRs.
In 2021, "priority was given to more critical work items", and we never continued to work on the issue.
I.e. without the intermediary ZedAI step.