Generated XHTML fails (a few) validity checks

nxg commented 3 years ago

The XHTML generated by LaTeXML passes a large fraction of the checks provided by the (very thorough) W3C EPUB validator, but not all of them. The attached stylesheet (see internal comments for notes and rationale) normalises the XHTML so that it passes these checks.

The stylesheet also reworks footnotes into a form which is EPUB-friendly and which, more generally, is the format recommended by the DAISY accessibility consortium, and I think also implicitly recommended by the XHTML specification (though it's hard to pin down a specific location within the (X)HTML(5) sprawl).

The failures are:

1. <object> elements with an @alt attribute (the fallback content should be element-content).
2. Both <tbody> and <tr> as children of <table>.
3. A <meta> content-type indicating XHTML rather than HTML.

The third one is a specifically EPUB issue; the first two are XHTML validity issues. Though this is using an EPUB validator – because in my particular case I'm generating XHTML en route to EPUB – I'm reporting XHTML validity issues which will very probably be relevant for the xhtml output as well. The validity errors are also reported by Emacs nxml-mode, so, again, this isn't an EPUB-specific issue.
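
A minimal sketch of how the three failures listed above could be detected before running a full validator. It uses only Python's standard library; the function name `find_validity_problems` and the sample document are made up for illustration, and this is not how the attached stylesheet or the validators work internally.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def q(tag):
    """Return the namespace-qualified name ElementTree uses for an XHTML tag."""
    return f"{{{XHTML_NS}}}{tag}"


def find_validity_problems(xhtml_text):
    """Return a list of short messages, one per problem found."""
    root = ET.fromstring(xhtml_text)
    problems = []

    # 1. <object> should not carry @alt; fallback belongs in element content.
    for obj in root.iter(q("object")):
        if "alt" in obj.attrib:
            problems.append("object has alt attribute")

    # 2. A <table> should not mix <tbody> and <tr> as direct children.
    for table in root.iter(q("table")):
        child_tags = {child.tag for child in table}
        if q("tbody") in child_tags and q("tr") in child_tags:
            problems.append("table mixes tbody and tr children")

    # 3. A <meta http-equiv="content-type"> is the EPUB-specific failure.
    for meta in root.iter(q("meta")):
        if meta.get("http-equiv", "").lower() == "content-type":
            problems.append("meta declares content-type")

    return problems


# Hypothetical sample that exhibits all three problems.
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8"/>
    <title>t</title>
  </head>
  <body>
    <object data="fig.svg" alt="A figure"/>
    <table><tbody><tr><td>a</td></tr></tbody><tr><td>b</td></tr></table>
  </body>
</html>"""

for message in find_validity_problems(sample):
    print(message)
```

This only looks for the three patterns named in the report; it is no substitute for the EPUB validator or nxml-mode's schema checking.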

Stylesheet: sanitise-xhtml.xslt.gz

xworld21 commented 3 years ago

For reference, I looked at the <meta> issue; I think it comes from the living HTML5 standard [1]. If I understand correctly what it says, content-type is not allowed when using the XML syntax of HTML5, which is what the EPUB spec calls 'XHTML' [2].

~It also looks like the EPUB3 stylesheet is supposed to override the <meta> tag of the webpage stylesheet, but for some reason you are getting the webpage behaviour. Does that point at a bug in the latexmlc code @dginev? In any case, if my reading of the spec is correct, the EPUB3 stylesheet is also non-compliant.~ Edit: if I recall correctly, @nxg is generating the EPUB manually via the XHTML stylesheet, not via latexmlc. It would explain the incorrect behaviour.

I guess that the proper fix is to use USE_NAMESPACES and USE_HTML5 to detect if the output is an HTML5 document in XML syntax, and omit the <meta> tag accordingly (plus remove the EPUB3 override which seems redundant). I am already opening some PRs tweaking the stylesheets, I might as well test this solution and see what the authors think.
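
The decision proposed above can be written down as a tiny predicate. This is only a paraphrase of the suggestion in Python, not the stylesheet code: the two parameters stand in for the USE_NAMESPACES and USE_HTML5 flags, and the assumption that "both set" means "HTML5 in XML syntax" is taken from the comment itself.

```python
def should_emit_content_type_meta(use_namespaces, use_html5):
    """Decide whether to write <meta http-equiv="content-type">.

    Assumption: output with both flags set is HTML5 in XML syntax,
    where (per the reading of the spec above) this <meta> is not allowed.
    """
    is_html5_xml_syntax = use_namespaces and use_html5
    return not is_html5_xml_syntax


# Show the decision for every combination of the two flags.
for use_namespaces in (False, True):
    for use_html5 in (False, True):
        print(use_namespaces, use_html5,
              should_emit_content_type_meta(use_namespaces, use_html5))
```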

[1] https://html.spec.whatwg.org/multipage/semantics.html#attr-meta-http-equiv-content-type
[2] https://www.w3.org/publishing/epub32/epub-contentdocs.html#sec-xhtml-conf-content

nxg commented 3 years ago

I have a couple of XSLT scripts which fix up the output produced by v0.8.5, and which generate EPUB OPF and NAV files from that XHTML, which are such that I can get a no-warnings score from the DAISY validator (thus warranting a WCAG 2.0 AA conformance claim). This is using latexml+latexmlpost to generate the XHTML, rather than the EPUB mode of latexmlc (I've nothing against latexmlc, but the former is the toolchain I've been working with; see also issue #1441). One or two of the things I fix up have, I see from above, been already addressed by other commits.

Would it be useful to post the current versions of these scripts here? I doubt they would be directly useful in LaTeXML code, but might be useful for reference. Or would it be useful to create a feature request 'produce WCAG-conforming EPUB output'?

dginev commented 3 years ago

I think both would be at least partially helpful. If we can reuse/get inspired by your upgrades to have latexml start producing WCAG-conforming epub, that's great. And then if we had a way to test that the epub generation continues to be conformant, that's also great, so that we don't lose the newly added upgrade.

And yes, thank you for all the reports and issues, much appreciated - it is often the case that we end up polishing mostly the bits we use ourselves (HTML5), so it's good to get someone with a keen eye on epub to stress-test what is generated.

brucemiller commented 3 years ago

Seconding what @dginev said: It would be nice to see the scripts you've used. We'll probably want to adapt and incorporate them: We certainly would prefer to produce valid ePub out of the box (while still allowing customization). And of course, still be able to produce valid, natural xhtml --- When I pondered this Issue before, I was slightly stuck on whether a separate ePub flag was needed in the XSLT, or whether it could be finessed from other flags, as @xworld21 suggested.

nxg commented 3 years ago

Done! I've attached an archive containing the kit which I've circulated to one or two colleagues: latex-epub-kit-2021-08-12.tar.gz; it has had very minimal usage by anyone other than me. See the README for details. This works with LaTeXML 0.8.5, and is probably very version-specific.

The XHTML here is purely the output of latexmlpost --format=xhtml with very minor fixups (they're arguably buglets in that serialiser anyway, with the exception of the way that footnotes are handled, which appears to be one of the approved HTML5 mechanisms for footnotes rather than anything EPUB-specific). That is, I don't think that a separate EPUB mode or flag is needed.

Here, the metadata that appears in the OPF file is obtained from an XMP file which is generated as step one. I found this a more reliable way of getting such information from the source to the OPF than trying to smuggle it through the .lxml and .xhtml files. And this means that the XMP file can be incorporated into the OPF metadata, which seems like a Good Thing. There's a significant FIXME in the extract-opf.xslt file, which reflects my uncertainty about some of the details of the relevant schema.org metadata properties.

The NAV information in the EPUB is scavenged from the structure of the generated XHTML files, so is probably quite sensitive to the details of that serialisation. That obviously wouldn't be an issue if you were generating both at the same time.
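
A rough sketch of the "scavenging" idea, in Python rather than the XSLT used in the kit. It assumes, purely for illustration, that each XHTML file's title is its first `<h1>`; the real LaTeXML serialisation may well need a different lookup, and `build_nav` is a made-up name.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
EPUB_NS = "http://www.idpf.org/2007/ops"


def build_nav(chapters):
    """Build an EPUB nav document from (filename, xhtml_text) pairs.

    Assumption: each chapter's title is its first <h1>; if there is
    none, the filename is used as the link text instead.
    """
    ET.register_namespace("", XHTML_NS)
    ET.register_namespace("epub", EPUB_NS)

    html = ET.Element(f"{{{XHTML_NS}}}html")
    head = ET.SubElement(html, f"{{{XHTML_NS}}}head")
    ET.SubElement(head, f"{{{XHTML_NS}}}title").text = "Contents"
    body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
    nav = ET.SubElement(body, f"{{{XHTML_NS}}}nav",
                        {f"{{{EPUB_NS}}}type": "toc"})
    ol = ET.SubElement(nav, f"{{{XHTML_NS}}}ol")

    for filename, xhtml_text in chapters:
        root = ET.fromstring(xhtml_text)
        heading = root.find(f".//{{{XHTML_NS}}}h1")
        if heading is not None:
            # itertext() flattens any inline markup inside the heading.
            title = "".join(heading.itertext()).strip()
        else:
            title = filename
        li = ET.SubElement(ol, f"{{{XHTML_NS}}}li")
        ET.SubElement(li, f"{{{XHTML_NS}}}a", {"href": filename}).text = title

    return ET.tostring(html, encoding="unicode")


# Hypothetical one-chapter input.
chapter = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
           '<h1>Intro <em>one</em></h1></body></html>')
print(build_nav([("ch1.xhtml", chapter)]))
```

The sensitivity mentioned above shows up in the `<h1>` lookup: any change in how headings are serialised would need a matching change here.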

I hope this is useful.

nxg commented 3 years ago

I should add – as a postscript, and in case it's not obvious from what I've written above – that I think the current LaTeXML XHTML output is very close to what's required for EPUB.

The sanitisation steps are merely addressing buglets (or nearly so), and the NAV script is generating something it's very simple for LaTeXML to do (and may do already, in the latexmlc variant). The changes or additions here are probably things it would be reasonable to include for any XHTML output, not just that intended for conversion to EPUB.

The OPF step is probably the step that would require (and that required for me) a certain quantum of thought. You of course will have thought through some of these points when producing OPF files already. As specific recommendations:

I think it would be useful to use XMP here. As well as allowing authoritative metadata to come direct from the author to the OPF file, it's probably a good thing to have in an EPUB anyway.
I suspect it'll be necessary to devise a way of having the document author provide detailed configuration of at least some of the information here. I don't think all of the accessibility metadata is automatically derivable from the document: the fact that there's MathML in the document is of course knowable, but not, for example, the nature of the images (my document that prompted this work had images in it that were semantic – diagrams – alongside images that were not – callout decoration or formatting, and these various image purposes are indistinguishable from each other in code). As noted in the extract-opf.xslt file, it would probably be technically convenient for this sort of metadata to sit outside the main metadata, beside the XMP, and be linked to with an <opf:link> element.

Both of these points, as well as the need for the author to provide some document unique ID, such as a UUID, would seem to me to point towards some sort of simple document.epub.metadata file, managed by the document author.

I'll also point out that the DAISY conformance checker includes a command-line tool which can be invoked at the end of a toolchain to verify that all the checks have passed. And parenthetically (and I'm sure gratuitously), I'll note that although I'm here slightly obsessing over the accessibility features of EPUB, that's far from the only function of the EPUB format.

As a final observation, I'll note that although my document ticks all of the DAISY boxes, and has had a ‘looks OK to me’ from my institution's accessibility coordinator, I haven't yet managed to get any feedback from an actual student user dependent on the assistive elements of the materials. I hope to nag the relevant office to help me out there before long, but to repurpose Knuth's famous remark: ‘Beware of bugs in the above code; I have only proved it correct, not tried it.’

dginev commented 3 years ago

As I'm reading the detailed comments of @nxg above, let me go on the record that you've triggered some red alerts to my own developer preferences.

In particular, the mention of XMP strikes me as a completely unnecessary complication of our current software stack. I tend to be the dev who keeps harping on that we need more simplicity in latexml, both user-facing and in the implementation internals. The risk is really high to make the project unmaintainable due to compounding effects on complexity, unless we keep that in check.

So, while I will consult the code you've written, I will do my best to avoid any novelty introductions to the epub or xhtml generation for as long as I can. Ideally we make some of the current code more reusable across latexml (e.g. manifest-related bookkeeping), and factor out enough until someone can jump into and read + contribute a PR in a matter of minutes.

Just to prepare the expectations of PRs to come for this issue from my side. I've been quite fond of the lightweight approach taken by @xworld21 's PRs, will try to keep things in that spirit.

dginev commented 3 years ago

Much gratitude goes to @xworld21 who has basically solved this issue entirely on his own effort. Very impressive!

I would appreciate an example from @nxg on the issue of both <tbody> and <tr> as children of the same table parent, since that's not immediately obvious to generate. Bruce mentioned a quick suspicion that it could be related to complex equation array constructions in latex, is that the source? If so, there's a suspicion that we can't refactor those on short notice as they are quite sophisticated and will need testing.

Hence, most of the issue fixes will land already for 0.8.6 thanks to Vincenzo, and the tabular pieces will have to wait for 0.8.7. Thanks again to everyone for the contributions here!

nxg commented 3 years ago

After a bit of digging, and going back to my source revision where I added my fixup, I don't think I can now generate an example of <tbody> and <tr> as children of <table>. I suspect, therefore, that this is another issue which has evaporated with the change from 0.8.4 (which I was using then) to 0.8.5. I can't easily revert to 0.8.4 to confirm (Nix, which is normally very good for that sort of thing, seems to have had a slightly broken 0.8.4 build, now fixed), so we might reasonably assume that's the case and drop that part of the issue.

You can be sure that I'll be able to detect and report any re-emergence, though!

And yes, it seems likely that this was the result of a complicated equation array, both because there were plenty of these in my source, and because I dimly remember that sort of context when I was looking at this.

nxg commented 3 years ago

I think mention of XMP should have anyone reaching for their red flags, including me. I'm not sure what's written on your red flags, @dginev, but I'll suggest that one or two of them might potentially be lowered. The following is a rather ruminative comment – apologies for its length.

First (and just in case it's not clear) I don't anticipate that anything in my attachment above would make it into LaTeXML – it's there just to illustrate what I did to get to my intended destination.

Second, I know you're familiar with the details here – I'm setting them out this way partly to organise my own thoughts.

I think there are three strands here.

1. Producing correct XHTML. Doing this is a Good Thing independently of whether the XHTML is subsequently bundled into an EPUB. The various patches above, by @dginev and @xworld21, are clearly ticking off the various problems of detail that this strand represents.
2. Producing EPUB from XHTML. Nothing required here beyond generating the OPF and 'nav' files, and zipping them up with the XHTML. Checked by the W3C epubcheck validator, in a binary valid/not-valid way.
3. Producing accessible EPUB. This is a set of best practices which apply to both strand 1 and strand 2, and which are reported on by the DAISY validator. WCAG2 AA conformance is the goal here. Going from basic EPUB to WCAG2 EPUB is potentially tricky, because the latter requires a few declarations such as <opf:meta property='schema:accessModeSufficient'>visual</opf:meta> – which declares that the graphics in the document are contentful, such as graphs, rather than merely decorative. This is something that cannot reasonably be deduced from the LaTeX markup, but has to be explicitly stated, somewhere/somehow, by the document author.
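
The "zipping them up" part of strand 2 can be sketched in a few lines of Python. The one detail worth showing is that the EPUB container expects the `mimetype` entry first and stored uncompressed. The function name `write_epub` and the choice of `content.opf` at the archive root are assumptions made for this sketch, not anything taken from the kit or from latexmlc.

```python
import zipfile

# Points the reading system at the package file; the OPF location
# (content.opf at the archive root) is an assumption of this sketch.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def write_epub(epub_path, files):
    """Zip `files` (a dict of archive name -> bytes) into an EPUB container."""
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The mimetype entry must come first and must not be compressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        for name, data in files.items():
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

A call such as `write_epub("book.epub", {"content.opf": opf_bytes, "nav.xhtml": nav_bytes, "ch1.xhtml": chapter_bytes})` would then produce a file that epubcheck can be pointed at; whether it passes depends entirely on the contents supplied.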

It's really only strand 1 that matches the nominal title of this issue; it occurred to me to create feature requests touching on the others, and I could happily do that if that would be useful.

Strands 1 and 2 are, I think, very loosely coupled, and keeping them decoupled seems useful, in an architectural sense. There is some mild coupling in that the process of producing XHTML may or may not leave behind, or carry through from the source document, the extra information required to make strand 2 easy.

Because the best practices of strand 3 touch on both strands 1 and 2, it provides some mild coupling.

Remark: I'm now certain that I'm going to continue using LaTeXML as my preferred route for generating XHTML+MathML from LaTeX. In practice, however, I'm probably going to continue using my own code to bundle that XHTML into EPUB (for various reasons). Thus for me, the clear blue water between strands 1 and 2 is both natural and valuable. And thus any side-products that make that bundling easy (containing eg TOCs or aggregations of metadata) are of particular interest to me.

So where does XMP come in? Having spent a fair amount of time with XMP and with RDF, I know that XMP is hard to love. It looks ugly, it's fiddly to write, and the XMP spec is in my opinion very poorly written. It does have some advantages, though, which bear on strands 2 and 3.

It appears to be now somewhat standard as a blob of metadata associated with digital objects. The opf:link element mentions xmp as one of its supported properties, and I think this is respected by actual EPUB readers.
Although it's fiddly to write, it's in practice actually quite easy to read in an XSLT context; ie, it's not hard for an XSLT script to extract information from an XMP document.
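
To illustrate the "easy to read" point, here is a sketch in Python (the kit itself uses XSLT) that pulls two fields out of an XMP packet. The sample packet and the function name `read_xmp_fields` are invented for the example, and the sketch only handles properties serialised as child elements; XMP also permits simple properties written as attributes on `rdf:Description`, which this would miss.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_xmp_fields(xmp_text):
    """Return the dc:identifier and first dc:title entry from an XMP packet."""
    root = ET.fromstring(xmp_text)
    identifier = root.findtext(".//dc:identifier", namespaces=NS)
    # dc:title is a language alternative: rdf:Alt holding rdf:li entries.
    title = root.findtext(".//dc:title/rdf:Alt/rdf:li", namespaces=NS)
    return {"identifier": identifier, "title": title}


# Hypothetical packet with a placeholder UUID.
sample_xmp = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:identifier>urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">Example notes</rdf:li></rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

print(read_xmp_fields(sample_xmp))
```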

I don't think that XMP is necessary anywhere in this process. However, when I was doing the XHTML-to-WCAG-EPUB step, it seemed to solve a lot of problems at once.

It's a natural place to put things like the document dc:identifier (typically a UUID). Things put in here can be picked up at the other end of the LaTeX-to-XHTML workflow, without having to be preserved through that workflow.
I want, for example, to have both a nominal document date, and a date of the last repository revision, in the end-result metadata. Stuffing that into the XMP, as part of the workflow, is natural.
Though I haven't done it yet, schema:accessModeSufficient details can go in there too (they're currently hard-wired into my OPF script).
Plus when I'm finished, I can stuff the XMP file into the EPUB metadata, and thus give convenient and standardised access to the same metadata to anyone downstream who's XMP-aware.
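
Once values like these are available at the end of the workflow, turning them into OPF metadata is mechanical. The following Python sketch shows the shape of that step; `build_opf_metadata` and its parameters are hypothetical, the values would in practice come from the XMP (or any other author-managed file), and a real OPF needs more than this fragment (a `dcterms:modified` entry, manifest, spine and so on).

```python
import uuid
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_opf_metadata(identifier, title, language, access_modes_sufficient):
    """Serialise a partial <opf:metadata> element from author-supplied values."""
    ET.register_namespace("opf", OPF_NS)
    ET.register_namespace("dc", DC_NS)

    metadata = ET.Element(f"{{{OPF_NS}}}metadata")
    ident = ET.SubElement(metadata, f"{{{DC_NS}}}identifier", {"id": "pub-id"})
    ident.text = identifier
    ET.SubElement(metadata, f"{{{DC_NS}}}title").text = title
    ET.SubElement(metadata, f"{{{DC_NS}}}language").text = language

    # One <opf:meta> per declared value; these cannot be derived from the
    # LaTeX source, so the author has to state them explicitly.
    for mode in access_modes_sufficient:
        meta = ET.SubElement(metadata, f"{{{OPF_NS}}}meta",
                             {"property": "schema:accessModeSufficient"})
        meta.text = mode

    return ET.tostring(metadata, encoding="unicode")


# A freshly generated UUID stands in for the author's document identifier.
print(build_opf_metadata(f"urn:uuid:{uuid.uuid4()}", "Example notes",
                         "en", ["visual"]))
```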

There are other ways to do each of these things, of course, including for example \LxDocumentID{urn:uuid:3F8BCB...} and similar. When I was developing my own code, I wrote .sty.lxml code to smuggle such information from the source document to the .xhtml, in such a way that I could re-find it at OPF time, before realising that I could sidestep the problem. When I mentioned XMP in the comment above it was to commend the technique to you (to share the ‘aha!’), rather than to say that XMP was crucial as such. When I offered my code for colleagues to use, I of course made sure that they didn't have to see any XMP (that might have caused alarm and fainting fits).

That is, in my eyes, factoring out all of the project metadata into a single XMP-shaped blob, whether it's put there by the author or by another part of the Makefile, is itself a simplification (admittedly, I have spent a fair amount of time with RDF, so have a fairly high pain threshold where that's concerned). I'll also mention that JSON-LD is part of the family of related formats which are potentially convertible to and from XMP, so that (and I haven't thought about this in any detail) it wouldn't be unreasonable to gather metadata as JSON – potentially a more convenient format – and manipulate it that way. Having arrived there, dumping some XMP to stuff into the OPF becomes merely a final party-trick.

All that said (at length), I'm not here to be dogmatic about your design of your code!

dginev commented 3 years ago

At the very least I genuinely thank you for the detailed examination of the question in the comments here @nxg , truly appreciated.

I think both Bruce and I can take some time to carefully consider what kinds of upgrades are worth investing time and maintenance into, and which directions reap the most benefits for effort invested. There are indeed some existing solutions in latexml that can be made to evolve in various directions, and the ePub support has plenty of room to grow in sophistication... Ideally, however, we can get a lot on the generation side with as little technical investment as possible. In my experience metadata-related bits can be kept quite compact most of the time, but as usual the devil is in the details.

brucemiller commented 3 years ago

So, I'm thinking that all the subissues here have been addressed, along with giving us some thoughts for future directions. If you find examples that fail, please open a new issue with a minimal test case and we'll look into it.

Thanks for the report, and ideas!

brucemiller / LaTeXML

Generated XHTML fails (a few) validity checks #1440