Closed pkra closed 3 years ago
Example with linkedom
const {DOMParser, parseHTML} = require('linkedom');
let recurseTheDom = {};
// for npx usage
try {
recurseTheDom = require(require.resolve('ams-xml-to-html')+'/lib/recurseTheDom');
} catch (e) {
recurseTheDom = require('./lib/recurseTheDom');
}
const setHead = (xmldoc, htmldoc) => {
// TODO not in xslt not for articles but added by ams-html; change after switch to JS
// add viewport meta tag
const viewportmeta = htmldoc.createElement('meta');
viewportmeta.setAttribute('name', 'viewport');
viewportmeta.setAttribute('content', 'width=device-width');
const isBook = xmldoc.firstElementChild.tagName === 'book'; // TODO extract into property or function?
if (isBook) htmldoc.head.insertAdjacentElement('afterbegin', viewportmeta);
// add charset meta tag
const charset = htmldoc.createElement('meta');
// TODO switch to modern charset and remove xslt matching code below
// charset.setAttribute('charset', 'utf-8');
// matches xslt
charset.setAttribute('http-equiv', 'Content-Type');
charset.setAttribute('content', 'text/html; charset=utf-8');
htmldoc.querySelector('head').insertAdjacentElement('afterbegin', charset);
// set title
const xmlTitle =
xmldoc.querySelector('front>article-meta>title-group>alt-title') ||
xmldoc.querySelector(
'book-meta>book-title-group>book-title, front>article-meta>title-group>article-title'
);
htmldoc.title = xmlTitle ? xmlTitle.textContent : 'AMS Publication';
const root = xmldoc.querySelector('article, book');
const lang = root.getAttribute('xml:lang') || 'en';
htmldoc.querySelector('html').setAttribute('lang', lang);
};
const xml2html = xmlstring => {
const xml = (new DOMParser).parseFromString(xmlstring, 'text/xml' ).defaultView;
const xmldoc = xml.window.document;
const html = (new DOMParser).parseFromString('<!doctype html> <html lang="en"><head><title></title></head><body></body> </html>', 'text/html').defaultView;
const htmldoc = html.window.document;
setHead(xmldoc, htmldoc);
const root = xmldoc.querySelector('book, article');
recurseTheDom(htmldoc)(htmldoc.body, root);
return html;
};
module.exports = xml2html;
if (require.main === module) {
const fs = require('fs');
const path = require('path');
const inputFileName = process.argv[2];
const outputFileName = process.argv[3];
const inputString = fs.readFileSync(path.resolve(inputFileName));
const dom = xml2html(inputString);
fs.writeFileSync(
path.resolve(outputFileName),
'<!DOCTYPE html>\n' + dom.document.toString()
);
}
This passes all but 11 tests, most of which are about subtitle handling (the rest are false positives due to change in attribute order at serialization).
While digging a bit further, I found this https://github.com/WebReflection/linkedom/issues/75. Since scope already had issues with jsdom, we don't have as many cases; it's easy to workaround in any case.
Refactoring the three remaining uses of node.querySelector(:scope > bla)
to node.children.find(n => n.tagname === 'bla')
gets all but two tests to pass, with the remaining two being false positives (changed attribute order in serialized html).
Trying to take this further, integration into ams-html was relatively straight forward but we run into several smaller issues; it's difficult to say if we run into any blockers eventually but it seems worth more investigation.
After some more work, I've got all ams-html tests to pass using linkedom. All but one change would be minor changes that will work with jsdom as well (the one change would be about serialization where linkedom has a different API, i.e., a 1 line change).
Memory usage seems to have improved.
Somewhat related: https://github.com/AmerMathSoc/ams-eqn-store/issues/48
We should also check if we can switch to linkedom there.
Memory usage seems to have improved.
To put some numbers to this: regenerating jams897 from scratch works within the default 2GB memory limit. That's pretty great.
Random note. I've finally identified an issue in linkedom https://github.com/WebReflection/linkedom/issues/83 which I found since insertBefore crashed here:
However, that second loop can really just be deleted - time has told that I was overeager. First, there's really no reason to believe that the "main" ref-list would not be the first ref-list. Second, we have only had one paper (jams410, i.e. an old one) with multiple references (though also spec100) which has the main ref-list as first ref-list.
At the same time, a couple of interesting validation issue occur. The parser use by linkedom does not sanitize nested links and it doesn't magically fix some of what jsdom apparently fixed, e.g., bproc20 has graphics without pt
units in its dimensions.
Overall, very promising. I'm waiting to see how the developer reacts to see if I want to risk the switch (possibly I'll try to make a PR to fix the insertBefore bug). But so far the necessary changes would be extremely minor, while keeping it easy to roll back to jsdom (or switch to puppeteer).
Random thought: ams-bookhtml requires xhtml output. While jsdom can be told at serialization time, I'm not sure about linkedom.
In other news, https://github.com/WebReflection/linkedom/issues/83 has been fixed. I also like how the developer is pushing back on some PRs that add DOM features they feel don't add anything to their approach.
Random thought: ams-bookhtml requires xhtml output. While jsdom can be told at serialization time, I'm not sure about linkedom.
The following seems to give the desired (xhtml) results
const { DOMParser } = require('linkedom');
const dom = (new DOMParser).parseFromString(`<!DOCTYPE html>
<html>
<body></body>
</html>`, 'text/xml').defaultView;
const document = dom.window.document;
document.querySelector('body').appendChild(document.createElement('img'));
console.log(document.querySelector('html').outerHTML); // includes <img / >
From another test of ams-html (using bproc), visual regressions were clean. However, there's an odd issue with some papers getting pre-load links for an image with empty href value.
From another test of ams-html (using bproc), visual regressions were clean. However, there's an odd issue with some papers getting pre-load links for an image with empty href value.
This seems to be finding a bug -- before, the href value was "undefined" :face_with_head_bandage:
Random thought: ams-bookhtml requires xhtml output. While jsdom can be told at serialization time, I'm not sure about linkedom.
The following seems to give the desired (xhtml) results
const { DOMParser } = require('linkedom'); const dom = (new DOMParser).parseFromString(`<!DOCTYPE html> <html> <body></body> </html>`, 'text/xml').defaultView; const document = dom.window.document; document.querySelector('body').appendChild(document.createElement('img')); console.log(document.querySelector('html').outerHTML); // includes <img / >
True but our tests then largely fail. In part because of some missing APIs in linkedom for XML (e.g., document.head/title). But mostly it seems to be due to to tagnames being case sensitive in HTML (and our tests assume HTML style allcaps tagnames).
in ams-eqnStore, eqnInject would need to workaround the fact that insertAdjacentHTML is not available in linkedom when parsing as xml (or xhtml).
I've run into quite a bit of trouble with ams-bookhtml using linkedom's XML parsing. So much so, that I think I might just keep html parsing and manually fixing void HTML elements.
Just some notes from bookhtml
not being serialized as  
I've run into quite a bit of trouble with ams-bookhtml using linkedom's XML parsing. So much so, that I think I might just keep html parsing and manually fixing void HTML elements.
Trying this out went fairly smoothly. Using changes much like we'd need for ams-html, and then hacking xhtml serialization (temporarily very badly using regexp to close void elements, camelCasing some SVG attribute names, fixing value-less hidden attributes, replacing NBSPs - but the only real regexp would be fixing img tags).
There's still some weird epubcheck error about entities in doctypes that I haven't tracked down yet but this looks like a saner path. (Even if we run into trouble, we could e.g., fallback to jsdom to just parse and re-serialize our html as xhtml).
There's still some weird epubcheck error about entities in doctypes that I haven't tracked down yet but this looks like a saner path. (Even if we run into trouble, we could e.g., fallback to jsdom to just parse and re-serialize our html as xhtml).
This was an older bug that I never spotted. I had added an invalid SVG file
Testing on books, we run into validation issues. stml68 has a fig inside a p which leads to xml-to-html to create a figure inside a p - which is invalid. Either jsdom automagically fixed this in the DOM or at serialization/re-parsing, in any case we now run into this. (SImilarly, there's a neste xref leading to nested a's but that seems like a bug in texml output).
FWIW, stml68 regenerates without visual changes (and only the two minor validation issues).
amstext51 regenerated without any issues (neither validation nor visual changes)
gsm019 needs 3GB memory (sigh - but then again much less than we used to). It also identified another html-vs-xhtml issue (data-eqn-tag attributes need <
and >
escaping in xhtml but we don't actually use these attributes for ams-bookhtml, only ams-html).
But no visual regressions.
spec100 regenerated without visual changes but a new html vs xhtml issue -- escaping & in href values is necessary in xhtml.
car36 regenerated without visual changes or any other issue.
gsm203 also regenerated without visual change or any other issue.
MCL18 regenerated fine with linkedom (though it's still on MathJax v3; I did check the visual changes but haven't pushed the result.)
To recap, I think we're ready to take the leap and go with linkedom. But I'd like to do a F2F pre-mortem before we do it.
To recap yesterday's F2F discussion.
node.querySelector(':scope > ...')
by [...node.children].filter(node => ...)
- scope selectors are buggy in jsdom, too so not a big loss. We are in a good place to proceed to switch from jsdom to linkedom. The risks we have identified are manageable and we have several good options to deal with failure both short and long term.
I think we can close this. The investigation has been completed.
This is really a "whole toolchain" issue as jsdom is used throughout the toolchain. But we have to start here anyway.
With the recent move to MathJax v3, we're experiencing memory issues - we generally have to bump node's available memory to 4GB but even then some papers (!) will run out of memory. jsdom is known to be memory intensive and especially bad at allowing for garbage collection (cf. https://github.com/AmerMathSoc/ams-bookhtml/issues/131).
Some alternative are: