investigate jsdom alternatives

pkra commented 3 years ago

This is really a "whole toolchain" issue as jsdom is used throughout the toolchain. But we have to start here anyway.

With the recent move to MathJax v3, we're experiencing memory issues - we generally have to bump node's available memory to 4GB but even then some papers (!) will run out of memory. jsdom is known to be memory intensive and especially bad at allowing for garbage collection (cf. https://github.com/AmerMathSoc/ams-bookhtml/issues/131).

Some alternative are:

https://github.com/taoqf/node-html-parser
https://github.com/cheeriojs/cheerio
https://github.com/WebReflection/linkedom/
puppeteer / chrome-headless?

pkra commented 3 years ago

Example with linkedom

const {DOMParser, parseHTML} = require('linkedom');

let recurseTheDom = {};
// for npx usage
try {
  recurseTheDom = require(require.resolve('ams-xml-to-html')+'/lib/recurseTheDom');
} catch (e) {
  recurseTheDom = require('./lib/recurseTheDom');
}

const setHead = (xmldoc, htmldoc) => {
  // TODO not in xslt not for articles but added by ams-html; change after switch to JS
  // add viewport meta tag
  const viewportmeta = htmldoc.createElement('meta');
  viewportmeta.setAttribute('name', 'viewport');
  viewportmeta.setAttribute('content', 'width=device-width');
  const isBook = xmldoc.firstElementChild.tagName === 'book'; // TODO extract into property or function?
  if (isBook) htmldoc.head.insertAdjacentElement('afterbegin', viewportmeta);

  // add charset meta tag
  const charset = htmldoc.createElement('meta');
  // TODO switch to modern charset and remove xslt matching code below
  // charset.setAttribute('charset', 'utf-8');
  // matches xslt
  charset.setAttribute('http-equiv', 'Content-Type');
  charset.setAttribute('content', 'text/html; charset=utf-8');
  htmldoc.querySelector('head').insertAdjacentElement('afterbegin', charset);
  // set title
  const xmlTitle =
    xmldoc.querySelector('front>article-meta>title-group>alt-title') ||
    xmldoc.querySelector(
      'book-meta>book-title-group>book-title, front>article-meta>title-group>article-title'
    );
  htmldoc.title = xmlTitle ? xmlTitle.textContent : 'AMS Publication';

  const root = xmldoc.querySelector('article, book');
  const lang = root.getAttribute('xml:lang') || 'en';
  htmldoc.querySelector('html').setAttribute('lang', lang);
};

const xml2html = xmlstring => {
  const xml = (new DOMParser).parseFromString(xmlstring, 'text/xml' ).defaultView;
  const xmldoc = xml.window.document;

  const html = (new DOMParser).parseFromString('<!doctype html> <html lang="en"><head><title></title></head><body></body> </html>', 'text/html').defaultView;
  const htmldoc = html.window.document;

  setHead(xmldoc, htmldoc);

  const root = xmldoc.querySelector('book, article');
  recurseTheDom(htmldoc)(htmldoc.body, root);

  return html;
};

module.exports = xml2html;

if (require.main === module) {
  const fs = require('fs');
  const path = require('path');

  const inputFileName = process.argv[2];
  const outputFileName = process.argv[3];

  const inputString = fs.readFileSync(path.resolve(inputFileName));
  const dom = xml2html(inputString);
  fs.writeFileSync(
    path.resolve(outputFileName),
    '<!DOCTYPE html>\n' + dom.document.toString()
  );
}

This passes all but 11 tests, most of which are about subtitle handling (the rest are false positives due to change in attribute order at serialization).

no document.head => querySelector

pkra commented 3 years ago

While digging a bit further, I found this https://github.com/WebReflection/linkedom/issues/75. Since scope already had issues with jsdom, we don't have as many cases; it's easy to workaround in any case.

pkra commented 3 years ago

Refactoring the three remaining uses of node.querySelector(:scope > bla) to node.children.find(n => n.tagname === 'bla') gets all but two tests to pass, with the remaining two being false positives (changed attribute order in serialized html).

pkra commented 3 years ago

Trying to take this further, integration into ams-html was relatively straight forward but we run into several smaller issues; it's difficult to say if we run into any blockers eventually but it seems worth more investigation.

pkra commented 3 years ago

After some more work, I've got all ams-html tests to pass using linkedom. All but one change would be minor changes that will work with jsdom as well (the one change would be about serialization where linkedom has a different API, i.e., a 1 line change).

Memory usage seems to have improved.

pkra commented 3 years ago

We should also check if we can switch to linkedom there.

pkra commented 3 years ago

Memory usage seems to have improved.

To put some numbers to this: regenerating jams897 from scratch works within the default 2GB memory limit. That's pretty great.

pkra commented 3 years ago

Random note. I've finally identified an issue in linkedom https://github.com/WebReflection/linkedom/issues/83 which I found since insertBefore crashed here:

https://github.com/AmerMathSoc/ams-html/blob/7abc0bd5fcf6632f16e8082643d322ca9d584003/lib/workers/createRef.js#L11-L19

However, that second loop can really just be deleted - time has told that I was overeager. First, there's really no reason to believe that the "main" ref-list would not be the first ref-list. Second, we have only had one paper (jams410, i.e. an old one) with multiple references (though also spec100) which has the main ref-list as first ref-list.

At the same time, a couple of interesting validation issue occur. The parser use by linkedom does not sanitize nested links and it doesn't magically fix some of what jsdom apparently fixed, e.g., bproc20 has graphics without pt units in its dimensions.

Overall, very promising. I'm waiting to see how the developer reacts to see if I want to risk the switch (possibly I'll try to make a PR to fix the insertBefore bug). But so far the necessary changes would be extremely minor, while keeping it easy to roll back to jsdom (or switch to puppeteer).

pkra commented 3 years ago

Random thought: ams-bookhtml requires xhtml output. While jsdom can be told at serialization time, I'm not sure about linkedom.

pkra commented 3 years ago

In other news, https://github.com/WebReflection/linkedom/issues/83 has been fixed. I also like how the developer is pushing back on some PRs that add DOM features they feel don't add anything to their approach.

pkra commented 3 years ago

Random thought: ams-bookhtml requires xhtml output. While jsdom can be told at serialization time, I'm not sure about linkedom.

The following seems to give the desired (xhtml) results

const { DOMParser } = require('linkedom');
const dom = (new DOMParser).parseFromString(`<!DOCTYPE html>
 <html>
   <body></body>
 </html>`, 'text/xml').defaultView;
const document = dom.window.document;
document.querySelector('body').appendChild(document.createElement('img'));
console.log(document.querySelector('html').outerHTML); // includes <img / >

pkra commented 3 years ago

From another test of ams-html (using bproc), visual regressions were clean. However, there's an odd issue with some papers getting pre-load links for an image with empty href value.

pkra commented 3 years ago

From another test of ams-html (using bproc), visual regressions were clean. However, there's an odd issue with some papers getting pre-load links for an image with empty href value.

This seems to be finding a bug -- before, the href value was "undefined" :face_with_head_bandage:

pkra commented 3 years ago

=>https://github.com/AmerMathSoc/ams-html/issues/674

pkra commented 3 years ago

Random thought: ams-bookhtml requires xhtml output. While jsdom can be told at serialization time, I'm not sure about linkedom.

The following seems to give the desired (xhtml) results
const { DOMParser } = require('linkedom');
const dom = (new DOMParser).parseFromString(`<!DOCTYPE html>
 <html>
   <body></body>
 </html>`, 'text/xml').defaultView;
const document = dom.window.document;
document.querySelector('body').appendChild(document.createElement('img'));
console.log(document.querySelector('html').outerHTML); // includes <img / >

True but our tests then largely fail. In part because of some missing APIs in linkedom for XML (e.g., document.head/title). But mostly it seems to be due to to tagnames being case sensitive in HTML (and our tests assume HTML style allcaps tagnames).

pkra commented 3 years ago

in ams-eqnStore, eqnInject would need to workaround the fact that insertAdjacentHTML is not available in linkedom when parsing as xml (or xhtml).

pkra commented 3 years ago

I've run into quite a bit of trouble with ams-bookhtml using linkedom's XML parsing. So much so, that I think I might just keep html parsing and manually fixing void HTML elements.

pkra commented 3 years ago

Just some notes from bookhtml

ams-xml-to-html would need a switch to change the parser and default html string - tests will fail because tagName returns lowercase for xml (as it should, less certain about xhtml)
ams-eqn-store would need to work around insertAdjacentHTML missing for XML and hidden attributes (no values are invalid xml)
ams-bookhtml needs lots of small changes
- deal with insertAdjacentHTML
- deal with tagName casing
- deal with   not being serialized as  
- deal with inline theorems, rewrite image source, headers parsing issues
- and this is just where I stopped for now.

pkra commented 3 years ago

I've run into quite a bit of trouble with ams-bookhtml using linkedom's XML parsing. So much so, that I think I might just keep html parsing and manually fixing void HTML elements.

Trying this out went fairly smoothly. Using changes much like we'd need for ams-html, and then hacking xhtml serialization (temporarily very badly using regexp to close void elements, camelCasing some SVG attribute names, fixing value-less hidden attributes, replacing NBSPs - but the only real regexp would be fixing img tags).

There's still some weird epubcheck error about entities in doctypes that I haven't tracked down yet but this looks like a saner path. (Even if we run into trouble, we could e.g., fallback to jsdom to just parse and re-serialize our html as xhtml).

pkra commented 3 years ago

There's still some weird epubcheck error about entities in doctypes that I haven't tracked down yet but this looks like a saner path. (Even if we run into trouble, we could e.g., fallback to jsdom to just parse and re-serialize our html as xhtml).

This was an older bug that I never spotted. I had added an invalid SVG file

pkra commented 3 years ago

Testing on books, we run into validation issues. stml68 has a fig inside a p which leads to xml-to-html to create a figure inside a p - which is invalid. Either jsdom automagically fixed this in the DOM or at serialization/re-parsing, in any case we now run into this. (SImilarly, there's a neste xref leading to nested a's but that seems like a bug in texml output).

pkra commented 3 years ago

FWIW, stml68 regenerates without visual changes (and only the two minor validation issues).

pkra commented 3 years ago

amstext51 regenerated without any issues (neither validation nor visual changes)

pkra commented 3 years ago

gsm019 needs 3GB memory (sigh - but then again much less than we used to). It also identified another html-vs-xhtml issue (data-eqn-tag attributes need < and > escaping in xhtml but we don't actually use these attributes for ams-bookhtml, only ams-html).

But no visual regressions.

pkra commented 3 years ago

spec100 regenerated without visual changes but a new html vs xhtml issue -- escaping & in href values is necessary in xhtml.

pkra commented 3 years ago

car36 regenerated without visual changes or any other issue.

pkra commented 3 years ago

gsm203 also regenerated without visual change or any other issue.

pkra commented 3 years ago

MCL18 regenerated fine with linkedom (though it's still on MathJax v3; I did check the visual changes but haven't pushed the result.)

pkra commented 3 years ago

To recap, I think we're ready to take the leap and go with linkedom. But I'd like to do a F2F pre-mortem before we do it.

pkra commented 3 years ago

To recap yesterday's F2F discussion.

migration status

scope of change
- ams-xml-to-html requires a few changes
- replacing node.querySelector(':scope > ...') by [...node.children].filter(node => ...) - scope selectors are buggy in jsdom, too so not a big loss.
- tests -- serialization differences [different attribute order]
  - two tests are comparing outerHTML; we can refactor to check innerHTML and attribute values
refactoring ams-html and ams-bookhtml to allow seamless switch
- i.e., xml-to-html returns a DOM-like object. It will simply be a different DOM implementation but ams-html/ams-bookhtml will not require any changes (beyond the refactoring)
ams-html: refactoring was completed, cf. https://github.com/AmerMathSoc/ams-html/issues/670 (two commits refer to this one)
- replacing :scope selectors with children, fixing some selectors (which accidentally worked in jsdom), switching to modern API names, replacing serialization
ams-bookhtml: refactoring being finalized, cf. https://github.com/AmerMathSoc/ams-bookhtml/issues/188
- much of the same but xhtml serialization is a hack right now
visual regression testing with linkedom implementation completed for bproc, btran as well as gsm019, gsm203,
- found some XML issues which didn't show with jsdom (e.g., https://github.com/AmerMathSoc/texml/issues/104, https://github.com/AmerMathSoc/bproc/issues/63)

to do

xml-to-html: refactor before actual switch, move initial DOM creations into module
ams-bookhtml: more visual regression testing, refactor serialization
run visual regression tests for mcom, more books

pre-mortem

potential points of failure

visual regressions - surprising failures when regenerating all publications
- risk: low - bproc and btran cover a lot of ground, mcom should suffice
validation issues - serialization errors causing difficult or brittle workarounds
- risk: low for journals - we have a lot of content already
- risk: medium for books - xhtml serialization is brittle
API limitations - we might be unable to implement new features for lack of APIs
- risk: low - we use very standard APIs
linkedom stops being maintained
- medium: single developer project, smaller community

response to failure

short term failure
- roll back to jsdom
- consider other DOM(-like) libraries
- consider puppeteer (i.e., working in full chromium)
- massively larger dependency but hopefully faster (C++ DOM) and possibly less memory intensive (TBD)
long term failure
- we may have introduced linkedom-specific hacks making it difficult to switch to another option if we encounter
- counterpoint: this switch was fairly easy and identified unintentional hacks that were nevertheless easy to fix

Conclusion

We are in a good place to proceed to switch from jsdom to linkedom. The risks we have identified are manageable and we have several good options to deal with failure both short and long term.

pkra commented 3 years ago

I think we can close this. The investigation has been completed.

AmerMathSoc / texml-to-html