Open schlessera opened 3 years ago
What about using Masterminds? https://github.com/Masterminds/html5-php
As already discussed in the meeting, Masterminds HtmlParser doesn't fulfil the condition for solving the original issue https://github.com/ampproject/amp-wp/issues/3293, and I'd like to avoid pulling it in as a dependency just for the SAX parser, knowing that there's no guarantee that the parsing will result in the exact same tokenization than the one from the NodeJS library. Given the porting work is not huge, avoiding this dependency seems preferable to me.
The validator cannot be built on top of the
Dom\Document
we are using for the sanitizer and optimizer, as it requires precise line/column/length coordinates for pinpointing validation issues in the source files.The NodeJS validator uses a SAX parser to traverse the HTML, with the actual validation engine being a handler that gets triggered by the SAX events, i.e.
startTag()
,endTag()
, ...After looking at existing HTML SAX parsers in PHP, my conclusion is to port over the parser implementation from NodeJS instead of reusing an existing PHP HTML SAX parser, for the following reasons: