ampproject / amp-toolbox-php

AMP Optimizer PHP library
Apache License 2.0
73 stars 24 forks source link

Port HtmlParser over from amphtml #272

Open schlessera opened 3 years ago

schlessera commented 3 years ago

The validator cannot be built on top of the Dom\Document we are using for the sanitizer and optimizer, as it requires precise line/column/length coordinates for pinpointing validation issues in the source files.

The NodeJS validator uses a SAX parser to traverse the HTML, with the actual validation engine being a handler that gets triggered by the SAX events, i.e. startTag(), endTag(), ...

After looking at existing HTML SAX parsers in PHP, my conclusion is to port over the parser implementation from NodeJS instead of reusing an existing PHP HTML SAX parser, for the following reasons:

westonruter commented 3 years ago

What about using Masterminds? https://github.com/Masterminds/html5-php

Cf. https://github.com/ampproject/amp-wp/issues/3293

schlessera commented 3 years ago

As already discussed in the meeting, Masterminds HtmlParser doesn't fulfil the condition for solving the original issue https://github.com/ampproject/amp-wp/issues/3293, and I'd like to avoid pulling it in as a dependency just for the SAX parser, knowing that there's no guarantee that the parsing will result in the exact same tokenization than the one from the NodeJS library. Given the porting work is not huge, avoiding this dependency seems preferable to me.