ampproject / amp-toolbox-php

AMP Optimizer PHP library
Apache License 2.0
74 stars 23 forks source link

Uncaught `DomException` in `Document::moveInvalidHeadNodesToBody` #522

Closed swissspidy closed 2 years ago

swissspidy commented 2 years ago

When trying to parse https://chuseok.org/places-to-visit-during-chuseok-mid-autumn-summer/ with Document::fromHtml(), I get an uncaught DomException:

PHP Fatal error:  Uncaught DOMException: Not Found Error in /vendor/ampproject/amp-toolbox/src/Dom/Document.php:553

This is the offending code:

https://github.com/ampproject/amp-toolbox-php/blob/89cd1f6bad3512fb8a08aae5fb6aedc64e988392/src/Dom/Document.php#L655-L669

PHPStorm warns about accessing $node potentially causing a null pointer exception too.

This fixes the bug for me:

-while ($node) {
+while ($node !== null) {
ediamin commented 2 years ago

@swissspidy I could not reproduce the error. Could you please post a snippet that could raise the error? I tried the following code,

        $request = new CurlRemoteGetRequest();
        $response = $request->get('https://chuseok.org/places-to-visit-during-chuseok-mid-autumn-summer/');

        $document = Document::fromHtml($response->getBody());
        $document->saveHTML();
swissspidy commented 2 years ago

Oh, I forgot that in our code base we remove everything after the closing </head> before parsing, for performance reasons (we don't need the body).

Basically:

$html = $response->getBody();
$html_head_end = stripos( $html, '</head>' );
if ( $html_head_end ) {
$html = substr( $html, 0, $html_head_end );
}

So I guess that explains it. There's no body.

ediamin commented 2 years ago

@swissspidy In your html, the closing head tag is missing. I guess you can avoid the error by including the closing head tag. Despite of that, I've sent a PR to handle this kind of scenario. I hope it'll fix your error.

swissspidy commented 2 years ago

Ah I see!

Well, #524 looks good but apparently it doesn't fully resolve this issue in our case. I forgot that we also only fetch the first 150 KB of the document to work on that (i.e. wp_remote_get( $url, [ 'limit_response_size' => 153600 ] )).

When trying to parse that, there can still be a DOMException in moveInvalidHeadNodesToBody unfortunately.

Apparently not even my original while ($node !== null) { fix helps with that. Not sure why I didn't catch this before.

I think for our use case we still have to wrap the Document::fromHtml call in a try {} catch {} just to be safe.