Closed swissspidy closed 2 years ago
@swissspidy I could not reproduce the error. Could you please post a snippet that could raise the error? I tried the following code,
$request = new CurlRemoteGetRequest();
$response = $request->get('https://chuseok.org/places-to-visit-during-chuseok-mid-autumn-summer/');
$document = Document::fromHtml($response->getBody());
$document->saveHTML();
Oh, I forgot that in our code base we remove everything after the closing </head>
before parsing, for performance reasons (we don't need the body).
Basically:
$html = $response->getBody();
$html_head_end = stripos( $html, '</head>' );
if ( $html_head_end ) {
$html = substr( $html, 0, $html_head_end );
}
So I guess that explains it. There's no body.
@swissspidy In your html, the closing head tag is missing. I guess you can avoid the error by including the closing head tag. Despite of that, I've sent a PR to handle this kind of scenario. I hope it'll fix your error.
Ah I see!
Well, #524 looks good but apparently it doesn't fully resolve this issue in our case. I forgot that we also only fetch the first 150 KB of the document to work on that (i.e. wp_remote_get( $url, [ 'limit_response_size' => 153600 ] )
).
When trying to parse that, there can still be a DOMException
in moveInvalidHeadNodesToBody
unfortunately.
Apparently not even my original while ($node !== null) {
fix helps with that. Not sure why I didn't catch this before.
I think for our use case we still have to wrap the Document::fromHtml
call in a try {} catch {}
just to be safe.
When trying to parse
https://chuseok.org/places-to-visit-during-chuseok-mid-autumn-summer/
withDocument::fromHtml()
, I get an uncaughtDomException
:This is the offending code:
https://github.com/ampproject/amp-toolbox-php/blob/89cd1f6bad3512fb8a08aae5fb6aedc64e988392/src/Dom/Document.php#L655-L669
PHPStorm warns about accessing
$node
potentially causing a null pointer exception too.This fixes the bug for me: