Open GoogleCodeExporter opened 9 years ago
Original comment by tobiasz....@gmail.com
on 23 Aug 2009 at 5:35
Is there a safe way to prevent phpQuery from altering the document until this issue is resolved?
Thanks
Original comment by sni...@gmail.com
on 23 Aug 2009 at 5:58
I haven't checked it yet, but as always, try forcing the content type. Anyway, I think the problem is definitely in the DOMDocumentWrapper class. Trying different revisions may be a quick fix for now. Good luck, the issue looks serious.
Thanks for the report.
Original comment by tobiasz....@gmail.com
on 23 Aug 2009 at 6:34
Forcing the content type does not help: newDocumentHTML() results in the same issue as above, while newDocumentXHTML() results in an error.
Original comment by sni...@gmail.com
on 23 Aug 2009 at 7:31
It's the function charsetFixHTML($markup) which is breaking it.
More specifically, the head is being altered incorrectly in the following successive lines:
$markup = substr($markup, 0, $matches[0][1])
    .substr($markup, $matches[0][1]+strlen($metaContentType));
$headStart = stripos($markup, '<head>');
$markup = substr($markup, 0, $headStart+6).$metaContentType
    .substr($markup, $headStart+6);
For the moment I have bypassed those lines and phpQuery works fine over the set of HTML pages I am processing. I am not sure if I am breaking something elsewhere by doing this...
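One plausible way those lines corrupt a page, assuming the head tag carries attributes so that the literal '<head>' never occurs: stripos() then returns false, false + 6 evaluates to 6, and the meta tag is spliced in six bytes into the document. The markup below is a made-up example, not one of the affected pages.

```php
<?php
// Hypothetical input: the head tag has an attribute, so '<head>' is never found.
$markup = '<!DOCTYPE html><html><head profile="x"><title>t</title></head><body></body></html>';
$meta   = '<meta http-equiv="Content-Type" content="text/html;charset=utf-8">';

// stripos() returns false (not an offset) when the needle is missing.
$headStart = stripos($markup, '<head>');
var_dump($headStart); // bool(false)

// false is coerced to 0 in arithmetic, so the insert point becomes offset 6.
$broken = substr($markup, 0, $headStart + 6) . $meta . substr($markup, $headStart + 6);

echo substr($broken, 0, 11), "\n"; // <!DOCT<meta
```

Checking the stripos() result with === false before doing arithmetic on it would at least avoid splicing the tag into the wrong place.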
Original comment by sni...@gmail.com
on 24 Aug 2009 at 12:29
I've found a safe fix for this, just replace the functions:
protected function charsetFixHTML($markup) {
    $matches = array();
    // find meta tag
    preg_match('@\s*<meta[^>]+http-equiv\\s*=\\s*(["|\'])Content-Type\\1([^>]+?)>@i',
        $markup, $matches, PREG_OFFSET_CAPTURE
    );
    if (! isset($matches[0]))
        return;
    $metaContentType = $matches[0][0];
    $markup = substr($markup, 0, $matches[0][1])
        .substr($markup, $matches[0][1]+strlen($metaContentType));
    $headStart = preg_match("/<head([\w\W]+?)>/i", $markup, $captures, PREG_OFFSET_CAPTURE);
    if ($headStart > 0) {
        $headStart = $captures[0][1];
        $offset = strlen($captures[0][0]);
        $markup = substr($markup, 0, $headStart+$offset).$metaContentType
            .substr($markup, $headStart+$offset);
    }
    return $markup;
}
protected function charsetAppendToHTML($html, $charset, $xhtml = false) {
    // remove existing meta[type=content-type]
    $html = preg_replace('@\s*<meta[^>]+http-equiv\\s*=\\s*(["|\'])Content-Type\\1([^>]+?)>@i', '', $html);
    $meta = '<meta http-equiv="Content-Type" content="text/html;charset='
        .$charset.'" '
        .($xhtml ? '/' : '')
        .'>';
    if (strpos($html, '<head') === false) {
        return $meta.$html;
    } else {
        return preg_replace(
            '@<head(.*?)(?(?<!\?)>)@s',
            '<head\\1>'.$meta,
            $html
        );
    }
}
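A caveat on the head-matching pattern "/<head([\w\W]+?)>/i" used above: [\w\W]+? needs at least one character after "<head", so on a bare <head> tag the match runs on to the next ">". The sketch below shows that behaviour and one possible alternative. insertAfterHeadTag() is a hypothetical helper written for illustration, not part of phpQuery, and it assumes attribute values on the head tag do not contain ">".

```php
<?php
// The pattern from the patch, applied to a head tag without attributes.
preg_match("/<head([\w\W]+?)>/i", '<head><title>t</title></head>', $m);
echo $m[0], "\n"; // <head><title>

// Hypothetical helper: insert $meta right after the opening head tag.
function insertAfterHeadTag($markup, $meta) {
    // Matches <head> or <head attr="...">, but not <header>.
    $pattern = '/<head(?:\s[^>]*)?>/i';
    if (preg_match($pattern, $markup, $captures, PREG_OFFSET_CAPTURE) !== 1) {
        return $markup; // no head tag: leave the markup untouched
    }
    $end = $captures[0][1] + strlen($captures[0][0]);
    return substr($markup, 0, $end) . $meta . substr($markup, $end);
}

$meta = '<meta http-equiv="Content-Type" content="text/html;charset=utf-8">';
echo insertAfterHeadTag('<html><head><title>t</title></head></html>', $meta), "\n";
echo insertAfterHeadTag('<html><head profile="x"><title>t</title></head></html>', $meta), "\n";
```

With the pattern from the patch, the meta tag would be inserted after the 13-character match "<head><title>", that is, inside the title element. Whether this is what caused problems for others in this thread is not confirmed.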
Original comment by gabriel...@gmail.com
on 9 May 2012 at 4:46
Thanks for the report and for Gabriel's suggested patch.
I applied this to 0.9.5, and while it did solve the issue, my phpQuery code, which ran fine before, stopped working. I haven't yet had time to analyse why, but I don't think the patch is mature enough yet.
I'll look into this and post my findings here.
Original comment by tom.law...@whitesitewebsites.co.uk
on 13 Mar 2013 at 1:57
Original issue reported on code.google.com by
sni...@gmail.com
on 22 Aug 2009 at 11:11