diacainiao / phpquery

Automatically exported from code.google.com/p/phpquery

phpQuery altering content incorrectly #129

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

The following code utilizing this specific URL will result in incorrect
behavior.

$link = "http://fr.php.net/function.strpos";
$page = file_get_contents($link);
$page = phpQuery::newDocument($page);

What is the expected output? What do you see instead?
The original $page, before phpQuery::newDocument(), is correctly formatted,
while the resulting $page contains an incorrectly inserted doctype.

--ORIGINAL PAGE AFTER file_get_contents()--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head profile="http://purl.org/NET/erdf/profile">
 <title>PHP: strpos - Manual</title>
 <style type="text/css" media="all">
  @import url("/styles/site.css");
  @import url("/styles/mirror.css");
 </style>

--RESULT PAGE AFTER phpQuery::newDocument()--
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>YPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"&gt;
</p>
 <title>PHP: strpos - Manual</title>
<style type="text/css" media="all">
  @import url("/styles/site.css");
  @import url("/styles/mirror.css");
 </style>

PS: notice that the old DOCTYPE now sits inside a new <p> element and that a
new DOCTYPE has been added in front of it.

Original issue reported on code.google.com by sni...@gmail.com on 22 Aug 2009 at 11:11

GoogleCodeExporter commented 8 years ago

Original comment by tobiasz....@gmail.com on 23 Aug 2009 at 5:35

GoogleCodeExporter commented 8 years ago
Is there a safe way to prevent phpQuery from altering the document until this
issue is resolved?

Thanks

Original comment by sni...@gmail.com on 23 Aug 2009 at 5:58

GoogleCodeExporter commented 8 years ago
I haven't checked this yet, but as always, try forcing the content type. In any
case, I think the problem is definitely in the DOMDocumentWrapper class. Trying
a different revision might be a quick fix for now. Good luck, the issue looks
serious.

Thanks for the report.

Original comment by tobiasz....@gmail.com on 23 Aug 2009 at 6:34

GoogleCodeExporter commented 8 years ago
Forcing the content type does not help: newDocumentHTML() results in the same
issue as above, while newDocumentXHTML() results in an error.

Original comment by sni...@gmail.com on 23 Aug 2009 at 7:31

GoogleCodeExporter commented 8 years ago
It's the function charsetFixHTML($markup) which is breaking it.

More specifically, the head is being altered incorrectly in the following
successive lines:

$markup = substr($markup, 0, $matches[0][1])
    .substr($markup, $matches[0][1]+strlen($metaContentType));

$headStart = stripos($markup, '<head>');

$markup = substr($markup, 0, $headStart+6).$metaContentType
    .substr($markup, $headStart+6);

For the moment I have bypassed those lines, and phpQuery works fine over the
set of HTML pages I am processing. I am not sure whether I am breaking
something elsewhere by doing this...
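
A minimal sketch of why those lines can corrupt this particular page, assuming the cause is the literal '<head>' search: the reported page opens its head with <head profile="...">, so stripos() returns false, and false + 6 evaluates to 6. The markup and the meta tag below are shortened, hypothetical stand-ins, and the meta tag is kept in a separate variable instead of being cut out of the markup first.

```php
<?php
// Hypothetical, shortened stand-in for the page from the report.
$markup = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">' . "\n"
    . '<html><head profile="http://purl.org/NET/erdf/profile">' . "\n"
    . '<title>PHP: strpos - Manual</title></head><body></body></html>';

// Hypothetical meta tag standing in for the one charsetFixHTML() extracts.
$metaContentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

// The literal search cannot match '<head profile="...">'.
$headStart = stripos($markup, '<head>');
var_dump($headStart); // bool(false)

// false counts as 0 in arithmetic, so the insertion offset becomes 6,
// which lands in the middle of "<!DOCTYPE".
$broken = substr($markup, 0, $headStart + 6)
    . $metaContentType
    . substr($markup, $headStart + 6);

echo substr($broken, 0, 6), "\n"; // <!DOCT
```

The split at offset 6 leaves <!DOCT in front of the meta tag and YPE html PUBLIC ... after it, which looks consistent with the <p>YPE html PUBLIC fragment in the reported output.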

Original comment by sni...@gmail.com on 24 Aug 2009 at 12:29

GoogleCodeExporter commented 8 years ago
I've found a safe fix for this. Just replace these two functions:

    protected function charsetFixHTML($markup) {
        $matches = array();
        // find the Content-Type meta tag and its offset
        preg_match('@\s*<meta[^>]+http-equiv\\s*=\\s*(["|\'])Content-Type\\1([^>]+?)>@i',
            $markup, $matches, PREG_OFFSET_CAPTURE
        );

        // no meta tag found: hand the markup back unchanged
        if (! isset($matches[0]))
            return $markup;

        // cut the meta tag out of its current position
        $metaContentType = $matches[0][0];
        $markup = substr($markup, 0, $matches[0][1])
            .substr($markup, $matches[0][1]+strlen($metaContentType));

        // locate the opening head tag; [^>]* matches both <head> and <head attr="...">
        $captures = array();
        $found = preg_match('/<head\b[^>]*>/i', $markup, $captures, PREG_OFFSET_CAPTURE);

        if ($found > 0) {
            // re-insert the meta tag right after the opening head tag
            $headStart = $captures[0][1];
            $offset = strlen($captures[0][0]);

            $markup = substr($markup, 0, $headStart+$offset).$metaContentType
                .substr($markup, $headStart+$offset);
        }

        return $markup;
    }
    protected function charsetAppendToHTML($html, $charset, $xhtml = false) {
        // remove existing meta[type=content-type]
        $html = preg_replace('@\s*<meta[^>]+http-equiv\\s*=\\s*(["|\'])Content-Type\\1([^>]+?)>@i', '', $html);
        $meta = '<meta http-equiv="Content-Type" content="text/html;charset='
            .$charset.'" '
            .($xhtml ? '/' : '')
            .'>';
        if (strpos($html, '<head') === false) {
                return $meta.$html;
        } else {
            return preg_replace(
                '@<head(.*?)(?(?<!\?)>)@s',
                '<head\\1>'.$meta,
                $html
            );
        }
    }
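
One way to sanity-check head matching before patching the library is to try the insertion step in isolation. The sketch below is only an illustration: insertAfterHeadTag() is a hypothetical helper, not part of phpQuery, and the sample inputs are made up. It assumes the goal is to place a string directly after the opening head tag, whether or not that tag carries attributes, and to leave the markup alone when there is no head.

```php
<?php
// Hypothetical helper, not part of phpQuery.
function insertAfterHeadTag(string $markup, string $insert): string
{
    // Match the opening head tag with or without attributes.
    // \b keeps tags such as <header> from matching.
    if (!preg_match('/<head\b[^>]*>/i', $markup, $m, PREG_OFFSET_CAPTURE)) {
        return $markup; // no head tag: leave the markup untouched
    }
    // Offset just past the closing '>' of the opening head tag.
    $end = $m[0][1] + strlen($m[0][0]);
    return substr($markup, 0, $end) . $insert . substr($markup, $end);
}

$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

echo insertAfterHeadTag('<html><head><title>t</title></head></html>', $meta), "\n";
echo insertAfterHeadTag('<html><head profile="x"><title>t</title></head></html>', $meta), "\n";
echo insertAfterHeadTag('<p>no head here</p>', $meta), "\n";
```

Running it against a bare <head>, a head with attributes, and a fragment without any head covers the three cases that matter for this issue.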

Original comment by gabriel...@gmail.com on 9 May 2012 at 4:46

GoogleCodeExporter commented 8 years ago
Thanks for the report and for gabriel's suggested patch.

I applied this to 0.9.5, and while it did solve the issue, my phpQuery code,
which ran fine before, stopped working. I haven't yet had time to analyse why,
but I don't think the patch is mature enough yet.

I'll look into this and post findings here.

Original comment by tom.law...@conduitinnovation.co.uk on 13 Mar 2013 at 1:57