Masterminds / html5-php

An HTML5 parser and serializer for PHP.
http://masterminds.github.io/html5-php/
Other
1.57k stars 114 forks source link

CDATA and JavaScript gets mangled when manipulating XHTML #139

Open dac514 opened 6 years ago

dac514 commented 6 years ago

When loading a XHTML document with JavaScript into html5-php I expect the <script>//<![CDATA[ conversion to behave like it does in plain old \DOMDocument. Instead they get turned into htmlentities, breaking the JavaScript on our target platforms (ebook readers).

Expected:

<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
    console.log(i);
}
//]]>
</script>

Actual:

<script>
//&lt;![CDATA[
for ( var i = 0; i &lt; 10; i++ ) {
    console.log(i);
}
//]]&gt;
</script>

< becomes &lt;, > becomes &gt;, ...

Test case:

$body = '<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Test</title>
</head>
<body>
<p>Hi There!</p>
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
    console.log(i);
}
//]]>
</script>
</body>
</html>';

$d1 = new \DOMDocument();
$d1->loadXml( $body );
echo $d1->saveXML();

$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
echo $d2->saveXML();
dac514 commented 6 years ago

I would like to submit a PR but I need clarification on: \Masterminds\HTML5\Parser\EventHandler::text.

    /**
     * A unit of parsed character data.
     *
     * Entities in this text are *already decoded*.
     */
    public function text($cdata);

What does Entities in this text are *already decoded* mean? If it means that $cdata should not be manipulated then changing \Masterminds\HTML5\Parser\DOMTreeBuilder::text from:

// fprintf(STDOUT, "Appending text %s.", $data);
$node = $this->doc->createTextNode($data);
$this->current->appendChild($node);

To:

// fprintf(STDOUT, "Appending text %s.", $data);
$frag = $this->doc->createDocumentFragment();
$frag->appendXML($data);
$this->current->appendChild($frag);

Solves this issue.

DOMDocument->createTextNode() encodes entities. Is this the desired behaviour or not?

Source: https://stackoverflow.com/a/7725965

goetas commented 6 years ago

The thing is that "script" is a raw text tag, so Tokenizer::rawText() looks directly for instead of looking for a possible cdata sequence.

When serializing back the dom, are you using this lib or the domdocument::save xml method?

dac514 commented 6 years ago

I am using this lib:

$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
echo $d2->saveXML();

Entities in this text are *already decoded, rawText, .... these methods and comments lead me to believe that if you pass then the result should be , not H&eacute;

My question, worded another way, is what is rawText? The code for rawText calls text and text creates the node using createTextNode which will transform entities.

If this is the case then my proposed fix is no good.

dac514 commented 6 years ago

I am using this lib:

Ok, duh. No I am not. My mistake. When changing to:

$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
$mm->saveHTML( $d2 );

I get:

<!DOCTYPE html>
<html xml:lang="en"><head>
<title>Test</title>
</head>
<body>
<p>1</p>
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {8
    console.log(i);
}
//]]>
</script>
<p>2</p>
</body>
</html>

Maybe this is #nobug? Thank you for your patience.

goetas commented 6 years ago

Doing

$html = '
<!DOCTYPE html>
<html>
    <head>
        <script>ababc</script>
        <script>ab < abc</script>
        <script>
        // <![CDATA[
            foo <
        // ]]>
        bar
        </script>
    </head>
</html>';

$html5 = new HTML5();
$doc = $html5->loadHTML($html);
var_dump($html5->saveHTML($doc));

I get:

<!DOCTYPE html>
<html>
    <head>
        <script>ababc</script>
        <script>ab < abc</script>
        <script>
        // <![CDATA[
            foo <
        // ]]>
        bar
        </script>
    </head>
</html>
dac514 commented 6 years ago

Back from vacation so excuse the all over the place tangent.

Elsewhere in our code, we have often used $mm->saveHTML( $d2 ) where $mm equals \Masterminds\HTML5

For this specific bug we need to save back as valid XML. We expect similar results to \Masterminds\HTML5\Tests\Parser\DOMTreeBuilderTest::testMoveNonInlineElements

I just tried the proposed fix and, for this specific bug, and it doesn' work for us because <br/> get turned into <br> which breaks our XSLT.

The issue remains. We need

<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
    console.log(i);
}
//]]>
</script>

Preserved in XML saves like \DOMDocument does it.

goetas commented 6 years ago

$html5->saveHTML is there exactly because the DOM version did not work as expected in many other cases... Re opening as "feature request"