Open dac514 opened 6 years ago
I would like to submit a PR but I need clarification on: \Masterminds\HTML5\Parser\EventHandler::text
.
/**
* A unit of parsed character data.
*
* Entities in this text are *already decoded*.
*/
public function text($cdata);
What does Entities in this text are *already decoded* mean? If it means that $cdata
should not be manipulated then changing \Masterminds\HTML5\Parser\DOMTreeBuilder::text
from:
// fprintf(STDOUT, "Appending text %s.", $data);
$node = $this->doc->createTextNode($data);
$this->current->appendChild($node);
To:
// fprintf(STDOUT, "Appending text %s.", $data);
$frag = $this->doc->createDocumentFragment();
$frag->appendXML($data);
$this->current->appendChild($frag);
Solves this issue.
DOMDocument->createTextNode()
encodes entities. Is this the desired behaviour or not?
The thing is that "script" is a raw text tag, so Tokenizer::rawText() looks directly for instead of looking for a possible cdata sequence.
When serializing back the dom, are you using this lib or the domdocument::save xml method?
I am using this lib:
$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
echo $d2->saveXML();
Entities in this text are *already decoded
, rawText
, .... these methods and comments lead me to believe that if you pass Hé
then the result should be Hé
, not Hé
My question, worded another way, is what is rawText
? The code for rawText
calls text
and text creates the node using createTextNode
which will transform entities.
If this is the case then my proposed fix is no good.
I am using this lib:
Ok, duh. No I am not. My mistake. When changing to:
$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
$mm->saveHTML( $d2 );
I get:
<!DOCTYPE html>
<html xml:lang="en"><head>
<title>Test</title>
</head>
<body>
<p>1</p>
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {8
console.log(i);
}
//]]>
</script>
<p>2</p>
</body>
</html>
Maybe this is #nobug? Thank you for your patience.
Doing
$html = '
<!DOCTYPE html>
<html>
<head>
<script>ababc</script>
<script>ab < abc</script>
<script>
// <![CDATA[
foo <
// ]]>
bar
</script>
</head>
</html>';
$html5 = new HTML5();
$doc = $html5->loadHTML($html);
var_dump($html5->saveHTML($doc));
I get:
<!DOCTYPE html>
<html>
<head>
<script>ababc</script>
<script>ab < abc</script>
<script>
// <![CDATA[
foo <
// ]]>
bar
</script>
</head>
</html>
Back from vacation so excuse the all over the place tangent.
Elsewhere in our code, we have often used $mm->saveHTML( $d2 )
where $mm
equals \Masterminds\HTML5
For this specific bug we need to save back as valid XML. We expect similar results to \Masterminds\HTML5\Tests\Parser\DOMTreeBuilderTest::testMoveNonInlineElements
I just tried the proposed fix and, for this specific bug, and it doesn' work for us because <br/>
get turned into <br>
which breaks our XSLT.
The issue remains. We need
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
console.log(i);
}
//]]>
</script>
Preserved in XML saves like \DOMDocument
does it.
$html5->saveHTML
is there exactly because the DOM version did not work as expected in many other cases...
Re opening as "feature request"
When loading a XHTML document with JavaScript into html5-php I expect the
<script>//<![CDATA[
conversion to behave like it does in plain old\DOMDocument
. Instead they get turned into htmlentities, breaking the JavaScript on our target platforms (ebook readers).Expected:
Actual:
Test case: