Open jeherve opened 5 years ago
When are we looking for <ul>
? Content from the block editor?
I ask because the gallery_style
filter should always have a <div>
opening tag.
A doctype and HTML wrappper is added around the HTML we are modifying.
I'm not sure that LIBXML_HTML_NOIMPLIED
is sufficient here. DOMDocument
always needs a root node, and the output from the gallery_style
filter never has one. (The output from the the_content
filter may not have one either.) For example:
$dom->loadHTML( '<style>foo</style><div class="hello">', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
var_dump( $dom->saveHTML() ); // <style>foo<div class="hello"></div></style>
So we should provide a root element, and we should probably make up a fake type of element to protect against losing content if the input is broken in peculiar ways. Perhaps something like:
$tag = "jetpack-" . mt_rand( 10000, 99999 );
$dom->loadHTML( "<$tag>$html</$tag>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
…
return substr( trim( $dom->saveHTML() ), strlen( $tag ) + 2, -1 * ( strlen( $tag ) + 3 ) );
The
div
tag we're editing (we are adding a new attribute) also gets autoclosed.
Part of the problem is that we're using this one function (Jetpack_Carousel->add_data_to_container()
) in two different contexts:
gallery_style
filter, which contains only the opening part of the <div>
.the_content
filter, which contains the full HTML of the post.For 1, we could run the function, then remove the closing tag.
For 2, we can just run the function as is.
DOMDocument
can also do some weird things with character encodings if we're not careful. Let's be sure to test everything with fancy UTF-8 characters and probably some data in other charsets.
This issue has been marked as stale. This happened because:
No further action is needed. But it's worth checking if this ticket has clear reproduction steps and it is still reproducible. Feel free to close this issue if you think it's not valid anymore — if you do, please add a brief explanation.
Edit, this might not be super useful after all: https://github.com/Automattic/jetpack/pull/13547#issuecomment-535281664 seems like the encoding needs to be known up front?
I recently had to figure something similar out for the block pattern translation stuff we've been doing, this snippet is possibly useful.
libxml_use_internal_errors( true );
$dom = new \DomDocument();
$dom->loadHTML( "<html><body>" . mb_convert_encoding( $post_content, 'HTML-ENTITIES', 'UTF-8' ) . "</body></html>" ), LIBXML_HTML_NODEFDTD );
// do some dom stuff
$new_content = preg_replace( [ '/^<html><body>/', '/<\/body><\/html>$/' ], '', $dom->saveHTML() );
// if no changes made this should pass
assert($new_content === $post_content);
mb_convert_encoding
was to prevent the mojibake we were seeing in some attributes. Manually adding and removing the html/body wrap works around how LIBXML_HTML_NOIMPLIED
causes a whitespace rewrite which breaks block validation in some cases.
I think you'd still need to close off and then unclose that div in one of the cases but that should be able to work the same way as the html/body wrap.
There may be more to this, I'm not completely sure about the contexts of use here but it's certainly a similar case.
Following #13446 (reverted in #13564), I'm seeing some issues in some environments, where:
div
tag we're editing (we are adding a new attribute) also gets autoclosed.I'll need to revise my implementation to take this into account. Here is a summary of the current problems: https://stackoverflow.com/q/58096669/1045686
I may have to give up on using
DOMDocument
here after all.Noting that this is happening on WordPress.com for example, so this is blocking D32610-code.
Original issue: #13428 Related: #13604