Automattic / jetpack

Security, performance, marketing, and design tools — Jetpack is made by WordPress experts to make WP sites safer and faster, and help you grow your traffic.
https://jetpack.com/
Other
1.59k stars 798 forks source link

Carousel: issues modifying markup in some environments #13531

Open jeherve opened 5 years ago

jeherve commented 5 years ago

Following #13446 (reverted in #13564), I'm seeing some issues in some environments, where:

I'll need to revise my implementation to take this into account. Here is a summary of the current problems: https://stackoverflow.com/q/58096669/1045686

I may have to give up on using DOMDocument here after all.

Noting that this is happening on WordPress.com for example, so this is blocking D32610-code.


Original issue: #13428 Related: #13604

mdawaffe commented 5 years ago

When are we looking for <ul>? Content from the block editor?

I ask because the gallery_style filter should always have a <div> opening tag.


A doctype and HTML wrappper is added around the HTML we are modifying.

I'm not sure that LIBXML_HTML_NOIMPLIED is sufficient here. DOMDocument always needs a root node, and the output from the gallery_style filter never has one. (The output from the the_content filter may not have one either.) For example:

$dom->loadHTML( '<style>foo</style><div class="hello">', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
var_dump( $dom->saveHTML() ); // <style>foo<div class="hello"></div></style>

So we should provide a root element, and we should probably make up a fake type of element to protect against losing content if the input is broken in peculiar ways. Perhaps something like:

$tag = "jetpack-" . mt_rand( 10000, 99999 );
$dom->loadHTML( "<$tag>$html</$tag>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
…
return substr( trim( $dom->saveHTML() ), strlen( $tag ) + 2, -1 * ( strlen( $tag ) + 3 ) );

The div tag we're editing (we are adding a new attribute) also gets autoclosed.

Part of the problem is that we're using this one function (Jetpack_Carousel->add_data_to_container()) in two different contexts:

  1. In the gallery_style filter, which contains only the opening part of the <div>.
  2. In the the_content filter, which contains the full HTML of the post.

For 1, we could run the function, then remove the closing tag.

For 2, we can just run the function as is.


DOMDocument can also do some weird things with character encodings if we're not careful. Let's be sure to test everything with fancy UTF-8 characters and probably some data in other charsets.

stale[bot] commented 4 years ago

This issue has been marked as stale. This happened because:

No further action is needed. But it's worth checking if this ticket has clear reproduction steps and it is still reproducible. Feel free to close this issue if you think it's not valid anymore — if you do, please add a brief explanation.

lsl commented 4 years ago

Edit, this might not be super useful after all: https://github.com/Automattic/jetpack/pull/13547#issuecomment-535281664 seems like the encoding needs to be known up front?

I recently had to figure something similar out for the block pattern translation stuff we've been doing, this snippet is possibly useful.

libxml_use_internal_errors( true );
$dom = new \DomDocument();
$dom->loadHTML( "<html><body>" . mb_convert_encoding( $post_content, 'HTML-ENTITIES', 'UTF-8' ) . "</body></html>" ), LIBXML_HTML_NODEFDTD );
// do some dom stuff
$new_content = preg_replace( [ '/^<html><body>/', '/<\/body><\/html>$/' ], '', $dom->saveHTML() );
// if no changes made this should pass
assert($new_content === $post_content);

mb_convert_encoding was to prevent the mojibake we were seeing in some attributes. Manually adding and removing the html/body wrap works around how LIBXML_HTML_NOIMPLIED causes a whitespace rewrite which breaks block validation in some cases.

I think you'd still need to close off and then unclose that div in one of the cases but that should be able to work the same way as the html/body wrap.

There may be more to this, I'm not completely sure about the contexts of use here but it's certainly a similar case.