WordPress / gutenberg

The Block Editor project for WordPress and beyond. Plugin is available from the official repository.
https://wordpress.org/gutenberg/

Dennis' list of broad and interesting things. #62437

Open dmsnell opened 4 months ago

dmsnell commented 4 months ago

Overall values and goals.

Performance guidelines:

Block Parser

Replace the default everything-at-once block parser with a lazy low-overhead parser.

The current block parser has served WordPress well, but it demands parsing the entire document into a block tree in memory, all at once, and it's not particularly efficient. In one damaged post that was 3 MB in size, it took 14 GB of memory to fully parse the document. This should not happen.
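
For context, the only interface the current parser offers is all-at-once. A sketch of how it is typically invoked today ($post_id stands in for any post):

// Today's parser returns the entire tree of blocks as nested PHP arrays,
// so memory use scales with the whole document rather than with the part
// a caller actually needs.
$blocks = parse_blocks( get_post( $post_id )->post_content );

foreach ( $blocks as $block ) {
    // Each $block already carries 'blockName', 'attrs', 'innerBlocks',
    // 'innerHTML', and 'innerContent' for its entire subtree.
}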

Core needs to be able to view blocks in isolation and to hold in memory only as much as it needs to properly render and process them. The need to materialize less of the block structure at once has been highlighted by projects and needs such as:

Block API

Block Hooks

HTML API

Overall Roadmap for the HTML API

There is no end in sight to the development of the HTML API, but development work largely falls into two categories: developing the API itself; and rewriting Core to take advantage of what the HTML API offers.

Further developing the HTML API.

New features and functionality.

Encoding and Decoding of Text Spans

There is so much in Core that would benefit from clarifying all of these boundaries, or from creating a clear point of demarcation between encoded and decoded content.
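
The Tag Processor already draws this line for attribute values, which hints at the shape such a demarcation could take elsewhere (a small sketch):

// Markup carries the encoded form; the Tag Processor hands back the decoded
// value a reader would see, so callers always know which domain they hold.
$processor = new WP_HTML_Tag_Processor( '<img alt="Dungeons &amp; Dragons">' );
$processor->next_tag( 'img' );

$alt     = $processor->get_attribute( 'alt' ); // "Dungeons & Dragons" -- decoded domain
$encoded = esc_attr( $alt );                   // re-encode only at the point of writing markup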

Decoding GET and POST args.

There is almost no consistency in how code decodes the values from $_GET and $_POST. Yet there is considerable confusion over some of the basic transformations that occur:
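
As one sketch of the layering a caller has to reason about today (not an exhaustive list of the transformations involved):

// PHP has already URL-decoded the query string into $_GET, and WordPress then
// slashes the superglobals in wp_magic_quotes(). For ?q=%22free%22+themes the
// stored value is '\"free\" themes', so even a "simple" read stacks several steps.
$q = isset( $_GET['q'] )
    ? sanitize_text_field( wp_unslash( $_GET['q'] ) )
    : '';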

Prior art

The HTML API can help here in coordination with other changes in core. Notably:

With these new specifications, the HTML API can ensure that whatever is decoded from $_GET and $_POST is what was intended to be communicated by a browser or other HTTP client. In addition, it can provide helpers not present in existing WordPress idioms, such as default values.

$search_query = request_arg( 'GET', 'q' );
$search_index = request_arg( 'GET', 'i', 'posts' );

Rewriting Core to take advantage of the HTML API.

Big Picture Changes

Confusion of encoded and decoded text.

There's a dual nature to encoded text in HTML. WordPress itself frequently conflates the encoded domain and the decoded domain.

Consider, for example, wp_spaces_regexp(), which by default returns the following pattern: [\r\n\t ]|\xC2\xA0|&nbsp;. There are multiple things about this pattern that reflect the legacy of conflation:
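
A small sketch of the two domains meeting inside that one pattern:

// The same non-breaking space, in its two domains:
$encoded = 'one&nbsp;two';   // as stored in markup (encoded)
$decoded = "one\xC2\xA0two"; // as a reader sees it (decoded UTF-8)

// The pattern matches both, so callers never declare which domain their
// input is in -- and that ambiguity spreads through everything built on it.
preg_match( '/' . wp_spaces_regexp() . '/', $encoded ); // 1
preg_match( '/' . wp_spaces_regexp() . '/', $decoded ); // 1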

Parsing and performance.

In addition to confused and corrupted content, Core also stands to make significant performance improvements by adopting the values of the HTML API and the streaming parser interfaces. Some functions are themselves extremely susceptible to catastrophic backtracking or memory bloat.
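
As a sketch of the streaming style (not the actual implementation of any Core function), the Tag Processor visits each tag once and edits it in place, without building a tree or re-scanning the document with a regular expression:

// $html is whatever markup is being filtered.
$processor = new WP_HTML_Tag_Processor( $html );
while ( $processor->next_tag( 'img' ) ) {
    if ( null === $processor->get_attribute( 'loading' ) ) {
        $processor->set_attribute( 'loading', 'lazy' );
    }
}
$html = $processor->get_updated_html();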

Database

Sync Protocol

WordPress needs the ability to reliably synchronize data with other WordPresses and internal services. This depends on having two things:

While this works to synchronize resources between WordPresses, it also serves interesting purposes within a single WordPress, for any number of processes that rely on invalidating data or caches:

XML API

Overall Roadmap for the XML API

While less prominent than the HTML API, WordPress also needs to reliably read, modify, and write XML. XML handling appears in a number of places:

eliot-akira commented 2 days ago

I hope it's OK to comment here; I'm particularly interested in this section: Rewriting Core to take advantage of the HTML API.

Core currently runs many different processing stages on the output, but each processing stage runs over the full contents, and often those contents are processed repeatedly as strings are stitched together and passed around. A final global HTML filter powered by the HTML API could give an opportunity for all of these processing stages to run only once.

There's a nexus of issues involving the current stack of content filters that are applied to the HTML before and after blocks are rendered. I've been bitten by this as a block author, and previously with shortcodes, where the output is corrupted unexpectedly - for example, an HTML data attribute with escaped JSON that contains a faux "shortcode".

Here are the places where similar sets of content filters are applied, including do_blocks.

// Default filters attached to 'the_content' (wp-includes/default-filters.php)
add_filter( 'the_content', 'do_blocks', 9 );
add_filter( 'the_content', 'wptexturize' );
add_filter( 'the_content', 'convert_smilies', 20 );
add_filter( 'the_content', 'wpautop' );
add_filter( 'the_content', 'shortcode_unautop' );
add_filter( 'the_content', 'prepend_attachment' );
add_filter( 'the_content', 'wp_replace_insecure_home_url' );
add_filter( 'the_content', 'do_shortcode', 11 ); // AFTER wpautop().
add_filter( 'the_content', 'wp_filter_content_tags', 12 ); // Runs after do_shortcode().

// Block template rendering
$content = $wp_embed->run_shortcode( $_wp_current_template_content );
$content = $wp_embed->autoembed( $content );
$content = shortcode_unautop( $content );
$content = do_shortcode( $content );
...
$content = do_blocks( $content );
$content = wptexturize( $content );
$content = convert_smilies( $content );
$content = wp_filter_content_tags( $content, 'template' );
$content = str_replace( ']]>', ']]&gt;', $content );

// Template part rendering
// Run through the actions that are typically taken on the_content.
$content                       = shortcode_unautop( $content );
$content                       = do_shortcode( $content );
$seen_ids[ $template_part_id ] = true;
$content                       = do_blocks( $content );
unset( $seen_ids[ $template_part_id ] );
$content = wptexturize( $content );
$content = convert_smilies( $content );
$content = wp_filter_content_tags( $content, "template_part_{$area}" );
...
// Elsewhere, e.g. when post content is passed back through 'the_content'
$content = $wp_embed->autoembed( $content );
$content = apply_filters( 'the_content', str_replace( ']]>', ']]&gt;', $content ) );

Here are some of the issues related to these content filters.

They all lead back to the same root cause, which could be solved with an HTML processing pipeline. Instead of content "blobs" and search-and-replace over strings with regular expressions, we would be working with a well-defined data structure, down to the tags, attributes, text nodes, and tokens.


To quote from a prior comment:

All this is related to the issue described in this ticket, where similar sets of content filters are being applied in multiple places, sometimes repeatedly on the same content. In WP 6.2.2, this is further complicated by an inconsistent order of filters.

The way it currently works, there's no way for block authors to opt out of these filters; the most they can do is work around or reverse their effects. A better design would be not to process the resulting HTML of a block by default, except for certain blocks where the_content filter makes sense.

I like the idea of a solution that applies wptexturize, do_shortcode and similar filters to specific blocks that need them (paragraph block, classic block, shortcode block, heading block, etc) rather than applying them across all blocks.
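
For what it's worth, a rough sketch of that idea is already expressible with the existing render_block filter (the block names and filter choices here are only examples, not a Core proposal):

// Apply the legacy text filters per block, only where a block wants them.
add_filter( 'render_block', function ( $block_content, $block ) {
    $wants_legacy_filters = in_array(
        $block['blockName'],
        array( 'core/paragraph', 'core/heading', 'core/freeform', 'core/shortcode' ),
        true
    );

    if ( $wants_legacy_filters ) {
        $block_content = wptexturize( $block_content );
        $block_content = do_shortcode( $block_content );
    }

    return $block_content;
}, 10, 2 );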


It's encouraging to see where the HTML API is headed, as a streaming parser and single-pass processor that could unite and replace existing content filters with something that understands the semantic structure of the HTML document.

A final global HTML filter powered by the HTML API could give an opportunity for all of these processing stages to run only once and they could all run together while traversing the document, for a single pass through the HTML that minimizes allocations and duplicated work.
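
As a rough sketch of what "running together" might look like with today's WP_HTML_Tag_Processor, using two attribute-level stages as stand-ins for the real filters:

// $content is whatever markup is being filtered; both stages share one traversal.
$processor = new WP_HTML_Tag_Processor( $content );
while ( $processor->next_tag() ) {
    switch ( $processor->get_tag() ) {
        case 'IMG':
            if ( null === $processor->get_attribute( 'decoding' ) ) {
                $processor->set_attribute( 'decoding', 'async' ); // wp_filter_content_tags-style work
            }
            break;

        case 'A':
            if ( null !== $processor->get_attribute( 'target' ) ) {
                $processor->set_attribute( 'rel', 'noopener' ); // a second stage, same pass
            }
            break;
    }
}
$content = $processor->get_updated_html();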

This could be like a sword that cuts through the knot of existing complected content filters, replacing them with a consistent and extensible HTML processor.

Interesting how shortcodes appear in these threads, since they're a parallel mini-language embedded in HTML. I'm curious to see how "bits" (dynamic tokens) evolve to become a better shortcode system.

(Was thinking "blit" might be a possible naming, like blocks and blits, since bit already has a common meaning in programming. Well I guess blit does too: "blit - bit block transfer: A logical operation in which a block of data is rapidly moved or copied in memory, most commonly used to animate two-dimensional graphics.")

eliot-akira commented 2 days ago

On a tangent, I wonder about processing HTML not only in a single pass but also walking the tree in a single flat loop, without recursive function calls. I'm guessing this is how a streaming parser is meant to be used.

In pseudo-code, a processor with nested calls might look like:

type HtmlNode =
  | { type: 'tag'; tag: string; attributes?: Record<string, string>; children?: HtmlNode[] }
  | { type: 'text' | 'comment'; value: string }

function processHtmlNodesRecursively(nodes: HtmlNode[]): string {
  let content = ''
  for (const node of nodes) {
    if (node.type === 'tag') {
      const { tag, children = [] } = node
      content += `<${tag}>` // attribute serialization elided
      content += processHtmlNodesRecursively(children) // :( call stack grows with nesting depth
      content += `</${tag}>`
      continue
    }
    content += node.value // text, comment..
  }
  return content
}

Instead a flat loop with iteration:

function processHtmlNodesIteratively(nodes: HtmlNode[]): string {
  // Work list of nodes to process, plus close tags queued as plain strings.
  const stack: (HtmlNode | string)[] = [...nodes]
  let content = ''
  let node: HtmlNode | string | undefined
  while ((node = stack.shift()) !== undefined) {
    if (typeof node === 'string') {
      content += node // a close tag queued earlier
      continue
    }
    if (node.type === 'tag') {
      const { tag, children = [] } = node
      content += `<${tag}>` // attribute serialization elided
      stack.unshift(...children, `</${tag}>`) // :) children first, then the close tag
      continue
    }
    content += node.value // text, comment..
  }
  return content
}

Mm, that isn't streaming, but the basic idea applies, I think, if it used a parser instance to pull each parsed node and processed the nested tree of nodes iteratively or recursively.

Where the DOM operates on the document as a whole — building the full abstract syntax tree of an XML document for convenience of the user — SAX parsers operate on each piece of the XML document sequentially, issuing parsing events while making a single pass through the input stream.

What I'm picturing is a streaming processor and renderer, which evaluates the parsed nodes dynamically back into a string (valid HTML document/fragment).


It reminds me of a Lisp evaluator with tail-call elimination.

[Image] (From: The Make-A-Lisp Process - Step 5: Tail call optimization)


This topic is on my mind because I once ran into a stack overflow (or a maximum-recursion-depth error) while parsing a large HTML document with deeply nested nodes, due to how the parse function recursively called itself. Since then I think about this ~unrolling the loop~ "replacing recursion with iteration" concept and how it can be applied in different situations for efficiency.


Related:

rehype is an ecosystem of plugins that work with HTML as structured data, specifically ASTs (abstract syntax trees). ASTs make it easy for programs to deal with HTML. We call those programs plugins. Plugins inspect and change trees.

PostHTML is a tool for transforming HTML/XML with JS plugins. PostHTML itself is very small. It includes only a HTML parser, a HTML node tree API and a node tree stringifier. All HTML transformations are made by plugins. And these plugins are just small plain JS functions, which receive a HTML node tree, transform it, and return a modified tree.

rehype is based on hast, Hypertext Abstract Syntax Tree format.

hast is a specification for representing HTML (and embedded SVG or MathML) as an abstract syntax tree. It implements the unist spec.

I like the idea of a language-agnostic spec and schema, implemented for example in PHP and TypeScript. Then the same HTML data structures could be shared and processed by both the WordPress backend and the Block Editor.