Gutenberg still parsing HTML with regular expressions

WordPress / gutenberg

The Block Editor project for WordPress and beyond. Plugin is available from the official repository.

https://wordpress.org/gutenberg/

Other

10.39k stars 4.15k forks source link

Gutenberg still parsing HTML with regular expressions #5967

Closed gschoppe closed 5 years ago

gschoppe commented 6 years ago

https://github.com/WordPress/gutenberg/blob/ea9416b9e489fd4ba6375413fcd7248a55bffa43/lib/blocks.php#L144

I was under the impression that one of the reasons that HTML comments were considered an acceptable structure, was that they would be parsed with a formal grammar (peg). In PHP, blocks are still being parsed with regular expressions, which are insufficient for processing html safely.

danielbachhuber commented 6 years ago

Hi @gschoppe,

Thanks for the issue.

I was under the impression that one of the reasons that HTML comments were considered an acceptable structure, was that they would be parsed with a formal grammar (peg).

Can you share where you originally gained this impression?

In PHP, blocks are still being parsed with regular expressions, which are insufficient for processing html safely.

Can you share more detail on your concerns and your suggestions?

gschoppe commented 6 years ago

The goal of the PEG parser in Gutenberg is to provide a formal post grammar that can be parsed as a headless tree. This is what it does in JavaScript. As such, falling back to using regular expressions (which cannot parse HTML as a formal grammar, due to the two formats being at different chomsky levels)

You can see the discussion by @dmsnell and @nylen here about the need to have the PHP and JS parsers match in functionality, and you can see the comment by @dmsnell here where he states that as a project, Gutenberg chose to not parse via regex.

As for the issues, this regex carries over the issues of shortcode parsing. A simple example is attempting to parse a block that contains another copy of itself as nested content.

mtias commented 6 years ago

One important clarification: the PEG parser still exists in PHP and is exposed to be used when full-tree consumption is needed. The decision to not use it for typical rendering was because we were unnecessary parsing every block into a tree only to reassemble again, while the only needed operation by default is to transform dynamic blocks (those that have render callbacks assigned) at render time.

There's a discussion on how to accomplish this still using the same grammar spec by introducing another token for dynamic blocks in https://github.com/WordPress/gutenberg/pull/4591 That's pending finding some time to work through the requirements. (We also really need a collection of posts to test the various heuristics.)

danielbachhuber commented 6 years ago

@mtias Can you provide a top of your head list of what types of posts you want? @gschoppe Do you have bandwidth to flesh out the examples?

dmsnell commented 6 years ago

Thanks @gschoppe for paying attention to the code!

I just wanted to chime in and provide some extra context. While I too like to roll my eyes when I see a regular expression trying to parse HTML, what I'm really trying to express is my disapproval of trying to parse non-regular languages with a regular grammar (in the Chomsky sense: recursion, context-free rules, etc…).

It's true we're using the preg_match(…) function here but it's not true that we're crossing that technical boundary. The regular expressions in this care are matching a regular language: blocks constrained to have no nested blocks in them. That is, we're not actually doing anything here that violates the principle which RegExp + HTML breaks. If there were the possibility of a valid dynamic block here having a nested block inside it then we would have to replace this preg_match() with something that can handle that nesting without breaking.

In some unrelated conversation today we tossed around some code which uses the existing grammar to create the same kind of parse the referenced function already does. That is, the specification will remain formal but the implementation may not (as long as it properly implements the spec).

Hope that helps! If you look at any well-established formal tested performant robust parser system then they will likely allow for the use of regular expressions to tokenize, which is the gist of what this function is doing.

gschoppe commented 6 years ago

@mtias & @dmsnell I understand the speed implications of not parsing the full tree at run time, but at the moment, the dynamic blocks are never passed through the PEG parser at all. All the parsing, not just tokenization, is being done based on a regular expression. This may be somewhat safe as long as dynamic blocks are not allowed to have content or nested content, but I believe those are artificial constraints on the API that are going to cause problems eventually.

I also believe that assuming that only dynamic blocks will ever need to be parsed at run-time is an artificial limitation on the capabilities of the API. I can think of many reasons that themes would want to parse specific static blocks, before rendering (such as switching a columns block to use bootstrap-compatible html)

There was talk in #4591 of a light parser that could be run on every block, just to get the name and identify its position in the post, then could check to see if any hooks had been registered on that block. if no hooks and no render function, just output the HTML content. If a short-circuit filter exists, parse the block and pass the name, content, and attributes to the filter. If a render_callback exists, parse and run the render function with attributes, content, and name.

There was some argument in that ticket as to whether the full parser could just be run by plugins as needed, but that introduces problems of its own. It means that for each plugin that needs to fully parse a block at runtime, an additional complete run of the parser is added. This means that a function that was considered too slow to run on standard page loads may get run an unknown number of times on each page load, depending on the combinations of themes and plugins in use.

In general, there is an ideological issue with structured data being handled as unstructured data on page load, and parsing with a regular expression is a symptom thereof. If the data is in a PEG-defined format, it needs to be parsed through that system on load, so that plugins can make use of the structure.

mtias commented 6 years ago

Good thoughts!

If the data is in a PEG-defined format, it needs to be parsed through that system on load, so that plugins can make use of the structure.

The performance was not the main driver for avoiding a parsing step on the server. The reason is more that we should not add layers of potential performance degradations for speculative cases (given a default state without a plugin wanting access to a given block in the tree-shape format). That should be something that happens when necessary, not something that affects all cases irrespective of the context even if it's performant.

The nested issue that prompted the alternative implementation was merely a sign that the server doesn't know how to put together a post from a full-tree structure, and it highlights that unless you are operating on the whole set of blocks (example, moving blocks around, merging blocks, etc), you don't need to generate the full-tree structure for the entire post, only to operate within a node. That is just wasted effort and added complexity for the server at the moment.

I do think there are some valid cases for getting a portion of the post in tree format — just the block sub-tree you care about. This might be something we can add to the render_callback signature, maybe an argument where you specify you want the tree representation for the specific subtree (including nested blocks) given that it's a fair assumption that the person operating on a specific tree node has a say in how it is put back together. This would naturally use the same grammar mechanism, but probably amended so that it can run within a "block" context (so it only cares about a specific block name, for instance).

The problem with a full tree representation is that it might affect block nodes that you don't know how to put together because you lack the block specific implementation that's part of the editor interface.

Also, it's important we solidify the grammar as a canonical description of how posts are constructed (maybe as a document) so that different and potentially better parsers are written against the same set of tests to determine the same output. The tokenization (which is really what the regex in that other PR is doing for dynamic blocks) is the most trivial part of the process since our syntax is regular.

gschoppe commented 6 years ago

The reason is more that we should not add layers of potential performance degradations for speculative cases (given a default state without a plugin wanting access to a given block in the tree-shape format). That should be something that happens when necessary, not something that affects all cases irrespective of the context even if it's performant.

The problem with this is that if there is not a guaranteed common parse of the data to tree format, any plugin that does need that format will need to add an additional run of the parser, which will create a situation where each plugin that needs to parse the post has a geometric impact on performance. Three plugins that need to deal with the tree state would mean three full parses of the post on load.

The nested issue that prompted the alternative implementation was merely a sign that the server doesn't know how to put together a post from a full-tree structure

This could be solved by limiting wrapping blocks from altering the implementation of the blocks they contain. For example, with columns, if the columns were defined as separate divs that contain blocks, rather than a modification of the contained blocks themselves, then each block's implementation could be atomic.

That way, you would simply pass the inner HTML of a block to the render function, along with everything else, and if there is no render callback or filter on the block, if would just return the HTML. In the case of wrapping blocks, you would treat the HTML that wraps each "child-block area" as an array of HTML strings, that will be interleaved with the set of rendered contained blocks, when that block is rendered, unless a callback or filter steps in.

The issue of rendering from tree-view on server side needs to be solved in some way, whether that full parse happens by default or not, or else the ability for plugins to parse to tree-view becomes academic. What good is their ability to run the PEG parser, if the result is a lossy version of the data, that cannot be turned back into a post?

The tokenization (which is really what the regex in that other PR is doing for dynamic blocks) is the most trivial part of the process since our syntax is regular.

This only holds true so long as the grammar is defined in such a way that nesting blocks cannot contain an instance of themselves. Before that pull request modified the definition of the render callback, there was no such limitation on blocks, and the PEG parser itself places no such limitation.

I posted a couple examples of block structures that would benefit from self-nesting dynamic blocks. I really don't believe they should be culled from the capabilities of the language, just to make a regex viable.

mtias commented 6 years ago

That way, you would simply pass the inner HTML of a block to the render function, along with everything else, and if there is no render callback or filter on the block, if would just return the HTML.

You are saying to use the render_callback as a heuristic for whether to return tree-shaped content with inner blocks or just return the HTML when no callback is present? I think that could work, and is in the spirit of what I wrote before — that you should get access to the tree if you know what you are doing. The larger question is whether, by default, WordPress should be assembling the post from a tree when the majority of the content doesn't have that need.

The issue of rendering from tree-view on server side needs to be solved in some way, whether that full parse happens by default or not, or else the ability for plugins to parse to tree-view becomes academic.

Yes, I agree this needs to happen. The question is more whether it needs to happen by default. The level of flexibility in nested groups also means that the representation of inner areas is not always trivial. We have discussed having a beforeHtml and afterHtml, as well as leaving an ephemeral token where innerBlocks were extracted so the server knows where to place them back.

mtias commented 6 years ago

To expand on the latter, given the shape of:

Array [
  Object {
    "attrs": Object {},
    "blockName": "core/block",
    "innerBlocks": Array [],
    "innerHTML": "",
  }
]

We could represent innerHTML as an array of boundary strings (like you mentioned) plus a token placeholder for inner block position.

"innerHTML": Array [ "html", innerBlocksPlacement, "html" ],

And the assumption would be that when there is a single element in the array, there are no block children. This follows some of the original discussion when defining support for nested groups and the importance of not duplicating nested content in the HTML representation. It still stands that we don't want to do this concatenation freely unless there is a need for the structure.

Thanks for the conversation!

gschoppe commented 6 years ago

The larger question is whether, by default, WordPress should be assembling the post from a tree when the majority of the content doesn't have that need.

Personally, I believe there is a need to digest structured data if you send structured data, just to provide consistency to the behavior of the PHP and JS APIs, and to provide locations for filters and actions to be added. After all, there are a lot of developers that need to adopt blocks and start using them, and if they see the default case to be "eh, that just has to do with the editor's UI", they are less likely to learn the system deeply, than if it is a uniform fully-integrated process on both ends.

If it is a deal-breaking issue, and a full parse just can't happen on page load, by default, then there should be a flag to trigger a single canonical parse, via a short-circuit filter. That way, at least, if 5 plugins each need to perform a full parse, they would each only be setting a flag, rather than running their own independent parses that don't integrate with each other.

We could represent innerHTML as an array of boundary strings (like you mentioned) plus a token placeholder for inner block position.

That looks like an excellent way to handle parsing atomic blocks, but it would require making that distinction. The current implementation of the columns block is purposefully non-atomic (it sets a column class designator directly on each of the child blocks), so it could not work with this system. Personally I have no idea how that could work well with placing a dynamic block in a column, since its html is rendered in php, so I think making blocks canonically atomic is a good idea, but there will be dissenting voices regarding additional semantic markup for column containers.

gschoppe commented 6 years ago

We could represent innerHTML as an array of boundary strings (like you mentioned) plus a token placeholder for inner block position.

I was thinking about this more, and considered saying that the token placeholder is unnecessary, since inner blocks will always be separated by blocks of HTML, so you would only need to have an array of the boundary HTML strings, and you could blindly interleave them (i.e. if you have an array of HTML strings that is 4 elements long, you know it represents HTML | Blocks | HTML | Blocks | HTML | Blocks | HTML ).

However, it struck me that you could define nested block areas as a sort of container pseudo-block (only a pseudo-block as it wouldn't have its own render function, instead being rendered by the callback of the parent element), thereby allowing each nested area to contain its own attributes. For complex column logic, or complex conditional logic, or repeaters like accordions and tabs that separation of data might make parsing attributes a lot simpler.

To demonstrate in code, it might look like:

<!-- wp:gjs/timerelease -->
    <!-- wp:container {"interval":"day","index":"0"} -->
         <!-- wp:paragraph -->
    <p class="layout-column-1">Happy Monday!</p>
    <!-- /wp:paragraph -->
    <!-- /wp:container-->
    <!-- wp:container {"interval":"day","index":"1"} -->
         <!-- wp:paragraph -->
    <p class="layout-column-1">Happy Tuesday!</p>
    <!-- /wp:paragraph -->
    <!-- /wp:container-->
    <!-- wp:container {"interval":"day","index":"2"} -->
         <!-- wp:paragraph -->
    <p class="layout-column-1">Happy Wednesday!</p>
    <!-- /wp:paragraph -->
    <!-- /wp:container-->
    <!-- wp:container {"interval":"day","index":"3"} -->
         <!-- wp:paragraph -->
    <p class="layout-column-1">Happy Thursday!</p>
    <!-- /wp:paragraph -->
    <!-- /wp:container-->
    ...
<!-- /wp:gjs/timerelease -->

For a static nested block like columns, it would look more like this:

<!-- wp:columns -->
<div class="wp-block-columns has-2-columns">
    <div class="layout-column-1">
        <!-- wp:container {"layout":"column-1"} -->
            <!-- wp:paragraph -->
                <p>column 1 content</p>
            <!-- /wp:paragraph -->
        <!-- /wp:container -->
    </div>
    <div class="layout-column-2">
        <!-- wp:paragraph {"layout":"column-2"} -->
            <p>column 2 content</p>
        <!-- /wp:paragraph -->
    </div>
</div>
<!-- /wp:columns -->

The larger question is whether, by default, WordPress should be assembling the post from a tree when the majority of the content doesn't have that need.

As an example of a more common situation that tree-parsing could assist in, I think parsing the image block on load could have some serious benefits, with regard to handling image assets that have been deleted or renamed, bringing src-sets in with far less complexity, and providing a single point to filter and modify the output for developers. Doing so would completely remove broken image tags, allowing themes to define a fallback image, to use in the case of missing content, and would make plugins like WP Smartcrop much easier and more reliable to implement.

It would also open the door to updating image tags as new features gain better support. For example, the webp format in a <picture> tag isn't yet supported well enough to use on production sites without some messy javascript involved, but with a dynamic image block, it could be subbed into existing content when the support gets universal enough.

I think if the possibility to parse these static blocks is provided by default, a lot of common situations like that will be found organically, as it encourages developers to consider the post as a set of blocks, even when not in the context of the editor.

gschoppe commented 5 years ago

Has this issue been resolved? I haven't seen any merges that would show the PEG parser being used in do_blocks(). It looks like it's just been closed to save time, but this is a major structural shortcoming of Gutenberg.

mtias commented 5 years ago

Sorry didn't add more context, the main change has been all the work gone into https://github.com/WordPress/gutenberg/pull/8083 which given the drastic performance improvements should allow these operations to go through it.