adamziel commented 2 years ago

What problem is this issue looking to solve?

I'd like Gutenberg PHP code to process HTML using an actual parser instead of regular expressions as it does now.

A few examples of regexps in action:

Regular expressions are error-prone, hard to debug, and can fail due to unexpected corner-cases.

What solution does this issue propose

Let's lean on an HTML parser.

However, let's avoid using DOMDocument if possible. A related discussion surfaced the following problems with it:

It requires a PHP extension that may not be available, meaning we'd need a fallback
It's an XML/HTML4 at heart and is known to have deficiencies in parsing modern HTML

What would we use instead?

Step 1: Identify a parser to lean on

I've looked in Google and Github for PHP HTML Parser, DOM Parsers, component libraries and frameworks, HTML Formatters, and HTML Tokenizers. I also went through this great StackOverflow answer.

Here's the list of libraries I found:

Compatible with PHP 5 and maintained recently

hQuery – Parses HTML, but doesn't support updating attributes or injecting nodes.
HTML Purifier – Not a parser library, but ships with an HTML tokenizer that Gutenberg could reuse.

Neither gets us 100% there, but they'd make great starting points.

Unmaintained recently

Incompatible with PHP 5 or dependent on DOMDocument

Step 2: Figure out what's next

There are three possible ourcomes:

Use a on-DOMDocument parser – how can it be brought to Gutenberg and then to WordPress?
Use a DOMDocument parser – what's the best way to document its limitations?
Don't use any parser at all – would that mean regular expressions are the way to parse HTML in Gutenberg, then?

CC @azaozz @hellofromtonya @dmsnell @draganescu @getdave @scruffian @mtias @youknowriad @anton-vlasenko @noisysocks @Mamaduka @paaljoachim @mcsf

dmsnell commented 2 years ago

I don't consider this a real need at this point, largely because of the raw absence of a usable PHP HTML parser. You are free as a block author to do the parsing yourself in a way that is likely safe, which is far easier to determine on a block-by-block basis than it was before blocks when we had to consider the entire post's contents.

Note that block parsing does not parse HTML at all and the parser is reliable up to the failure modes for corrupt inputs, but in practice it's been as solid as anything else in WordPress.

Note too what we can reliably achieve with PCRE patterns if we take a few extra bits of care:

matching specific HTML opening tags
matching attributes within an HTML tag

Technically PCRE patterns can fully parse HTML due to their recursive nature but we have avoided introducing such complexity into the parse.

What is the motivation for overhauling the entire system to use HTML parsing vs. adding it where it's helpful within a block's server-side rendering function?

adamziel commented 2 years ago

@dmsnell I’d like to avoid reinventing and maintaining all these regexps. A new corner case in the markup could have a nasty ripple effect across the different implementations of the same solution. It matters even more now that we’re adding these class names to previously „unclassed” blocks.

This issue stems from https://github.com/WordPress/gutenberg/pull/40859 where the Parser was discussed extensively. I meant this less as a systemif overhaul, and more of a way to avoid regexps in core blocks as they are getting increasingly complex. See https://github.com/WordPress/gutenberg/pull/42122 for example.

An alternative could be having a few canonical, regexp-based implementations of common tasks such as „update tag attribute.” Not perfect, but least it would ease the maintenance.

spacedmonkey commented 2 years ago

In my plugin wp-rest-blocks, I used pquery to parse block data into PHP data. This plugin is open source, so feel free to look at the code.

dmsnell commented 2 years ago

An alternative could be having a few canonical, regexp-based implementations of common tasks such as „update tag attribute.” Not perfect, but least it would ease the maintenance.

This seems like a much more pragmatic solution that doesn't have to imply system-wide changes. I think we'd want to be cautious about what API we create though because whatever we create will get used beyond where we expect.

For instance, block markup tends to be simple, but any utility we make is likely going to ignore nested tags of the same type and pick the one that occurs lexically first in a document.

I also want to avoid reinventing all the regex's but also the dynamic token work is hopefully going to remove a lot of the need. Also looks like #42122 is creating the problem you are proposing to solve here. Maybe we just leave those CSS classes in the markup?

azaozz commented 2 years ago

Parsing HTML in PHP doesn't seem like a good idea, no matter the use cases. The main problem is that no PHP parser will be able to parse HTML the same way the browsers do. Ever. The differences are mostly in how the browsers handle malformed, incomplete, illegal, messed up HTML. Another big area of headaches is that HTML keeps evolving, the browsers implement new standards pretty much constantly. Who is going to maintain that PHP parser to keep up? :)

Looking at the three example use cases, all sound a bit like "hacks", perhaps. (Hacks in the sense they need better solutions, not trying to tweak HTML in PHP).

Removing white space from HTML should never be needed. The above is a side effect of not using appropriate placeholder.
Injecting inline styles to HTML tags and injecting CSS class names to HTML tags from PHP sound like tasks that should be performed in the editor where the user knows and can see what's happening and why. Tweaking the user input and trying to guess what the user intended is prone to errors and never a good thing.

I agree that some tasks may need to be automated in PHP. For that it would be best to use some form of placeholders in the HTML. A big advantage is that these placeholders can be made visible to the users while they are creating content, if needed. Another big advantage is that they can be standardized and used across different places. Yes, it sounds like shortcodes (ughhhhh!) , but I believe these new "placeholders" can be made quite better and avoid the pitfalls of the old shortcodes.

Alternatively a "string position" based approach may work even better. The idea would be to have start and offset just like PHP's strpos(). These will be added to a named param in the block attributes. So it would look something like:

(Can even be simpler, just prefixed param names). Then these can pretty easily be parsed when parsing the blocks. In PHP we'll get a nice map with all the places and sub-strings that need to be processed (if any).

This would be much cleaner approach than anything that "messes with" the HTML.

andrewserong commented 2 years ago

Oh, great discussion, thanks for raising this @adamziel!

An alternative could be having a few canonical, regexp-based implementations of common tasks such as „update tag attribute.” Not perfect, but least it would ease the maintenance.

I quite like this idea, particularly for some of these more common tasks. Perhaps it could be something like inject_block_wrapper_attribute to match get_block_wrapper_attributes? The function could be scoped to just target injecting attributes to the outer block wrapper.

With the behaviour centralised, we'd hopefully reduce some of the risk surrounding difficult to read or test regex patterns, and it'd open up a potential path to refactor the internals to another parsing / validation approach further down the track?

gziolo commented 2 years ago

A few examples of regexps in action:

Remove whitespaces from the block markup

Inject inline styles to HTML tags

Inject CSS class names to HTML tags

Those use cases highlighted are also related to the ongoing work to create the style engine and dynamic tokens. I would wait for how that evolves before seeking a unified way to work with HTML modifications in blocks on the server. Ideally, it isn’t necessary at all and the block editor can handle that internally.

mtias commented 2 years ago

I think the problem is not very well formulated here which is leading to a wide reaching solution-first approach. It's important to establish that there is a drastic reduction in complexity between parsing generic HTML in PHP and dealing with the internal markup of block boundaries as defined by a block spec. The block tree is already assembled server-side and doesn't need traditional HTML parsing on a grand scale, so the larger considerations of complex nesting, invalid tags, etc, shouldn't be that relevant — the editor doesn't serialize invalid HTML within valid block boundaries, and that should be taken as an axiom. The issue you are running into, then, can be reduced to a more clearly defined problem:

Access to sourced attributes in the server.
Ability to reconstruct block serialized HTML ergonomically.

Until you have both pairs, adding HTML attributes to the static fallback of a block is going to require some manual intervention and knowledge of the block. Some of that knowledge is expressed in the block attribute declaration which uses a selector. You can go a step further and assume you are dealing with just the outer most layer of a block, which is where attributes are generally attached (classes, etc). That means you could have a light extraction layer (you don't want to overwrite classes) and a function block_wrapper( $tag_name, $attributes ) to reassemble (either implicit or explicit), as a possible way.

adamziel commented 1 year ago

With WP_HTML_Tag_Processor merged, reading and updating HTML attributes no longer requires regular expressions. This fulfills the original goal of this issue, so I'm going to close it even though WP_HTML_Tag_Processor isn't a full parser. Feel free to reopen if a need for a full parser arises.

WordPress / gutenberg

Blocks: Process HTML With a Parser (Instead of Regular Expressions) #42128

What problem is this issue looking to solve?

What solution does this issue propose

Step 1: Identify a parser to lean on

Compatible with PHP 5 and maintained recently

Unmaintained recently

Incompatible with PHP 5 or dependent on DOMDocument

Step 2: Figure out what's next