New HTML Tag Processor - An in-depth tutorial

bph commented 1 year ago

Discussed in https://github.com/WordPress/developer-blog-content/discussions/75

^{Originally posted by **bph** February 8, 2023} WordPress 6.2 introduces the first component in a [new HTML processing API](https://github.com/WordPress/gutenberg/issues/44410). Before WordPress 6.2, the block markup was typically updated using regular expressions or the DOMDocument class. Both have downsides. The former is tedious and prone to security issues, while the latter uses libxml2 which [does not support HTML5](https://gitlab.gnome.org/GNOME/libxml2/-/issues/211). ([Adam Zielinski](https://github.com/WordPress/gutenberg/issues/44410#issuecomment-1420787816)) [Beyond the Dev Note](https://github.com/WordPress/gutenberg/issues/44410#issuecomment-1420787816), it would be beneficial to PHP developers if there is a longer in-depth post on the Developer Blog. cc: @adamziel , @dmsnell, @justintadlock

dmsnell commented 1 year ago

Thanks @bph !

dmsnell commented 1 year ago

Status

@adamziel created an interactive tutorial on his website
@adamziel and I have been slowly prepping some interview-style discussion of the new API to lead a post on using it.

bph commented 1 year ago

The Dev Note has been published: Introducing the HTML API in WordPress 6.2

dmsnell commented 1 year ago

Will work on this in the coming week: goal is to have something together, a reasonable draft, by next Friday, April 21

ndiego commented 1 year ago

After discussing this with @justintadlock, I am going to pick up this article.

justintadlock commented 1 year ago

Just noting that @dmsnell seems to be doing this for now.

gaambo commented 1 year ago

I'd like to note one thing (if this isn't the right place for that, please excuse me): I always find the examples for the tag processor "confusing", because they are using $p as a variable for the processor/html. I always expect, that it's modifying a paragrah <p> tag, because of the name of the variable. It's definately a minor thing, but I think using clear variable names would make it easier to follow the examples. So in may usage, I often use $html for the tag processor instance and then when modifying specific tags i use the tag as a variable name - eg: $a = $html->next_tag('a'); or $div = $html->next_tag('div');

dmsnell commented 1 year ago

thanks @gaambo - we deliberated a lot when designing the examples and started with the more verbose $tag_processor name, but it got a bit heavy-feeling. we shorted it to $p for $processor

I encourage folks to avoid storing the tag processor itself or the result of next_tag into names relating to HTML tag names, because that's not what they give you. next_tag() only tells you if it found a given tag, so that $div doesn't refer to any specific DIV element. maybe $found_a_div = $html->next_tag('div') is more descriptive there.

Do you find $processor = new WP_HTML_Tag_Processor( $html ) clearer than $p? $p has a nice terse cadence to it, though I can see where it might be confusing if the assumption is that it's storing some kind of tag.

Same caution exists for calling it $html, because $html is both a string and the name of the input HTML argument to the processor. It's given HTML and creates a Processor.

Naming is hard 😄

gaambo commented 1 year ago

Thanks for the explanation - tht makes sense, and in this case my examples were actually bad 😅 I, personally, think $processor is more more descriptive and better. But, as I said, it's a minor thing, but just wanted to mention it because it really confuses me in examples 😅

dmsnell commented 1 year ago

Hello all, I've reached a first draft of a brand new version of the post.

Happy to have any review you might want to share. As a first draft, I could end up throwing everything out. That being said, the most valuable kinds of feedback will revolve around how the information is organized, things you think might be missing, things that are unclear, and things that you think should be communicated differently.

Thanks for all your collaboration on this!

bph commented 1 year ago

@zzap @marybaum Would you have time to review @dmsnell draft on Google Doc

marybaum commented 1 year ago

Happy to! Can’t for about four hours though — D has a procedure this morning

zzap commented 1 year ago

Hi @dmsnell, thank you so much for writing about this. I'm both personally and professionally very interested in this topic and can't wait for the article to be published.

Overall, I love how it educates us on both Tag Processor and regular expression. I like the flow and structure.

I have a few suggestions:

Is HTML without Regular Expressions the new title? If so, I would add Tag Processor: HTML without Regular Expressions or HTML API:... or something like that. My fear is that people will misjudge and skip it based on that title. For example, the first thought of HTML is not trying to match it with regular expressions. It's doing frontend. Missing the API name will leave people unaware of the article's topic.
I'd add some intro before the first goal. e.g. Let's consider a few goals that can easily be achieved with Tag Processor. The reason is: in the previous paragraph, you talked about the reason why API was introduced in WordPress, and now when I see Goal, my first reaction is that this was the goal for getting API in the core.
This sentence is a bit long. I had to read it twice to understand it fully. Can it be a bit simplified? Taking a step back there’s another problem in view: even if in this one spot the regular expression pattern were complicated enough to understand all of the HTML syntax and all of the semantic processing rules, it would still be one of thousands of places where the code is unique and behaves differently than other HTML processing code and needs its own separate maintenance and bug tasks.
I agree with @gaambo about var naming and $p being confusing.
Filter render_block example is done with closure, which is officially not recommended by wp.org because they are impossible to be removed. Is there any specific reason you used it that way, and if not, can you write it with a callback?
This sentence is also a bit unclear to me (perhaps it's a second language thing): This is durable code because it’s expressing what it seeks to accomplish.

Tiny thingies:

The first sentence has use twice. Is there a better way to say it?
Missing space in greedyand
Plural should be singular in "even if in this one spot the regular expression pattern were complicated enough to understand"

As for use cases, I had the pleasure of seeing Tag Processor being used to remove nofollow attributes from anchors with specified domain names. It's pure joy, and we updated WordPress to 6.2 only to have that available.

Thank you again for writing this article :heart:

dmsnell commented 1 year ago

I had the pleasure of seeing Tag Processor being used to remove nofollow attributes from anchors with specified domain names. It's pure joy, and we updated WordPress to 6.2 only to have that available.

aw. thanks for melting my heart @zzap

Is HTML without Regular Expressions the new title?

Yes, and in fact I wanted to make this more about parsing HTML and Regular Expressions than about the Tag Processor. That is, I've been walking the line carefully about trying to avoid communicating that people ought to find a way to use the Tag Processor - I don't want to convey that through this post. I'd rather address the problem where it's at, "trying to do things in HTML and reaching for regular expressions or DOMDocument, and then show how this API can do it better. In other words, let the problem lead the solution instead of broadcasting this as "you should be using the HTML API."

Any thoughts on how to mix your feedback with this?

Let's consider a few goals that can easily be achieved with Tag Processor when I see Goal, my first reaction is that this was the goal for getting API in the core.

I'd be interested in hearing a re-wording of this. On one hand, it sounds like you appropriately understood the reason this API was introduced - mission accomplished 😆

This sentence is a bit long

Thanks, I'll work on that one.

Is there any specific reason you used it that way

Nope, just trying to avoid distracting from the topic at hand. Happy to change that.

Plural should be singular in "even if in this one spot the regular expression pattern were complicated enough to understand"

I need your help identifying the plural here. If it's the highlighted were that's not plural, it's the appropriate English subjunctive case, in contrast to the often misused even if it was the case.

I'll take a deeper look at all this feedback after I get back from some vacation. Thanks for reviewing the doc!

zzap commented 1 year ago

I need your help identifying the plural here. If it's the highlighted were that's not plural, it's the appropriate English subjunctive case, in contrast to the often misused even if it was the case.

Ahh, it was my bad English. Thank you for explaining it :heart:

Any thoughts on how to mix your feedback with this?

OK, I hear your point. In that case, I'd add the word parsing somewhere. E.g. Parsing HTML? Forget the Regular Expression

I'd be interested in hearing a re-wording of this. On one hand, it sounds like you appropriately understood the reason this API was introduced - mission accomplished laughing

Well, I would just add some small intro for examples there, as a bridge between the article intro and actual examples. Because the first goal wasn't the only goal for getting API into the core. It's nothing major, just feels a bit disconnected, like two independent pieces of content were put one after the other. I don't know if I'm explaining this properly.

dmsnell commented 1 year ago

Thanks again @zzap - made another round of edits and updated the document. In places where you mentioned it was hard to read I changed the sentences around to avoid the problems.

the first goal wasn't the only goal for getting API into the core

Maybe it's good to talk about this more here together. The two examples in the document are basically the leading motivations for the introduction of the HTML API. Even though many more applications are possible now, its inception is quite modest.

It's completely possible I'm being quite ignorant here though and am overlooking something obvious.

In any case, at this point in time, my primary hope is that people at large only start adopting the HTML API for very limited cases like the examples. I want to see continued development and use of the HTML API, but in a way that fosters communication back to its development and informs its needs.

justintadlock commented 1 year ago

@dmsnell - Can change the share settings for the doc to "Anyone with the link" can comment?

dmsnell commented 1 year ago

@justintadlock done.

justintadlock commented 1 year ago

Thanks. I did a full read through it and really enjoyed what you came up with. I left a few minor in-doc comments. Other than those, it looks good to me.

dmsnell commented 1 year ago

Thanks @justintadlock - done.

bph commented 1 year ago

@dmsnell it reads really well. I did another sweep of commas in and out, and other minor suggestions.

In your email space for your .org account you should find an invitation to the developer news blog with a request to accept. Once you feel you are ready, move your text there, and share the Public preview link in a comment below.

Here are two checklists:

Pre-publishing checklist:

[ ] Post Title in Sentence case
[ ] Use "Center Column" template
[ ] Are Category or Categories selected?
[ ] Are Tags identifies?
[ ] Is there an explicit Excerpt?
[ ] Do all images have an alt-text?
[ ] Props added? 🙌 Publish! 📗

Post-publishing checklist

[ ] add Props for reviews to #props channel in WP Slack
[ ] close the issue with a comment to link to the published post
[ ] close the accompanying discussion with the link to the published post.

Some of it is self-explanatory, some is not. I would be happy to walk you through it. It's your first post and I can't be more excited for you to be a part of the Developer blog crew :-)

marybaum commented 1 year ago

A general guide:

most commas belong in pairs, unless they’re setting off a clause.

Also, here are two proper ways to identify a person or a group:

The FSE outreach director, Anne McCarthy, said the group…

FSE Outreach Director Anne McCarthy said the group…

If you use a comma, you need the phrase to start with an article (a, an, or the).

If you don’t use an article, then lose the comma.

from my phone

dmsnell commented 1 year ago

Thanks @marybaum for all your comments. I'd love for you to get credit for all that writing. Would you like to create a draft post so you can be the author? I'll copy the contents from the Google Doc into that if you do.

dmsnell commented 1 year ago

@marybaum I've adopted all your suggestions in the document but left a few questions where it seems like we've intentionally changed what's being said in a way that's factually inaccurate.

would love to see you get the credit for this writing. @bph apart from copying the contents from the Google Doc into a draft post do you need me for anything else on this? I'm extremely excited to put this behind me.

bph commented 1 year ago

It would be great if you could include the pre-publish checklist when you copy/paste as much as you can. But that: would be it. I will push it over the finish line 🥰

marybaum commented 1 year ago

I’m so flattered by all this! I only got about halfway through, and I’m afk today and tomorrow. We’re in Colorado for some mountain tennis and pickleball, but will have some downtime while Dick drives us to some mountain destinations the next few days after that.

bph commented 1 year ago

@dmsnell Let me know how I can assist getting this over the finish line, maybe even this week?

dmsnell commented 1 year ago

@bph let's get that post created with @marybaum as the author. I'm happy to do all the work copying the content into it and reformatting. I'd do this but I don't think I can change the author if I create the post.

marybaum commented 1 year ago

Ah! Tonight’s activities!We’re headed to another national park today, then I’ll do the undone half of that at the hotel after.Meanwhile the photos await…... from my phoneOn Jul 18, 2023, at 9:48 AM, Dennis Snell @.***> wrote: @bph let's get that post created with @marybaum as the author. I'm happy to do all the work copying the content into it and reformatting. I'd do this but I don't think I can change the author if I create the post.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

marybaum commented 1 year ago

Finished editing the second half. I'll pull it into the p2 tomorrow.

marybaum commented 1 year ago

I lied. I'll do that Thursday or Friday.

marybaum commented 1 year ago

Or two months later. Oops.

Welp—it is in the p2. https://developer.wordpress.org/news/?p=2023&preview=1&_ppp=756268466a

bph commented 1 year ago

Thank you so much @marybaum 🙌

@dmsnell could go over this public preview one last time? It's going to be published on Wednesday morning.

I went over the pre-publish list:

[x] Post Title in Sentence case
[x] Use "Center Column" template
[x] Are Category or Categories selected?
[x] Are Tags identifies?
[x] Is there an explicit Excerpt?
[x] Do all images have an alt-text?
[x] Props added? 🙌 Publish! 📗

-[ ] added a "Resources to learn more" section.

Both of you @marybaum & @dmsnell

What do you think about the excerpt: "All by itself, the HTML Tag processor is better than regular expressions. It's convenient, reliable, fast—and You. Can. Read. It. This article shows you in two examples how to get started using the HTML Tag processor."

Do you think we need a Table of Contents?

marybaum commented 1 year ago

I think that excerpt is great! I have also added a featured image.

dmsnell commented 1 year ago

@bph looks like the post has a number of formatting typos in it still; if it's needed I can go fix those formatting issues. I think there might be some conflicts that arose when copying from the Google document; some things are a bit confusing and don't make sense to me when reading them.

Apart from that I think the author of the post is me, but that's not right. I'd prefer we give credit where it's due 😉.

The way I see it, I'm happy to continue to provide some technical review/feedback, but this has transformed into a post that I'm not really an author or editor of, and I'm happy with that. I think I've provided all the technical review already in the Google document that is informative, so beyond what I've already shared there may not be much new I have to offer - I may just be repeating myself if I try to offer another round of review.

Noted that this is lacking from the pre-publish list, but is valuable anyway:

fix the broken formatting throughout the document. mostly this is the inclusion of literal backticks but there are some HTML tags rendered as text and the code blocks are inconsistent in their syntax highlighting.
remove me as the post author.
there are some things that read confusing or at least technically misleading. I think these are intentional choices at this point, but it could help to ask someone else who works with WordPress code to review the document and share feedback. I only bring this up because I think the post still has some potential to speak to developers.

marybaum commented 1 year ago

I'll take a run through it for the backticks.

marybaum commented 1 year ago

I can understand your discomfort with something this florid. You should hear my internal dialogue when I'm writing my own CSS and JSON files, since I am absolutely irrational about things nobody else notices, like typography.

bph commented 1 year ago

We went through the pre-publish checklist before... When you are ready, @marybaum 🚢 it :-)

And here is your post-publish checklist:

[ ] add Props for reviews to #props channel in WP Slack (Example) (use Slack handles)
[x] add copy for a social post as comment to this issue (https://github.com/WordPress/developer-blog-content/issues/152#issuecomment-1654259045)
[x] close the issue with a comment to link to the published post
[x] close the accompanying discussion with the link to the published post.

justintadlock commented 1 year ago

Adding a new item to the pre-publish checklist:

[x] Set a Featured Image for the post (the new Jetpack social image feature works differently now).

bph commented 1 year ago

Published: The HTML API: process your tags, not your pain

justintadlock commented 1 year ago

Note that I switched the social image to the default template. The chosen template + featured image made the post title really hard to read.

WordPress / developer-blog-content