Search Results Block: Strip markdown from excerpts

ryelle commented 10 months ago

Fixes #465. The markdown in the excerpt matches one of the legacy notice shortcodes ([tutorial]), so the shortcode processing renders a notice here. It's also unexpected for markdown to show up in the search results at all, since excerpts are typically plain text. This update parses the markdown and strips the resulting HTML, so that these excerpts behave more like native WordPress excerpts.

To test:

View a handbook search result, for example /block-editor/?s=deprecation
There should be no broken-looking notices
There should be no markdown

Try it on the code reference too; there should be no change.

ryelle commented 10 months ago

Good question. I also spent some time tracking down where the excerpt is generated. The import parsing starts around here, and on line 490 the first sentence of the markdown content is used. Maybe it was assumed it would always be plain text? Only after that is the markdown content parsed to HTML, and that's saved as the post content.

I suppose we could flip those, and pull out the first sentence after parsing as HTML, instead. Strip out the tags, then save it.
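
As a rough illustration of that flipped order, here is a standalone sketch that takes already-rendered HTML, strips the tags, and keeps the first line. It is not the importer code: `excerpt_from_html()` is a hypothetical helper name, PHP's built-in `strip_tags()` stands in for WordPress's `wp_strip_all_tags()`, and the sample HTML is made up.

```php
<?php
/**
 * Sketch only: derive a plain-text excerpt from rendered HTML.
 *
 * Assumptions: excerpt_from_html() is a hypothetical name, and strip_tags()
 * stands in for wp_strip_all_tags() so this runs outside WordPress.
 */
function excerpt_from_html( string $html ): string {
	// Remove every tag, then trim surrounding whitespace.
	$text = trim( strip_tags( $html ) );

	// Without the "s" modifier, "." does not match a newline,
	// so this captures only the first line of the text.
	if ( preg_match( '/^(.+)/', $text, $matches ) ) {
		return $matches[1];
	}

	return '';
}

// Made-up sample: what a markdown link and bold text might look like once parsed.
$html = "<p>See the <a href=\"https://example.com\">tutorial</a> for <strong>details</strong>.</p>\n<p>Second paragraph.</p>";

echo excerpt_from_html( $html ), "\n"; // prints "See the tutorial for details."
```

Because the link is already an anchor tag by the time the excerpt is taken, no square brackets are left over for the shortcode processing to match.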

adamwoodnz commented 10 months ago

Ah nice, yeah that assumption sounds accurate.

> I suppose we could flip those, and pull out the first sentence after parsing as HTML, instead. Strip out the tags, then save it.

Yeah imo it would be more efficient to save the expected content once, at import time, rather than on every page load.

ryelle commented 10 months ago

Okay, I've made an update to wporg-markdown, so that the excerpt processing runs after the HTML parsing.

Index: inc/class-importer.php
===================================================================
--- inc/class-importer.php  (revision 13104)
+++ inc/class-importer.php  (working copy)
@@ -486,12 +486,6 @@
        }
        $markdown = trim( $markdown );

-       // Use the first sentence as the excerpt.
-       $excerpt = '';
-       if ( preg_match( '/^(.+)/', $markdown, $matches ) ) {
-           $excerpt = $matches[1];
-       }
-
        // Transform to HTML and save the post
        $parser = new WPCom_GHF_Markdown_Parser();
        $parser->preserve_shortcodes = false;
@@ -499,6 +493,12 @@

        $html = apply_filters( 'wporg_markdown_after_transform', $html, $this->get_post_type() );

+       // Use the first line as the excerpt, but first strip any HTML.
+       $excerpt = '';
+       if ( preg_match( '/^(.+)/', wp_strip_all_tags( $html ), $matches ) ) {
+           $excerpt = $matches[1];
+       }
+
        add_filter( 'wp_kses_allowed_html', [ $this, 'wp_kses_allow_links' ], 10, 2 );

        $post_data = array(

You can re-parse all handbook pages by adding add_filter( 'wporg_markdown_check_etags', '__return_false' ); and then running yarn wp-env run cli "wp cron event run --all".
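
For a local environment, one place the filter could live is a small throwaway plugin file. Only the `add_filter()` call itself comes from the note above; the file name and location are assumptions.

```php
<?php
// Hypothetical file: wp-content/mu-plugins/force-markdown-reimport.php
// (name and location are assumptions; any code that loads before the
// import cron event runs should do).

// Judging by its name, this filter controls whether the importer checks
// etags before re-fetching; returning false forces every page to be re-parsed.
add_filter( 'wporg_markdown_check_etags', '__return_false' );
```

Presumably you'd remove the filter again once the re-import has run, so later imports go back to the default behavior.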

adamwoodnz commented 10 months ago

> Okay, I've made an update to wporg-markdown, so that the excerpt processing runs after the HTML parsing.

I tried to update my local plugin but couldn't, so I just applied this patch and it worked 👍

I don't see the change on trunk, assume you still need to ship it?

ryelle commented 10 months ago

> I don't see the change on trunk, assume you still need to ship it?

That's right, I added the diff here for "review" since that plugin is trac-based. I figured if that looks good, I'd ship that patch & close this PR.

ryelle commented 10 months ago

Merged into wporg-markdown: https://meta.trac.wordpress.org/changeset/13119

WordPress / wporg-developer

Search Results Block: Strip markdown from excerpts #467