Automattic / jetpack

Security, performance, marketing, and design tools — Jetpack is made by WordPress experts to make WP sites safer and faster, and help you grow your traffic.
https://jetpack.com/

Search: add indexing for Gutenberg, embeds, and media filtering #9054

Open gibrown opened 6 years ago

gibrown commented 6 years ago

Gutenberg is around the corner and will have some interesting impacts on searching.

All of our content gets passed through clean_string() (https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-abstract-field-builder.php#L44), so HTML tags and comments should get stripped and cleaned up. Gutenberg therefore shouldn't make anything worse, but there will be lots of meta content that we could do more with.

A big improvement would be to support filtering by blocks. ?s="search terms"&block=video was suggested. I feel like maybe "block" is not the best name here, but not sure what would be better. I think we want to let end users filter by "media".

We already extract shortcodes, and should probably have something that supports both. So a filter for video should match posts with any video shortcode as well as posts with Gutenberg video blocks.

Right now we have:

This is all implemented with the Jetpack media extractor: https://github.com/Automattic/jetpack/blob/master/_inc/lib/class.media-extractor.php#L95

Something similar could be used for blocks.
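As a rough illustration of what block extraction could look like, here is a minimal Python sketch (not the media extractor's actual code, which is PHP). It relies only on the serialized block delimiters Gutenberg writes into post content, such as `<!-- wp:video {"id":5} -->`. The function name `extract_block_types` is hypothetical.

```python
import re

# Matches opening or self-closing Gutenberg block delimiters, e.g.
# <!-- wp:video {"id":5} --> or <!-- wp:jetpack/map /-->.
# Closing delimiters (<!-- /wp:video -->) are deliberately not matched.
BLOCK_OPEN = re.compile(
    r"<!--\s+wp:([a-z][a-z0-9_-]*(?:/[a-z][a-z0-9_-]*)?)\s"
)


def extract_block_types(post_content):
    """Return the sorted, de-duplicated block names found in post_content."""
    names = set()
    for match in BLOCK_OPEN.finditer(post_content):
        name = match.group(1)
        # Core blocks are serialized without their "core/" namespace.
        if "/" not in name:
            name = "core/" + name
        names.add(name)
    return sorted(names)
```

The resulting list is the kind of value a block_types field could hold, in the same way the shortcode fields hold shortcode names today.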

Proposal:

  1. A couple of new fields that can be used for filtering and indexing. Base the new block fields on what we are doing with shortcodes today.
  2. Add a way for end users to filter by "media" where we take the existence of specific blocks/shortcodes and tag the post as having a type of media: pdf,
  3. Adjust the search query so that we match against block names by default, but do not boost those matches very strongly. Also support matching against the "media" on the post. This requires some form of all_content field as discussed in #8895.
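To make item 2 concrete, here is a small Python sketch of deriving a "media" value from the blocks and shortcodes found in a post. The mapping tables and the function name `media_types` are assumptions for illustration only; the real list of blocks and shortcodes per media type would need to be decided.

```python
# Hypothetical lookup tables: which blocks / shortcodes imply which media type.
MEDIA_BY_BLOCK = {
    "core/video": "video",
    "core/audio": "audio",
    "core/image": "image",
    "core/gallery": "image",
}
MEDIA_BY_SHORTCODE = {
    "video": "video",
    "youtube": "video",
    "vimeo": "video",
    "audio": "audio",
    "soundcloud": "audio",
    "gallery": "image",
}


def media_types(block_types, shortcode_types):
    """Return the sorted media types implied by a post's blocks and shortcodes."""
    found = set()
    for name in block_types:
        if name in MEDIA_BY_BLOCK:
            found.add(MEDIA_BY_BLOCK[name])
    for name in shortcode_types:
        if name in MEDIA_BY_SHORTCODE:
            found.add(MEDIA_BY_SHORTCODE[name])
    return sorted(found)
```

With this shape, a post containing a Gutenberg video block and a post containing a `[youtube]` shortcode would both be tagged `video`, which is the behaviour the proposal asks for.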

For the "media" field we should be able to filter that in the sidebar similar to other fields:

[Image: screen shot 2018-03-14 at 10 51 57 am]

I am not sure an end user would really want to filter by "block". Feels too granular. A paragraph is a block. Certainly a developer will want to filter by block and so we should have a block_types field but that probably won't be exposed in front end search.

@MichaelArestad @jeffgolenski any thoughts on the design here?

Some methods for getting this built:

gibrown commented 6 years ago

Just adding a note that the current shortcode indexing has some holes: it only indexes shortcodes that are also recognized by WP.com, which limits what gets indexed for Jetpack sites.

gravityrail commented 6 years ago

Note we already sync registered shortcode slugs under the option name jetpack_callable_shortcodes, and these can be passed directly into get_shortcode_regex().
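For anyone unfamiliar with the idea, here is a simplified Python sketch of building a matcher from a list of synced slugs. It is much cruder than the pattern WordPress's get_shortcode_regex() produces (it only finds opening tags and ignores attributes and closing tags), and the function names are hypothetical.

```python
import re


def build_shortcode_regex(slugs):
    """Compile a pattern matching the opening tag of any slug in `slugs`."""
    alternation = "|".join(re.escape(slug) for slug in slugs)
    # (?<!\[) skips escaped shortcodes written as [[slug]].
    # The lookahead requires whitespace, "/" or "]" after the slug so that
    # "video" does not match inside "[videopress ...]".
    return re.compile(r"(?<!\[)\[(" + alternation + r")(?=[\s/\]])")


def extract_shortcode_types(content, slugs):
    """Return the sorted, de-duplicated registered shortcodes used in content."""
    if not slugs:
        return []
    pattern = build_shortcode_regex(slugs)
    return sorted({match.group(1) for match in pattern.finditer(content)})
```

Because the slug list comes from the site itself, this approach would pick up shortcodes that WP.com does not know about.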

MichaelArestad commented 6 years ago

I think we could add filtering by "Content types" as well as blocks.

Filtering by content types would make the most sense for most users. For example:

Content types
[ ] Products
[ ] Videos
[ ] Images
[ ] Sounds

And maybe not visible in the UI, but for detailed block searches, adding a block query for specific blocks would be super cool. ?s="search terms"&block=quote
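A quick Python sketch of reading such a query string, assuming the parameter is called `block` as suggested above (the name is still undecided in this thread) and may be repeated to select several blocks:

```python
from urllib.parse import parse_qs, urlsplit


def parse_search_url(url):
    """Split a search URL into (search terms, list of requested block names)."""
    params = parse_qs(urlsplit(url).query)
    # Drop the surrounding double quotes used in the example URL.
    terms = params.get("s", [""])[0].strip('"')
    blocks = params.get("block", [])
    return terms, blocks
```

For example, `parse_search_url('https://example.com/?s=%22search+terms%22&block=quote')` yields the terms `search terms` and the block list `['quote']`.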

gibrown commented 6 years ago

I wonder if we should have a very flexible UI for defining this sort of thing. So in the widget, rather than letting the user filter on one ES field, we let the user specify particular values. They could then group filters however they want.

The Products/Videos/Images example above is actually multiple different fields:

So the user would need to:

More nested movable items. Yay! If we want to go this way, it is probably a separate issue we should open. The url params for the above may get really ugly though. Many of these fields don't exist in the way that we need them. For example, the image filter needs to be an OR on the shortcode and Gutenblock.
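The OR described above could be expressed as an Elasticsearch bool filter roughly like this. This is only a sketch: `block_types` is the field name floated earlier in this issue, while `shortcode_types`, `title`, and `content` are placeholder field names, not necessarily what the index uses.

```python
def media_filter(block_names, shortcode_names):
    """Match posts that contain ANY of the given blocks OR shortcodes."""
    return {
        "bool": {
            "should": [
                {"terms": {"block_types": block_names}},
                {"terms": {"shortcode_types": shortcode_names}},
            ],
            "minimum_should_match": 1,
        }
    }


def search_body(terms, block_names, shortcode_names):
    """Build a request body: full-text match on terms, filtered by media."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": terms, "fields": ["title", "content"]}}
                ],
                # Filters restrict results without affecting relevance scoring.
                "filter": [media_filter(block_names, shortcode_names)],
            }
        }
    }
```

An "Images" filter would then call `search_body(terms, ["core/image", "core/gallery"], ["gallery"])`, which also shows why a single precomputed "media" field would be simpler for the widget than combining fields at query time.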

MichaelArestad commented 6 years ago

> More nested movable items. Yay! If we want to go this way, it is probably a separate issue we should open. The url params for the above may get really ugly though. Many of these fields don't exist in the way that we need them. For example, the image filter needs to be an OR on the shortcode and Gutenblock.

I think it's worth exploring in a separate issue, but it might be needlessly complex. We want to avoid having to make something like this:

https://ps.w.org/mini-loops/assets/screenshot-1.png?rev=598561

gibrown commented 6 years ago

Ya, it is pretty low priority relative to other things. It will come up again.

stale[bot] commented 5 years ago

This issue has been marked as stale. This happened because:

No further action is needed. But it's worth checking if this ticket has clear reproduction steps and it is still reproducible. Feel free to close this issue if you think it's not valid anymore — if you do, please add a brief explanation.

gibrown commented 4 years ago

> so html tags/comments should get stripped and cleaned up

Issue: <br> being removed, causing words to be glued together. Proposed solution: don’t just remove <br>, but replace it with a space.

An extra bug in here is that when we strip tags we sometimes end up concatenating multiple words together. This is because we are using strip_tags(), which removes tags without inserting any whitespace in their place.
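A minimal Python sketch of the proposed fix, generalized from `<br>` to every tag and comment: replace each one with a space, then collapse the extra whitespace. The function name is hypothetical and this is not the clean_string() implementation.

```python
import re

# An HTML comment (including Gutenberg block delimiters) or any tag.
TAG_OR_COMMENT = re.compile(r"<!--.*?-->|<[^>]+>", re.DOTALL)


def strip_tags_with_spaces(markup):
    """Remove tags and comments, leaving a space where each one was."""
    text = TAG_OR_COMMENT.sub(" ", markup)
    # Collapse the runs of whitespace the substitution leaves behind.
    return re.sub(r"\s+", " ", text).strip()
```

One trade-off: inline tags inside a word (for example `<b>bold</b>ly`) would now split the word in two. Limiting the space replacement to `<br>` and block-level tags would avoid that at the cost of a longer tag list.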

gibrown commented 3 years ago

We should be taking all embeds and indexing the contents of each embed. Then we can display snippets from the index when matching. So an embedded tweet can match the text of the tweet rather than the URL of the tweet.
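A rough Python sketch of the idea, assuming embeds appear as a URL on a line of its own (the form WordPress auto-embeds) and that something else supplies the embed's text. The `fetch_embed_text` callable is hypothetical; how the text would actually be obtained (oEmbed response, cached data, etc.) is not settled here.

```python
import re

# A URL alone on its own line, optionally padded with spaces or tabs.
URL_ON_OWN_LINE = re.compile(r"^[ \t]*(https?://\S+)[ \t]*$", re.MULTILINE)


def expand_embeds(content, fetch_embed_text):
    """Replace each stand-alone embed URL with the text of the embed.

    fetch_embed_text(url) should return the embed's text, or None when
    nothing is available, in which case the URL is left in place.
    """
    def replace(match):
        url = match.group(1)
        text = fetch_embed_text(url)
        return text if text else url

    return URL_ON_OWN_LINE.sub(replace, content)
```

The expanded string is what would be sent for indexing, so a search for words in the tweet matches the post and snippets can show the tweet text.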