Automattic / jetpack

Security, performance, marketing, and design tools — Jetpack is made by WordPress experts to make WP sites safer and faster, and help you grow your traffic.
https://jetpack.com/

Search: allow matching on first 2 chars of second word #16120

Open gibrown opened 4 years ago

gibrown commented 4 years ago

Currently the index often doesn't match until more than the first two chars have been typed, mostly because the size of the index explodes otherwise. This is probably ok for the first two characters of a query, but not great on the second word, where something like "smart c" or "smart co" does not match "smart coupon". We can probably improve this by running multiple searches on the API when nothing matches, stripping off the extra chars on each retry. Better to show some results than none.
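A rough sketch of that retry idea, in Python for brevity. Everything here is hypothetical: `search_with_fallback`, `fake_search`, and `MIN_PREFIX_LEN` are made-up names, and the fake backend only imitates an index that cannot match prefixes shorter than 3 chars. It is not the Jetpack Search API.

```python
from typing import Callable, List

# Assumption: the index only matches prefixes of at least this many chars.
MIN_PREFIX_LEN = 3


def search_with_fallback(query: str, search: Callable[[str], List[str]]) -> List[str]:
    """Run the query; if nothing matches, drop a too-short last word and retry."""
    results = search(query)
    if results:
        return results

    words = query.split()
    # Only retry for multi-word queries whose last word is shorter than
    # the shortest prefix the index can match.
    if len(words) > 1 and len(words[-1]) < MIN_PREFIX_LEN:
        return search(" ".join(words[:-1]))
    return results


def fake_search(query: str) -> List[str]:
    """Stand-in backend: every query word must be a 3+ char prefix of a title word."""
    titles = ["smart coupon", "smart slider"]

    def word_matches(word: str, title: str) -> bool:
        return len(word) >= MIN_PREFIX_LEN and any(
            t.startswith(word) for t in title.split()
        )

    return [t for t in titles if all(word_matches(w, t) for w in query.split())]


print(search_with_fallback("smart co", fake_search))
print(search_with_fallback("smart cou", fake_search))
```

Note that the retry for "smart co" returns everything matching "smart", so the stripped chars no longer narrow the results. Re-ranking the retry results by the stripped prefix would be one way to tighten that, at the cost of a little more client-side logic.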

We could also look (again) at lowering the minimum number of chars from 3 to 2, but the last time I tried, that really exploded the index size: https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-analyzer-builder.php#L384

There are also some other improvements we could make to that indexing if we were on a newer version of ES.

gibrown commented 3 years ago

Some other cases where this doesn't work:

gibrown commented 2 years ago

A newer version of ES should let us do this in a much more performant way by using the keyword_repeat token filter on the all_content field. We could also turn off stopwords, and use this to better match against stemmed words.
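For reference, a minimal sketch of what an analyzer using `keyword_repeat` could look like, written as a Python dict that serializes to Elasticsearch index settings. The analyzer name `content_with_originals` is made up, and this is not the wpes-lib configuration. `keyword_repeat` emits each token twice, once flagged as a keyword so the stemmer leaves it alone, which means both the original and the stemmed form get indexed. `remove_duplicates` then drops the second copy when stemming did not change the token.

```python
import json

# Hypothetical settings; only the filter names are real Elasticsearch built-ins.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "content_with_originals": {  # made-up analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "keyword_repeat",     # must come before the stemmer
                        "stemmer",            # defaults to English stemming
                        "remove_duplicates",  # drop copies the stemmer left unchanged
                    ],
                }
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```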

robfelty commented 2 years ago

With my current implementation, we can now match words like "pi" and "no", but still not the first two characters of longer words. Given that previous experiments with switching to allow for 2-15 edge ngrams increased the size of the index too much, I think we should stick with 3-15 for now.
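To make the 2-15 versus 3-15 trade-off concrete, here is a toy sketch of edge n-gram generation in Python. The helper names and the three-word vocabulary are made up for illustration. Real index growth also depends on posting-list sizes, since short prefixes are shared by many documents, so the term counts below only hint at the effect.

```python
from typing import List, Set


def edge_ngrams(word: str, min_gram: int, max_gram: int = 15) -> List[str]:
    """Return the prefixes of `word` with length between min_gram and max_gram."""
    upper = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, upper + 1)]


def index_terms(words: List[str], min_gram: int) -> Set[str]:
    """Collect the distinct edge n-grams for a list of words."""
    terms: Set[str] = set()
    for word in words:
        terms.update(edge_ngrams(word, min_gram))
    return terms


vocabulary = ["smart", "coupon", "slider"]  # toy vocabulary
terms_3 = index_terms(vocabulary, min_gram=3)
terms_2 = index_terms(vocabulary, min_gram=2)

print("co" in terms_3, "co" in terms_2)  # False True
print(len(terms_3), len(terms_2))        # 11 14
```

With a minimum of 3, "co" is never emitted as a term, which is why "smart co" finds nothing for "smart coupon". Dropping the minimum to 2 adds one extra term per word here, and in a real index each of those short terms points at a large share of the documents.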

gibrown commented 2 years ago

Updated the description to focus on the second word. I think we can play with the query to be able to handle this, but we decided to punt on figuring it out.

"smart co" not matching "smart coupon" feels like a bug to me while not matching "co" to "coupon" feels ok.