backdrop / backdrop-issues

Issue tracker for Backdrop core.
144 stars 39 forks source link

[UX] Implement N-Gram/Porter-Stemmer/Fuzzy/Partial-Word search in core. #1320

Open klonos opened 8 years ago

klonos commented 8 years ago

As it is now, the default Drupal/Backdrop search works only with exact search terms. There are a few solutions for Drupal out there, but given our target audience of small businesses and affordable shared hosting environments, things like Apache Solr implementations etc. are not always possible.

As part of the efforts/solution in this issue, a nice-to-have would be solving some of the search-related issues in backdropcms.org, or at least improving the situation. Things like the funny search results in the modules search page for instance: https://github.com/backdrop-ops/backdropcms.org/issues/95#issuecomment-152771301

Here are some acceptable solutions that I believe could be merged/implemented.

Partial Word Search

Respective d.org issue: https://www.drupal.org/node/498752

It was at some point marked as duplicate of the N-Gram issue (see below), but it is a different approach and it has a working D7 patch (according to other people's comments - have not tested myself).

Allow wildcards like * in search

Respective d.org issue: https://www.drupal.org/project/drupal/issues/487764

Started as a D6/D7 request - has a starting patch for D8 - currently slatted for 11.x, but very few comments and activity overall.

Allow users to use advanced search without keywords entered

Respective issue in d.org: https://www.drupal.org/project/drupal/issues/1126688

Problem/Motivation

When using Advanced Search on a Node search page, you have to enter keywords even if there are other valid search conditions chosen. You shouldn't have to.

This has been shifting between D7 and D8, and has patches for both. No recent activity though.

N-Gram

Respective d.org issue: https://www.drupal.org/node/103548

Started as D6/7 but had patches only for D6 - was then moved to D8 and later closed as wontfix for performance reasons. At some point re-opened, and has some recent activity and patches agains D10.

Search is working, but I noticed that it doesn't pick up on partial words. For example, if you search on 'quake' you would expect to get back results containing the term 'earthquakes' but there are no results.

This behavior is also the case with plurals: Searching on 'engineer' when a node only includes 'engineers' will not return that node in the results. It is pretty standard searching behaviour for people to omit plurals and expect to see them in results. For example, searching on 'engineer' should return:

engineers engineering engineer's

...

It implements N-Gram searching. Basically, N-Gram will break the words into N-size pieces and those will be whats added to the index. Thus standard search queries will work as is but will find partial words. The actual code to do that all the N-Gram stuff is really small, the stop words take up most of the lines of code in that example.

For example, N = 3, It will take Bart and break it into Bar, art. And it will take Bart@simpson.com and break it into Bar, art, rt@, t@s .....

And the way search api is written, all queries will be broken into 3-grams as well. So when you search for Bart, it will really search for words in the search index matching Bar, art, thus returning bart@simpson.com

Same is true for earthquake and quake. The module works by trading off the size of the search index for query speed. Which, in my experience is acceptable. Much MUCH MUCH better then LIKE queries /me shivers. I make use of this on a couple small sites that I have deployed with strong results. Pretty elegant module, simple and small.

N-Grams are language independent as well. (Where as, stemming is very language dependent.) Though, the stop words part of that module is language dependent.

Porter-Stemmer

There is a D7 module: https://www.drupal.org/project/porterstemmer

Have tried this myself in the past, but it is language specific:

... Limitations and Notes

  • The Porter stemming algorithm has a few parts that work better with American English than British English, so some British spellings will not be stemmed correctly. It is also definitely English-specific, and non-English content will not be stemmed correctly. ...

...and has other limitations too that the N-Gram/partial-word search does not have, such as:

...

  • The Porter stemming algorithm attempts to reduce words to their lingustic root words -- it does not do general substring matching. So, for instance, it should make "walk", "walking", "walked", and "walks" all match in searching, but it will not make "walking" a match for "king".

Fuzzy Search

D7 module: https://www.drupal.org/project/fuzzysearch

This module provides drupal sites with a fuzzy search engine to allow for broader keyword matches including partial or misspelled keywords.

This one seems to be an implementation of the N-Gram solution as a module instead of a core patch:

Fuzzy matching is implemented by using ngrams. Each word in a node is split into 3 (default) letter lengths, so 'apple' gets indexed with 3 smaller strings 'app', 'ppl', 'ple'. The effect of this is that as long as your search matches X percentage (administerable in the admin settings) of the word the node will be pulled up in the results.

Features (highlighting and respective backdrop issues linking by me):

  • Misspellings and typos still provide relevant results.
  • External scoring factor hooks exposed so contrib modules can give administrators options for scoring.
  • Re-index function available to allow modules to specifically call a certain node for re-indexing at next cron run.
  • Indexing of CCK textfield field types and taxonomy terms.
  • Implements hook_nodeapi's 'update index' op, so current modules integrating with core search will work the same.
  • Works independently of core search.
  • Block provides related content type results from url query (#1317).
  • Ngram length is configurable.
  • Content types may be excluded from results.
  • Fuzzy highlighting of misspelled search terms.
jlfranklin commented 6 years ago

Fuzzy Search requires Search API. Portstemmer doesn't, but I'd bet it could easily be integrated into Search API. The other two are patches to core that tweak the SQL query performing the search, but I don't see any way they score results, and I'd expect the results to be sorted exact matches over starts-with over partials.

Conceptually, I think integrating a search API would be a great first step to adding all of the above as contrib modules.

I say, "conceptually", because I used the Search API module betas in a Drupal 8 site about a year and a half ago. It wasn't an easy module to use, and each upgrade seemed to break our search capabilities in new and weird ways. I can't wholeheartedly endorse integrating the Search API module into core entirely because of my bad experience with it, but maybe we could use it as a starting point or take away some lessons learned from it when designing our own.

izmeez commented 3 months ago

Yes, that issue has a working Drupal 7 patch for Partial Word Search that we tested and used.

Respective d.org issue: https://www.drupal.org/node/498752 @klonos

It was at some point marked as duplicate of the N-Gram issue, but it is a different approach and it has a working D7 patch (according to other people's comments - have not tested myself).

izmeez commented 3 months ago

The patch is only 4 lines. I can feel myself getting distracted again. Didn't I already have something I was working on?

izmeez commented 3 months ago

Manual apply of Drupal 7 patch to Backdrop is easy and works. I will need to create a PR for testing.

izmeez commented 3 months ago

Added PR. To test login in and go to /search/node and enter a partial word into the search field, like exist and view results.

izmeez commented 3 months ago

I have added a PR for testing and review but I do not know how to link it to this issue as a possible fix. Would appreciate advice on how to do that. Thanks.

izmeez commented 3 months ago

Changing the opening comment of the PR as suggested in zulip chat to: Fixes https://github.com/backdrop/backdrop-issues/issues/1320 triggers creating the link to the PR in the issue header. Thanks.

klonos commented 3 months ago

The patch is only 4 lines. I can feel myself getting distracted again. Didn't I already have something I was working on?

Story of my life! 😅

I have added a PR for testing and review ...

Thanks @izmeez 🙏🏼 ...code looks good and matches the patch that was provided a while ago in the drupal core issue queue 👍🏼 (which had already been tested by several people, who reported it working for them - it never moved because it was marked as a feature request, and these were not accepted in D7 core after active development had started in D8+ 😞)

Anyway, I'll test this and report back.

klonos commented 3 months ago

Works for me 👍🏼

...side-by-side screenshot of a 1.28.0 vanilla site (demo sandbox) vs the PR sandbox: image

klonos commented 3 months ago

...since the PR is merely an improvement to allow a very simplistic/light approach to partial word matching, I am categorizing this as a task. Since it is a UX improvement that makes core behave as one would expect out of the box, I believe that we should get it in with a bug fix release, and leave larger plans such as N-Gram/Porter-Stemmer/Fuzzy search as follow-up features.

izmeez commented 3 months ago

When this gets in it will fix #844 since that is content on a Backdrop site and it will also apply to all searches in the forum.

olafgrabienski commented 3 months ago

Works for me at first sight, very promising! That said, I'd like to discuss two issues:


Standard matches are highlighted:

image


Partial matches are not highlighted:

image

klonos commented 3 months ago

Thanks @olafgrabienski for testing and for pointing out the highlighting issue 🙏🏼

@izmeez, do you feel up to venturing on a quest to try to sort that out?

herbdool commented 3 months ago

The PR will also need tests.

In reading the d7 issue on the partial words, I see that there's a downside to this approach where it will return results like "walking" if someone searches for "king". For some people that's fine but we can't assume everyone is.

There's also an issue for adding wildcard capability https://www.drupal.org/project/drupal/issues/487764 which is an other alternative to the ones listed above.

Also, I note that Fuzzy Search module is now ported. It depends on Search API so probably not a candidate for core.

klonos commented 3 months ago

In reading the d7 issue on the partial words, I see that there's a downside to this approach where it will return results like "walking" if someone searches for "king". For some people that's fine but we can't assume everyone is.

Fair enough. We can make this optional by adding a new "Allow partial word matching":

That feels like it would be more of a new feature then though (rather than considered improving existing functionality or fixing "broken" functionality).

There's also an issue for adding wildcard capability https://www.drupal.org/project/drupal/issues/487764 which is an other alternative to the ones listed above.

Thanks. I've added it to the issue summary here.

Also, I note that Fuzzy Search module is now ported. It depends on Search API

Yeah, but this issue here is about improving the core search functionality - without using the Search API module.

izmeez commented 3 months ago

I had three after thoughts:

  1. It may be necessary to Rebuild search index on sites with existing content. (Not confirmed).

  2. Unrelated to this issue, search results reveal author and date info. This may require permissions and access check and/or theming.

  3. Partial words not highlighted in results like full words are. (As noted by @olafgrabienski ). Not sure how to address this, but it would be a nice addition.

The suggestion to use wildcards in the search as described in the drupal issue has been around for a long time. It doesn't appeal as it is not a simple UX. Ordinary people do not add wildcards when searching, they use partials.

A comment of mine from the Drupal issue:

Nonetheless, in our experience many users need either partial word searches or porter stemmer approach (which is not good with multi-language sites) or both. In real life the core search is not adequate for many end users.

and a helpful response:

regarding multilingual linguistic stemming... in Drupal 7, stemming was a problem for multilingual because the stemming modules run off hook_search_preprocess() and that did not have a $langcode parameter. In Drupal 8 that parameter has been added to the hook, so stemming modules should hopefully be able to run side-by-side in a multilingual site, each stemming content in their particular language.

On the question of will people be happy with results such as walking when searching for king that's an unknown. It may be solved by searching with addition of a space king to search for the partial as a prefix. With questions like these I'm not sure where one would begin with tests.

We have used partial word search on Drupal sites for over 15 years. It has been useful not just for searches on content but also users.

izmeez commented 3 months ago

Here is a related Drupal 8 fix, search_excerpt() doesn't highlight words that are matched via search_simplify() with a final comment:

The Drupal 7 port needs to be somewhat different from the Drupal 8 code, because in Drupal 7, there is no $lancode for search_excerpt. Hence, most of the test failures.

Also, we need to make sure to use as a starting point the current Drupal 8 code, not whatever was in the patch on this issue. The reason is that after this was done, #2301715: search_excerpt() uses strrpos() incorrectly was also fixed, which changed the code somewhat. We don't want to reopen that other issue for Drupal 7; we just want to make sure that the code for backport to Drupal 7 on this issue is the current, fixed Drupal 8 code.

Before the above comment there was an attempt at a D7 patch, https://www.drupal.org/project/drupal/issues/916086#comment-9359421

izmeez commented 3 months ago

@klonos

We can make this optional by adding a new "Allow partial word matching":

* either as a setting in `admin/config/search/settings` that changes the behavior globally for all searches.

* or by adding it as a checkbox in the various search forms at `search/*`, so that people can choose whether to include partial matching or not with each search they perform.

This is a good idea. I would favour the first of these.

klonos commented 3 months ago

Not sure how to address this, but it would be a nice addition.

I have been experimenting with this on my local. Here's what I currently have: image As you can see from the screenshot:

The code in search_excerpt() made my head hurt (you gotta love all these single-letter variables 🙄 ...and no comments to explain why we are limiting the characters in the highlighted text to 256 ...and then comments like // Let translators have the ... separator text as one chunk. which only explain the "what" but not the "why" 🙄 🙄 ) ...so I've simplified it quite a bit.

Not sure if the changes would be acceptable, but I'll polish it a bit further and provide it here as a PR for future use/inspiration anyway.

PS: would need some extensive testing with multilingual sites. Things to test:

klonos commented 3 months ago

BTW, thank you for the additional research and links to d.org issues and patches @izmeez 🙏🏼 ...it helps to see what others have tried before.

izmeez commented 3 months ago

@klonos You are someone braver than most to venture into the depths of code.

klonos commented 3 months ago

I'm temporarily assigning this to myself, to polish what I have on my local and push it in a PR, but not 100% confident that I can drive this to end (I might need help with tests for example).

Also changing it back to a feature request.

izmeez commented 3 months ago

Testing the core search with the partial word patch backported from drupal 7 and also the contrib module porterstemmer. The porterstemmer module suffers from the same problem of not highlighting parts to illustrate the connections.

A quick glance suggests there may be some searches that fail with core and the patch and work with addition of the module. I'll report details soon.

izmeez commented 3 months ago

I haven't retested this. With existing content including the word managers plural, core search for manager fails and surprisingly so did the partial word search PR patch. Only when the porterstemmer contrib module was enabled did the searches work. I had done some back and forth and don't know if residual uninstalled pieces could have played a role.

olafgrabienski commented 3 months ago

@izmeez In the PR sandbox search for "manager" works for me, when there is content with the term "managers". Did you rebuild the search index after adding the term "managers"? (There is a button in the search configuration, another way is running cron.)

klonos commented 3 months ago

There's also an issue for adding wildcard capability https://www.drupal.org/project/drupal/issues/487764 which is an other alternative to the ones listed above.

@herbdool I have taken a proper look into that issue. I agree that it can be useful and would like us to consider it for inclusion, but the use of asterisks sounds like something rather advanced. Besides, I think that you can achieve most of what that does with partial matching. Breaking down the things proposed over in that d.org issue:

  1. you might want to search for just * (with some advanced search filters): that is already supported as far as I can tell - you just need to expand the "Advanced search" fieldset, and add a phrase in the Containing the phrase field or a word in the Containing none of the words field. The only problem I can see, which this request might be trying to address, is this: image

    ...which is what happens if you add a word in the Containing none of the words field so that it can be excluded from the search without specifying any words to include at the same time. The same happens if you add a minus/dash in front of a single keyword in the regular search field. (btw, see #1630 and https://github.com/backdrop/backdrop/pull/1233 where we addressed some related bugs around that).

  2. search for *word* to find words containing "word" somewhere inside them: yup, that's exactly what partial word matching does - implementing something like this would be using asterisks to explicitly use that feature. I would prefer an exposed "allow partial word matching" checkbox instead, which would make the feature more obvious ...and more than that, I'd prefer this to be the default functionality really.
  3. or search for foo*bar to find words that start with foo and end with bar: that would be something new indeed.

So perhaps it would be a good idea to only limit any such asterisk-based wildcard behavior to the "Advanced search" section/fieldset? 🤔 ...I mean, most people are used to searching for things in Google and their browser address bar or via CTRL+F, and these things usually support partial matching OOTB (as they do fuzzy/stemming etc.) without needing to use any sort of special/advanced characters. Sure, there is support for advanced search (wrapping words in double quotes for exact matching, as well as adding a minus/dash in front of a word to exclude it from the search come to mind for instance), but that is not something exposed in the UI as an "everyday" feature.

izmeez commented 3 months ago

@olafgrabienski Thanks for checking, I haven't had time to do so myself.

izmeez commented 2 months ago

I was just learning more and when I updated my personal backdrop repo and rebased the issue fork I received a message that tugboat had created a new version. I took a look and apart from some errors I don't fathom it seems to be working. Of course, without the goodness that @klonos has promised. I see I had started on adding an option for this to the UI and haven't finished that.

izmeez commented 1 month ago

I have done more work on adding this as a global option on the search settings page. In doing so I have found there is more I need to learn about doing this right in backdrop and becoming more familiar with using Form API (FAPI). I will try to reach out to Office Hours to see if someone can help me with this, or if someone is available to do this outside hours. I'm not looking for someone else to do it, I'd like to learn more about how to do it.

izmeez commented 1 month ago

There will of course still be the extra goodness that @klonos has shown with highlighting the partial results, so he will still need to be there for that, later.

klonos commented 1 month ago

I will try to reach out to Office Hours to see if someone can help me with this, or if someone is available to do this outside hours. I'm not looking for someone else to do it, I'd like to learn more about how to do it.

Hey @izmeez 👋🏼 ...did you get the assistance you needed with the Form API on this? I'd be happy to help you learn more. @stpaultim showed interest in the same and we had a short session at some point in the past, but I don't think that it was recorded. Perhaps we can work on that during the next office hours.

I'm temporarily assigning this to myself, to polish what I have on my local and push it in a PR, but not 100% confident that I can drive this to end (I might need help with tests for example).

I never actually gotten to that after all 😞 ...I'm currently on leave and on summer-lazy mode, but I'll try to find time in-between having summer fun, to have some Backdrop fun. It'd be nice to get this nice feature in for 1.29.0, so I'm un-assigning myself and back to you @izmeez. I'll still push what I have so you can get some inspiration for your PR, but the point is to try to get the basic functionality in, and then perhaps follow-up with separate issues to improve on it 😉

klonos commented 1 month ago

...I checked what I have on my local, and it is all related to generating and trimming/styling excerpts of the search results (basically highlighting all matches in each individual excerpt). The code is rather complex and I don't believe that it can be simplified much. So, since I don't want to derail the basic functionality discussed here, we should focus here on:

Then we can consider the following as follow-up issues/PRs:

izmeez commented 1 month ago

@klonos

  • cross-porting https://www.drupal.org/node/498752 (which is really simple and easy to review/test)

  • adding a setting in admin/config/search/settings to turn this feature on/off

  • deciding if this should be on of off by default on new installations

The first is done with a PR that works. The second is where I'm stuck. I considered radios and checkbox and think a checkbox is a more simple UI and better for this. The third, whether it is on by default. While I think there is a lot of merit to this feature, it should probably be off by default for backward compatibility.

On the followup to highlight partial matches, I believe you looked at the 2 related Drupal issues and then came up with a more elegant solution. We should go with that if possible. It would be nice to have in the feature right away.

nice-to-have's:

manual: exposing some sort of "show partial matches" checkbox in the actual search form at /search automagic: returning partial matches if no actual exact matches have been found + throwing a message to indicate that

The nice-to-have's could be left for a follow-up. Obviously it would depend on the partial search not being enabled globally and then a site admin might want to see stats on how many times it gets used. I can see how this may be a good item for end users if the admin has not enabled partial search globally.

izmeez commented 3 weeks ago

With assistance from @klonos the PR has been updated to include a checkbox to enable partial word search on the global search settings config page at admin/config/search/settings. The PR needs testing and review.

Plans discussed today are that @klonos will look at adding the feature to highlight partial word matches that he demonstrated as a proof of concept in a previous comment, https://github.com/backdrop/backdrop-issues/issues/1320#issuecomment-2148030211. This will be a great addition to the feature.

Subsequently, probably as a followup issue, it may be worth adding an option to the search results page at /search/node/type or search/user whereby the user can enable partial word search to results if it has not been enabled globally or to automagically provide partial word search results when full word search results return none.

izmeez commented 3 weeks ago

The search/user page may already return partial matching results and not be affected by this PR.

olafgrabienski commented 3 weeks ago

The checkbox to enable partial word search works for me. Looking forward to the highlighting of partial words.

We should maybe have a new look at the search interface text then, e.g. if using the term "word" makes still sense in every situation.

herbdool commented 3 weeks ago

There's some promise here but also some concerns, some of which have already been raised on the Drupal issue.

indigoxela commented 2 weeks ago

I've been asked in our chat to add my comment here:

Word stemming is always related to the actual language. That makes it difficult to implement in core.

Partial word search, on the other hand, (adapting the query position of the % placeholder) works for all languages. But ideally, the exact match should have a higher position in the result list.

I like the direction of this PR. :+1: Still some work to do, though, as @herbdool already pointed out in the previous comment.

stpaultim commented 2 weeks ago

Need to ensure that this doesn't have a big negative performance impact.

This was discussed in the last dev meeting. @quicksketch said that he doesn't think it will have any performance impact.