Spelling Suggestions for Search

freelawbot commented 10 years ago

Is there a Solr plugin (or something) that would detect misspelled words and make suggestions of correctly spelled words? This is along the lines of the "Did you mean ...?" that many search engines provide. Would be nice.

Bitbucket: https://bitbucket.org/mlissner/search-and-awareness-platform-courtlistener/issue/219
Originally Reported By: Brian Carver
Originally Created At: 2012-04-21T19:38:04.390

freelawbot commented 10 years ago

There is indeed: https://wiki.apache.org/solr/SpellCheckComponent

Though our Solr library doesn't support it yet.

Original Comment By: Mike Lissner

mlissner commented 1 year ago

This is probably out of scope for our migration to Elasticsearch, but it'd still be a nice thing to add one day if it's not too hard, and it's worth having on our radar at least.

albertisfu commented 1 year ago

Got it. I'll research how complex it would be to implement this feature, so we can decide whether to do it now or later.

albertisfu commented 1 year ago

@mlissner Here are my findings about this feature:

Elasticsearch has a feature called Suggesters that can be used to build this. It works as follows:

Term suggester.

You need to pass the query and the field where you want to look for suggestions for misspelled words, e.g.: search_query.suggest('suggest_case_name', cd.get("q", ""), term={'field': 'caseName'})

It's possible to set multiple suggesters to look for suggestions in different fields at the same time, for example, you can look for suggestions in the caseName, judge, and text fields at the same time.
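As a rough illustration of setting several suggesters at once, the sketch below builds the raw `suggest` section of a search request with one term suggester per field. The helper name `build_term_suggest_body` and the `suggest_<field>` naming scheme are assumptions made for this example, not something the project already has.

```python
def build_term_suggest_body(query_text, fields):
    """Build a request body with one term suggester per field.

    Each suggester is named "suggest_<field>" (a hypothetical naming
    scheme) so its results can be found again in the response.
    """
    return {
        "suggest": {
            f"suggest_{field}": {
                "text": query_text,
                "term": {"field": field},
            }
            for field in fields
        }
    }


body = build_term_suggest_body("Frdom of Informtion", ["caseName", "judge", "text"])
print(sorted(body["suggest"]))
```

The same structure could be produced by chaining one `.suggest(...)` call per field on the search object, as in the single-field example above.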

If the suggester identifies misspelled words within the user's query, it will provide suggestions for the fields where it has detected potentially correct terms:

User query: Frdom of Informtion

caseName field suggestions: frdom -> freedom, informtion -> inform

text field suggestions: frdom -> freedom, informtion -> inform

In this example, the suggestions were found in both the caseName and the text field.

From these results, we are able to identify unique suggestions for all possible fields. These can be utilized to form a new query, where misspelled words are replaced with the respective suggestions:

frdom of informtion -> freedom of inform

So above the results, the page can say something like:

Did you mean freedom of inform? (with a link to the suggested query)

Following that link will return documents whose caseName or text contains: Freedom of Information

The only potential issue with this approach is that the suggestions might seem incorrect to the users. This is because the suggestion may not always include the original correct word, particularly when dealing with synonyms or stemmed words. For example, as shown above, the suggestion returns 'inform' instead of 'information'. Despite this difference, the suggested query would still function properly since it'll match indexed terms.

Another example:

User query: Fredom Fredland

caseName field suggestions: fredom -> freedom

judge field suggestions: fredland -> friedland

text field suggestions: fredom -> freedom, fredland -> friedland

Applying the same process of selecting unique suggestions across all fields and replacing terms in the query, the suggested query will be:

Did you mean freedom friedland?
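The merge-and-replace process used in both examples could be sketched roughly as follows. The response shape mirrors what the term suggester returns (a list of entries per suggester, each with `text` and `options`), but the scores below are made up, the function names are hypothetical, and splitting the query on whitespace is a simplification.

```python
def collect_replacements(suggest_response):
    """Map each misspelled token to its best-scoring option across all fields."""
    best = {}
    for entries in suggest_response.values():
        for entry in entries:
            token = entry["text"].lower()
            for option in entry.get("options", []):
                current = best.get(token)
                if current is None or option["score"] > current[1]:
                    best[token] = (option["text"], option["score"])
    return {token: text for token, (text, _score) in best.items()}


def build_suggested_query(query, suggest_response):
    """Return the corrected query, or None when nothing would change."""
    replacements = collect_replacements(suggest_response)
    words = [replacements.get(w.lower(), w.lower()) for w in query.split()]
    suggested = " ".join(words)
    return suggested if suggested != query.lower() else None


# Illustrative response for "Fredom Fredland"; scores are invented.
sample_response = {
    "suggest_caseName": [
        {"text": "fredom", "options": [{"text": "freedom", "score": 0.83}]},
        {"text": "fredland", "options": []},
    ],
    "suggest_judge": [
        {"text": "fredom", "options": []},
        {"text": "fredland", "options": [{"text": "friedland", "score": 0.87}]},
    ],
}

print(build_suggested_query("Fredom Fredland", sample_response))
```

A more careful version would probably use the `offset` and `length` values that each entry carries, so that punctuation and quoted phrases in the original query are preserved.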

Phrase suggester

The other option is the Phrase suggester. Instead of returning suggestions for single terms, it returns one or more ranked suggestions for the whole phrase based on the user input, e.g.:

User query: Frdom of Informtion

caseName suggestions:
1.- freedom of inform
2.- frdom of inform
3.- freedom of informt

So the suggestion could be the first one: Did you mean freedom of inform?

The Phrase suggester is supposed to be capable of rearranging the terms within a query to generate more accurate results. However, in this example, it didn't rearrange the terms (indexed case name: Freedom of Information).

User query: Informtion of Frdom

caseName suggestions:
1.- inform of freedom
2.- inform of frdom
3.- informt of freedom

So, the effectiveness of the Phrase Suggester may also depend on the volume of indexed data, as well as adjustments needed to certain parameters to enhance the quality of the suggestions. As phrase suggestions are ranked, it'll make sense to select and show the top suggestion to the user. Nevertheless, there may be scenarios where the highest-ranked suggestion does not contain the original meaning of the user's query.
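For reference, a phrase suggester request and the "take the top suggestion" step might look like the sketch below. `size`, `max_errors`, and `direct_generator` are documented phrase suggester options, but the values chosen here are untested guesses, and the helper names and the sample scores are hypothetical.

```python
def build_phrase_suggest_body(query_text, field, size=3, max_errors=2):
    """Build a request body with a single phrase suggester on `field`."""
    return {
        "suggest": {
            f"phrase_{field}": {
                "text": query_text,
                "phrase": {
                    "field": field,
                    "size": size,              # number of ranked phrases to return
                    "max_errors": max_errors,  # max terms treated as misspelled
                    "direct_generator": [
                        {"field": field, "suggest_mode": "always"}
                    ],
                },
            }
        }
    }


def top_phrase_suggestion(suggest_response, name):
    """Return the highest-ranked phrase for suggester `name`, or None."""
    for entry in suggest_response.get(name, []):
        options = entry.get("options", [])
        if options:
            return options[0]["text"]  # options arrive ranked, best first
    return None


# Illustrative response section; the scores are invented.
sample = {
    "phrase_caseName": [
        {
            "text": "Frdom of Informtion",
            "options": [
                {"text": "freedom of inform", "score": 0.02},
                {"text": "frdom of inform", "score": 0.01},
            ],
        }
    ]
}
print(top_phrase_suggestion(sample, "phrase_caseName"))
```

Tuning these parameters is the kind of adjustment mentioned above; the right values would have to be found by testing against a realistic index.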

Another observation I made is that the suggestions may not function optimally with some short names, as exemplified in the following cases:

Hng (Hong)
Jse (Jose)

The suggester did not return suggestions for these terms. I read that the ES suggestion mechanism uses edit distance, which may not function efficiently on shorter terms. Furthermore, if these terms are not frequent within the indexed data, it could also result in them not being suggested.
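To make the short-term problem concrete, the sketch below computes a plain Levenshtein distance for these examples and checks each typo against a minimum word length. The term suggester documents a `min_word_length` option whose default is, to my knowledge, 4, which would exclude three-letter inputs such as Hng and Jse; that default is an assumption here and should be checked against the Elasticsearch version in use. The constant and helper names are hypothetical.

```python
ASSUMED_MIN_WORD_LENGTH = 4  # assumed default of `min_word_length`; verify


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def is_eligible(term, min_word_length=ASSUMED_MIN_WORD_LENGTH):
    """Whether a term is long enough to be considered for suggestions."""
    return len(term) >= min_word_length


for typo, intended in [("hng", "hong"), ("jse", "jose"), ("frdom", "freedom")]:
    print(typo, intended, levenshtein(typo, intended), is_eligible(typo))
```

If the length limit turns out to be the cause, lowering `min_word_length` for the name fields would be one thing to try, at the cost of more (and possibly noisier) suggestions.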

In brief:

This feature seems easy to implement considering the process described above. We'll just need to choose which kind of suggester we want to use: the term suggester or the phrase suggester.

Let me know what you think.

mlissner commented 1 year ago

Thanks, this is useful. We'll put this as a phase two kind of thing. I'm also curious how it affects performance (which will be easier to test once we have a huge index), and how we will decide whether to show suggestions. Google uses three possible responses:

1.- The suggestions aren't interesting; don't show anything about them to the user.
2.- The suggestions are interesting; show the "Did you mean?" prompt.
3.- The suggestions are so useful, we're just going to run them for you and you can do your original query if you really want to.

Fun stuff. We should do this eventually.

albertisfu commented 1 year ago

Got it. Yes, suggestions can definitely have an impact on search performance. This is because Elasticsearch needs to execute additional queries in order to generate suggestions. The more fields we want to include in the suggestion search, the more time it will require.

Once we reach that point, yes, it'll be easier to measure the performance impact on a large index. We can compare different combinations, such as running a query without suggestions, looking for suggestions in one field, or searching across multiple fields. This way, we can determine the best approach for generating suggestions.

As for whether to display the suggestions, Elasticsearch ranks them with a score, so we can analyze these scores and establish a threshold for deciding when to show the suggestions and when not to.
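A score threshold could be combined with the three possible responses mentioned earlier in the thread, roughly as sketched below. The threshold values are placeholders that would need tuning against real data, and the names are hypothetical. Term and phrase suggester scores are likely not on the same scale, so each suggester type would probably need its own thresholds.

```python
from enum import Enum


class SuggestionAction(Enum):
    IGNORE = "ignore"              # show nothing about the suggestion
    DID_YOU_MEAN = "did_you_mean"  # show the "Did you mean ...?" prompt
    AUTO_APPLY = "auto_apply"      # run the suggested query directly


def choose_action(score, show_threshold=0.6, auto_threshold=0.9):
    """Pick how to handle a suggestion from its score.

    Both thresholds are illustrative placeholders, not measured values.
    """
    if score is None or score < show_threshold:
        return SuggestionAction.IGNORE
    if score >= auto_threshold:
        return SuggestionAction.AUTO_APPLY
    return SuggestionAction.DID_YOU_MEAN


for score in (None, 0.3, 0.7, 0.95):
    print(score, choose_action(score).value)
```

When the suggestion is applied automatically, the page would still need a link back to the original query, as in the third response described above.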
