Open freelawbot opened 10 years ago
There is indeed: https://wiki.apache.org/solr/SpellCheckComponent
Though our Solr library doesn't support it yet.
Original Comment By: Mike Lissner
This is probably out of scope for our migration to Elastic Search, but it'd still be a nice thing to add one day, if it's not too hard, and it's worth having on our radar at least.
Got it, I'll be doing some research on how complex it would be to implement this feature. So we can decide if we want to do it now or later.
@mlissner Here are my findings about this feature:
Elasticsearch has a feature called Suggesters that can be used to build this. It works as follows:
You pass the query and the field in which to look for suggestions for misspelled words, e.g.:
search_query.suggest('suggest_case_name', cd.get("q", ""), term={'field': 'caseName'})
It's possible to set multiple suggesters to look for suggestions in different fields at the same time. For example, you can look for suggestions in the caseName, judge, and text fields at the same time.
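To make the multi-field setup concrete, here is a minimal sketch of the request body it would produce, written as a plain dict so it doesn't depend on our search library. The helper name and the `suggest_<field>` naming scheme are assumptions for illustration only; with the DSL, the same thing should be achievable by chaining several `.suggest()` calls like the one above.

```python
def build_term_suggest_body(query, fields):
    """Build a request body with one term suggester per field.

    The helper and the suggester names are hypothetical; only the
    "suggest" / "text" / "term" / "field" keys come from Elasticsearch.
    """
    # A top-level "text" is shared by every suggester in the request.
    suggest = {"text": query}
    for field in fields:
        suggest[f"suggest_{field}"] = {"term": {"field": field}}
    return {"suggest": suggest}


body = build_term_suggest_body(
    "Frdom of Informtion", ["caseName", "judge", "text"]
)
```

Each suggester comes back under its own name in the response, so the per-field results can be told apart.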
If the suggester identifies misspelled words within the user's query, it will provide suggestions for the fields where it has detected potentially correct terms:
User query: Frdom of Informtion
CaseName field suggestions:
frdom -> freedom
informtion -> inform
text field suggestions:
frdom -> freedom
informtion -> inform
In this example, the suggestions were found in both the caseName and the text fields.
From these results, we are able to identify unique suggestions for all possible fields. These can be utilized to form a new query, where misspelled words are replaced with the respective suggestions:
frdom of informtion -> freedom of inform
So before the results, the page could say something like:
Did you mean freedom of inform? (with a link to the suggested query)
That query would then return documents whose caseName or text contains: Freedom of Information
The only potential issue with this approach is that the suggestions might seem incorrect to the users. This is because the suggestion may not always include the original correct word, particularly when dealing with synonyms or stemmed words. For example, as shown above, the suggestion returns 'inform' instead of 'information'. Despite this difference, the suggested query would still function properly since it'll match indexed terms.
Another example:
User query: Fredom Fredland
CaseName field suggestions:
fredom -> freedom
judge suggestions:
fredland -> friedland
text field suggestions:
fredom -> freedom
fredland -> friedland
Applying the same process (selecting the unique suggestions across all fields and replacing the misspelled terms in the query), the suggested query will be:
Did you mean freedom friedland?
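The merge-and-replace step used in both examples could be sketched roughly like this. It assumes the usual term suggester response shape (a list of entries per suggester, each with `text` and `options`). The function names, the sample response, and its scores are made up for illustration; they are not output from our index.

```python
import re


def collect_replacements(suggest_response):
    """Map each misspelled token to its best-scoring suggestion across fields."""
    best = {}  # token -> (score, suggested text)
    for entries in suggest_response.values():
        for entry in entries:
            token = entry["text"].lower()
            for option in entry.get("options", []):
                score = option.get("score", 0.0)
                if token not in best or score > best[token][0]:
                    best[token] = (score, option["text"])
    return {token: text for token, (score, text) in best.items()}


def rewrite_query(query, replacements):
    """Replace every word that has a suggestion; leave the rest untouched."""
    def swap(match):
        word = match.group(0)
        return replacements.get(word.lower(), word)
    return re.sub(r"\w+", swap, query)


# Illustrative response only; the scores are invented.
sample_response = {
    "suggest_caseName": [
        {"text": "fredom", "options": [{"text": "freedom", "score": 0.8}]},
        {"text": "fredland", "options": []},
    ],
    "suggest_judge": [
        {"text": "fredom", "options": []},
        {"text": "fredland", "options": [{"text": "friedland", "score": 0.7}]},
    ],
}

suggested = rewrite_query(
    "Fredom Fredland", collect_replacements(sample_response)
)
print(f"Did you mean {suggested}?")  # Did you mean freedom friedland?
```

Keeping the highest-scoring option per token is just one possible tie-breaker when fields disagree; preferring one field over another would be an equally reasonable choice.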
The other option is to use a Phrase suggester. Instead of returning single-term suggestions, it returns one or more ranked suggestions for the whole phrase, based on the user input, e.g.:
User query: Frdom of Informtion
CaseName suggestions:
1. freedom of inform
2. frdom of inform
3. freedom of informt
So the suggestion could be the first one: Did you mean freedom of inform?
The Phrase suggester is supposed to be capable of rearranging the terms within a query to generate more accurate results. However, in the example below, it didn't rearrange the terms (indexed case name: Freedom of Information).
User query: Informtion of Frdom
CaseName suggestions:
1. inform of freedom
2. inform of frdom
3. informt of freedom
So, the effectiveness of the Phrase Suggester may also depend on the volume of indexed data, as well as adjustments needed to certain parameters to enhance the quality of the suggestions. As phrase suggestions are ranked, it'll make sense to select and show the top suggestion to the user. Nevertheless, there may be scenarios where the highest-ranked suggestion does not contain the original meaning of the user's query.
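For reference, a phrase suggester request and the "take the top one" step might look something like the sketch below. The helper names and the `size` / `max_errors` values are assumptions I haven't tuned, and the sample response reuses the three phrases from the first example with invented scores.

```python
def build_phrase_suggest_body(query, field, size=3, max_errors=2):
    """Build a request body with a single phrase suggester on `field`."""
    return {
        "suggest": {
            "text": query,
            # Hypothetical suggester name.
            "suggest_phrase": {
                "phrase": {
                    "field": field,
                    "size": size,  # how many ranked phrases to return
                    "max_errors": max_errors,  # terms allowed to be misspelled
                }
            },
        }
    }


def top_phrase(suggest_response, name="suggest_phrase"):
    """Return the first (highest-ranked) phrase option, or None."""
    entries = suggest_response.get(name, [])
    options = entries[0]["options"] if entries else []
    return options[0]["text"] if options else None


# Illustrative response only; the scores are invented.
sample_phrase_response = {
    "suggest_phrase": [
        {
            "text": "Frdom of Informtion",
            "options": [
                {"text": "freedom of inform", "score": 0.05},
                {"text": "frdom of inform", "score": 0.03},
                {"text": "freedom of informt", "score": 0.02},
            ],
        }
    ]
}

best_phrase = top_phrase(sample_phrase_response)
```

Since the options are already ranked, only the first one needs to be looked at when building the "Did you mean" link.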
Another observation I made is that the suggestions may not function optimally with some short names, as exemplified in the following cases:
Hng
(Hong)
Jse
(Jose)
The suggester did not return suggestions for these terms. I read that the ES suggestion mechanism uses edit distance, which may not function efficiently on shorter terms. Furthermore, if these terms are not frequent within the indexed data, it could also result in them not being suggested.
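One thing that may be worth checking here: the term suggester has a `min_word_length` option that defaults to 4, so three-letter inputs like Hng or Jse would be skipped unless it is lowered. I haven't verified that this is the cause in our case, so treat the sketch below as something to try rather than a fix. The helper name is hypothetical.

```python
def build_short_term_suggest_body(query, field, min_word_length=3):
    """Term suggester body that also considers short input terms."""
    return {
        "suggest": {
            "text": query,
            # Hypothetical suggester name.
            f"suggest_{field}": {
                "term": {
                    "field": field,
                    # Elasticsearch skips terms shorter than this (default 4).
                    "min_word_length": min_word_length,
                }
            },
        }
    }


short_body = build_short_term_suggest_body("Hng", "judge")
```

Lowering the limit will probably produce noisier suggestions for short tokens, so it would need checking against real queries.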
This feature seems easy to implement considering the process described above. We'll just need to choose which kind of suggester we want to use: the term or the phrase suggester.
Let me know what you think.
Thanks, this is useful. We'll put this as a phase two kind of thing. I'm also curious how it affects performance (which will be easier to test once we have a huge index), and how we will decide whether to show suggestions. Google uses three possible responses:
Fun stuff. We should do this eventually.
Got it. Yes, suggestions can definitely have an impact on search performance. This is because Elasticsearch needs to execute additional queries in order to generate suggestions. The more fields we want to include in the suggestion search, the more time it will require.
Once we reach that point, yes, it'll be easier to measure the performance impact on a large index. We can compare different combinations, such as running a query without suggestions, looking for suggestions in one field, or searching across multiple fields. This way, we can determine the best approach for generating suggestions.
As for whether to display the suggestions, Elasticsearch ranks them with a score, so we can analyze these scores and establish a threshold for deciding when to show the suggestions and when not to.
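A rough sketch of the threshold idea, assuming each option carries a `score` as in the suggester responses. The function name and the 0.5 cutoff are placeholders; the real value would have to come from looking at scores on the full index.

```python
def pick_suggestion(options, min_score=0.5):
    """Return the best option's text, or None if nothing clears the threshold."""
    if not options:
        return None
    best = max(options, key=lambda option: option["score"])
    return best["text"] if best["score"] >= min_score else None


# Illustrative options only; the scores are invented.
shown = pick_suggestion([{"text": "freedom", "score": 0.8}])
hidden = pick_suggestion([{"text": "inform", "score": 0.2}])
```

When the function returns None, the results page would simply skip the "Did you mean" line.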
Is there a Solr plugin (or something) that would detect misspelled words and make suggestions of correctly spelled words? This is along the lines of the "Did you mean ...?" that many search engines provide. Would be nice.