Closed mlissner closed 1 year ago
Checking this issue, I've confirmed this is also happening in ES. There are some possible solutions.
algorithmic stemming
, so stemming is based on the dictionary. However, dictionary-based stemming can be slower and use a significant amount of RAM, and the quality of stemming depends on the quality of the dictionary.
@mlissner let me know what you think.
Dictionary stemming is interesting. I'm a bit surprised they worry so much about memory usage, and I'm a bit concerned that `hunspell` kind of sucks. (I've contributed dozens of legal words to it and still encounter missing ones all the time. It's what Firefox uses, so whenever a word isn't in hunspell as I'm writing an email, I go and report it. Even now, I'm noticing that "hunspell" isn't in Firefox. I guess I'll not report this today?)
Anyhow, even setting that aside, I still feel like it's not great? I think this is more what I had in mind:
Did you see this?
Got it! So the main issue with using `hunspell` is the dictionary quality.
I looked in detail at the ES recipe you shared, ran some tests, and have the following observations:
This approach basically works by adding a multi-field with a variation of the field that won't be affected by stemming.
```python
judge = fields.TextField(
    attr="judges",
    analyzer="text_en_splitting_cl",
    fields={
        # Unstemmed variant, addressed as judge.exact
        "exact": fields.TextField(attr="judges", analyzer="english_exact"),
    },
)
```
In this example `judge` is stemmed while `judge.exact` remains unaltered.
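For reference, the same multi-field idea can be written as a raw index definition. This is only a minimal sketch, not the project's actual settings: the built-in `english` analyzer stands in for `text_en_splitting_cl` (whose definition isn't shown in this thread), and `english_exact` is assumed to be a standard tokenizer plus lowercase filter with no stemmer.

```python
# Sketch of index settings with a stemmed field and an unstemmed ".exact" sub-field.
# Assumptions: "english" stands in for text_en_splitting_cl, and english_exact
# is defined here as tokenizer + lowercase only (no stemming filter).
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "english_exact": {
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "judge": {
                "type": "text",
                "analyzer": "english",  # stemmed version
                "fields": {
                    "exact": {
                        "type": "text",
                        "analyzer": "english_exact",  # unstemmed version
                    }
                },
            }
        }
    },
}

judge_mapping = index_body["mappings"]["properties"]["judge"]
print(judge_mapping["fields"]["exact"]["analyzer"])
```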
When performing a query, it is necessary to specify the field in which you wish to search for either the stemmed or the exact version of the term.
To give users the option to search for either the exact or the stemmed version of a term, it's possible to use the `quote_field_suffix=".exact"` property in the `query_string` query. This way, when a user searches for a query like `information "deposition"`, the term `information` will be stemmed and searched within the stemmed field `judge`, while `deposition` will remain unstemmed and be searched within the exact field `judge.exact`. This approach ensures that only results containing `deposition` are returned, while documents containing only the term `deposit` are excluded.
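A minimal sketch of what such a request body could look like, built as a plain dictionary. The helper name `build_query_string` and the single-field list are just for illustration; only `query`, `fields`, and `quote_field_suffix` are actual `query_string` parameters.

```python
import json


def build_query_string(user_query, fields):
    """Build a query_string body; quoted terms are routed to the ".exact" sub-fields."""
    return {
        "query": {
            "query_string": {
                "query": user_query,
                "fields": fields,
                # Quoted parts of the query are searched in <field>.exact
                "quote_field_suffix": ".exact",
            }
        }
    }


body = build_query_string('information "deposition"', ["judge"])
print(json.dumps(body, indent=2))
```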
But I believe there are some issues with this current approach:
The `quote_field_suffix` property is only available in the `query_string` query, not in the `multi_match` query. Since the `multi_match` query is used to retrieve improved results by selecting the score from the best-matched field, a workaround would be to use only the exact fields in the `multi_match` query. However, this may adversely affect the final score in queries where stemmed words should actually be matched.
`TextFields` in order to work properly.
And the main concern is the significant increase in index size when employing this approach.
Let me know what you think.
Yeah, I kind of thought this would be an issue. Two questions:
Does it make sense to only do this on the full text field (or do we even have a full text field still)? I don't think most fields need this, but the full text kind of does.
I guess there are a couple parts to the index. One is the inverted word index and the other is the actual terms. For this, I'd expect we'd just have to add more tokens to the inverted word index, but my question is, if document 1 has the phrase "A worm eats information near a deposition" does the duplicated field make an index like this (bad approach; 14 rows):
token | position | document number |
---|---|---|
a | 1 | 1 |
worm | 2 | 1 |
eat | 3 | 1 |
inform | 4 | 1 |
near | 5 | 1 |
a | 6 | 1 |
deposit | 7 | 1 |
a | 1 | 1 |
worm | 2 | 1 |
eats | 3 | 1 |
information | 4 | 1 |
near | 5 | 1 |
a | 6 | 1 |
deposition | 7 | 1 |
Or like this (better approach; only adds distinct unstemmed terms; 10 rows):
token | position | document number |
---|---|---|
a | 1 | 1 |
worm | 2 | 1 |
eat | 3 | 1 |
eats | 3 | 1 |
inform | 4 | 1 |
information | 4 | 1 |
near | 5 | 1 |
a | 6 | 1 |
deposit | 7 | 1 |
deposition | 7 | 1 |
In other words, does it duplicate the size of the index or does it only make it larger where the terms differ?
> Does it make sense to only do this on the full text field (or do we even have a full text field still)? I don't think most fields need this, but the full text kind of does.
I did some tests related to this question. When searching in ES, we use the "text" field for displaying snippets and highlights, and it is also searchable in the `query_string` query.
So it would make sense to only add the `exact` version to the "text" field if it were the only field we intend to search.
Currently, in addition to the `query_string`, we use a `multi_match` that looks like this:
```python
from elasticsearch_dsl import Q

# `value` holds the user's query text.
Q(
    "multi_match",
    query=value,
    fields=[
        "caseName",
        "docketNumber",
        "court",
        "judge",
        "sha1",
    ],
    type="phrase",
    operator="AND",
    tie_breaker=0.3,
)
```
This query lets us return better results and mimic the Solr search behavior after applying boosting, since `multi_match` returns the best score from the matched fields.
Let's consider these documents:

```
Doc1: "A worm eats information near a deposition"
Fields indexes:
  caseName:   "a worm eats inform near a deposit"
  text:       "... a worm eats inform near a deposit ..."
  text.exact: "... a worm eats information near a deposition ..."

Doc2: "A worm eats information near a deposit"
Fields indexes:
  caseName:   "a worm eats inform near a deposit"
  text:       "... a worm eats inform near a deposit ..."
  text.exact: "... a worm eats information near a deposit ..."
```
If we search for "deposition", both the `multi_match` and the `query_string` (which also looks over multiple fields) will match both documents, since they look in "caseName" for the word root "deposit".
So it would be possible to add the `exact` version only to the "text" field if we search only in this field and exclude all the others. However, this approach would likely affect search quality, as the score would be calculated solely from the "text" field. Further investigation would be necessary to measure the real impact.
An alternative is to add the `exact` version to all the "Text" fields, so that the stemmed and unstemmed versions are available for every field.
On the `query_string` we could use `quote_field_suffix`, so the search is performed on the "exact" fields if the user uses quotes `""` in the query.
`quote_field_suffix` is not available for the `multi_match` query, but we can get similar behavior by detecting quotes in a query and switching the fields the `multi_match` looks in to their "exact" versions:
```python
[
    "caseName.exact",
    "docketNumber.exact",
    "court.exact",
    "judge.exact",
    "sha1.exact",
]
```
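The quote detection could be sketched roughly like this. The function name `multi_match_fields` and the regular expression are assumptions for illustration, not existing project code.

```python
import re

MULTI_MATCH_FIELDS = ["caseName", "docketNumber", "court", "judge", "sha1"]


def multi_match_fields(query, fields=MULTI_MATCH_FIELDS, suffix=".exact"):
    """Return the ".exact" sub-fields when the query contains a quoted phrase."""
    # A quoted phrase is a pair of double quotes with at least one character inside.
    has_quoted_phrase = re.search(r'"[^"]+"', query) is not None
    if has_quoted_phrase:
        return [f"{field}{suffix}" for field in fields]
    return list(fields)


print(multi_match_fields('"deposition"'))
print(multi_match_fields("deposition"))
```

Note that this switch is all-or-nothing: a mixed query such as `"information" letter` would send both terms to the exact fields, so it cannot stem one term and not the other the way `quote_field_suffix` does.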
> I guess there are a couple parts to the index. One is the inverted word index and the other is the actual terms. For this, I'd expect we'd just have to add more tokens to the inverted word index, but my question is, if document 1 has the phrase "A worm eats information near a deposition" does the duplicated field make an index like this (bad approach; 14 rows):
Yes, I did some research about this.
The actual terms stored in `_source` are not affected; it keeps the original version.
About the inverted index, it's not clear how it works. The documentation just says that if a field has two versions, two documents are indexed.
And here it says: "multi-field in order to have the same content indexed in two different ways".
An Elasticsearch webinar mentions that an inverted index is generated for each version of a field:
So they would be something like:
Document case_name: information near a deposition
For the "caseName" field:
"inform" -> doc1 (caseName)
"near" -> doc1 (caseName)
"a" -> doc1 (caseName)
"deposit" -> doc1 (caseName)
For the "caseName.exact" field:
"information" -> doc1 (caseName.exact)
"near" -> doc1 (caseName.exact)
"a" -> doc1 (caseName.exact)
"deposition" -> doc1 (caseName.exact)
However, these might just be didactic representations for a better understanding. It seems that there is no way to inspect the inverted indexes in Elasticsearch as they are optimized and not human-readable. So internally, it's possible that they are further optimized to prevent the duplication of terms.
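To make the per-field picture concrete, here is a toy model in Python. It assumes each field version gets its own term-to-postings map, as in the webinar's description, and uses a hard-coded lookup in place of a real stemmer. It does not inspect or reproduce actual Elasticsearch internals.

```python
from collections import defaultdict

# Toy stemmer: a hard-coded lookup standing in for a real English stemmer.
TOY_STEMS = {"eats": "eat", "information": "inform", "deposition": "deposit"}


def analyze(text, stem):
    """Lowercase and split on whitespace; optionally apply the toy stemmer."""
    tokens = text.lower().split()
    return [TOY_STEMS.get(token, token) if stem else token for token in tokens]


def build_index(docs, field_analyzers):
    """Map field -> term -> list of (doc_id, position) postings."""
    index = {field: defaultdict(list) for field in field_analyzers}
    for doc_id, text in docs.items():
        for field, stem in field_analyzers.items():
            for position, term in enumerate(analyze(text, stem), start=1):
                index[field][term].append((doc_id, position))
    return index


docs = {1: "A worm eats information near a deposition"}
# True = stemmed field, False = unstemmed ".exact" field
index = build_index(docs, {"caseName": True, "caseName.exact": False})

for field, terms in index.items():
    postings = sum(len(entries) for entries in terms.values())
    print(field, "terms:", len(terms), "postings:", postings)
```

In this toy model the phrase produces 7 postings in each field version, 14 in total, which corresponds to the "bad approach" table. How much a real index grows on disk is a separate question, since the stored structures are compressed.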
So it sounds like the index will definitely be larger if we do this, but I think we should give it a try anyway, because it's a killer feature people want.
I'm not sure I understand everything about `query_string` vs `multi_match`. Let's chat tomorrow, or if you get to this and want to provide some examples or further explanation, that'd be great. If we can, I want to use `quote_field_suffix`, since it would allow a query like...
"information" letter
...to have exact matching on `"information"` and stemmed matching on `letter`. That's cool. I'm afraid if we do our own term detection, we won't be able to do that.
Thanks. So I'll work on implementing the `exact` field for the `TextField` and using `quote_field_suffix`.
I'll gather some examples where the `multi_match` is required. Right now some tests fail if it's removed, but maybe it can be removed with some tweaks to the `query_string`.
A user is reporting that when they search for `deposition`, they get results containing `deposit`. This is pretty impressively bad, gotta say, but maybe there's a common root there somewhere. Must be.
Anyway, we should ease back the stemming or have a way of turning it off someday.
I'm happy to have other examples here too. Just beware that synonyms behave similarly sometimes.