freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Stemming is awfully eager in our search engine #1886

Closed mlissner closed 1 year ago

mlissner commented 2 years ago

A user is reporting that when they search for deposition, they get results containing deposit. This is pretty impressively bad, gotta say, but maybe there's a common root there somewhere. Must be.

Anyway, we should ease back the stemming or have a way of turning it off someday.

I'm happy to have other examples here too. Just beware that synonyms behave similarly sometimes.

albertisfu commented 1 year ago

I checked this issue and confirmed that it also happens in ES. There are some possible solutions:

  1. Deactivate stemming altogether, which might hurt searches for terms that actually share a common root.
  2. Keep stemming and use the keyword_marker filter to specify exceptions for keywords that should not be stemmed. In this particular case, the word "deposition" would remain unchanged.
  3. Use dictionary stemming instead of algorithmic stemming, so stemming is based on a dictionary. However, dictionary-based stemming can be slower and use a significant amount of RAM, and its quality depends on the quality of the dictionary.
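
For illustration, option 2 could be sketched as index settings along these lines. The names `no_stem`, `english_stemmer`, and `english_protected` are hypothetical, and the built-in English stemmer is an assumption standing in for whatever stemmer the real analyzer uses. Only `keyword_marker`, `keywords`, and `stemmer` are Elasticsearch names.

```python
import json

# Hypothetical filter/analyzer names; not taken from the CourtListener code.
settings = {
    "analysis": {
        "filter": {
            # Terms listed here are flagged as keywords and skipped by the stemmer.
            "no_stem": {"type": "keyword_marker", "keywords": ["deposition"]},
            "english_stemmer": {"type": "stemmer", "language": "english"},
        },
        "analyzer": {
            "english_protected": {
                "tokenizer": "standard",
                # keyword_marker has to come before the stemmer in the chain.
                "filter": ["lowercase", "no_stem", "english_stemmer"],
            }
        },
    }
}

print(json.dumps(settings, indent=2))
```

The drawback is that every exception has to be listed by hand, so this only helps for terms someone has already reported.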

@mlissner let me know what you think.

mlissner commented 1 year ago

Dictionary stemming is interesting. I'm a bit surprised they worry so much about memory usage, and I'm a bit concerned that hunspell kind of sucks. (I've contributed dozens of legal words to it and still encounter missing ones all the time. It's what Firefox uses, so whenever a word isn't in hunspell as I'm writing an email, I go and report it. Even now, I'm noticing that "hunspell" isn't in Firefox—I guess I'll not report this today?)

Anyhow, even setting that aside, I still feel like it's not great? I think this is more what I had in mind:

https://www.elastic.co/guide/en/elasticsearch/reference/current/mixing-exact-search-with-stemming.html

Did you see this?

albertisfu commented 1 year ago

Got it! So the main issue with using hunspell is the dictionary quality.

I checked in detail the ES recipe you shared. I did some tests and I've identified the following observations:

This approach basically works by adding a multi-field with a variation of the field that won't be affected by stemming.

judge = fields.TextField(
    attr="judges",
    analyzer="text_en_splitting_cl",
    fields={
        # Sub-field indexed without stemming, addressed as "judge.exact".
        "exact": fields.TextField(attr="judges", analyzer="english_exact"),
    },
)

In this example judge is stemmed while judge.exact remains unaltered.
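
In raw Elasticsearch terms, a declaration like this corresponds roughly to the index body below. This is a sketch under two assumptions: the built-in `english` analyzer stands in for the project's `text_en_splitting_cl`, and `english_exact` is defined as in the linked recipe (standard tokenizer plus lowercase), not as in the CourtListener code.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # No stemmer in the chain, so tokens are kept as written.
                "english_exact": {"tokenizer": "standard", "filter": ["lowercase"]}
            }
        }
    },
    "mappings": {
        "properties": {
            "judge": {
                "type": "text",
                "analyzer": "english",  # assumption: stand-in for text_en_splitting_cl
                "fields": {
                    "exact": {"type": "text", "analyzer": "english_exact"}
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```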

When performing a query, it is necessary to specify the field in which you wish to search for either the stemmed or the exact version of the term.

To provide users with the option to choose whether to search for the exact version or the stemmed version of a term, it's possible to use the quote_field_suffix=".exact" property in the query_string. This way, when a user searches for a query like information "deposition", the term information will be stemmed and searched within the stemmed version field judge, while deposition will remain unstemmed and be searched within the exact version field judge.exact. This approach ensures that only results containing deposition are returned, while documents containing the term deposit are excluded.
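
A minimal sketch of such a request body follows. The helper name `build_query` is hypothetical and the field list is reduced to `judge` for brevity; `query_string`, `fields`, and `quote_field_suffix` are the Elasticsearch parameter names.

```python
import json


def build_query(user_query: str) -> dict:
    """Build a query_string body that sends quoted terms to the .exact sub-field."""
    return {
        "query": {
            "query_string": {
                "query": user_query,
                "fields": ["judge"],
                # Quoted parts of the query are searched in judge.exact instead.
                "quote_field_suffix": ".exact",
            }
        }
    }


body = build_query('information "deposition"')
print(json.dumps(body, indent=2))
```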

But I believe there are some issues with this current approach:

Let me know what you think.

mlissner commented 1 year ago

And the main concern is the significant increase in index size when employing this approach.

Yeah, I kind of thought this would be an issue. Two questions:

  1. Does it make sense to only do this on the full text field (or do we even have a full text field still)? I don't think most fields need this, but the full text kind of does.

  2. I guess there are a couple parts to the index. One is the inverted word index and the other is the actual terms. For this, I'd expect we'd just have to add more tokens to the inverted word index, but my question is, if document 1 has the phrase "A worm eats information near a deposition" does the duplicated field make an index like this (bad approach; 14 rows):

    token        position  document number
    a            1         1
    worm         2         1
    eat          3         1
    inform       4         1
    near         5         1
    a            6         1
    deposit      7         1
    a            1         1
    worm         2         1
    eats         3         1
    information  4         1
    near         5         1
    a            6         1
    deposition   7         1

    Or like this (better approach; only adds distinct unstemmed terms; 10 rows):

    token        position  document number
    a            1         1
    worm         2         1
    eat          3         1
    eats         3         1
    inform       4         1
    information  4         1
    near         5         1
    a            6         1
    deposit      7         1
    deposition   7         1

    In other words, does it duplicate the size of the index or does it only make it larger where the terms differ?
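
The two layouts in the question can be counted with a toy script. This says nothing about how Elasticsearch actually stores postings; the `STEMS` map is a hand-written stand-in for a stemmer that only covers the words in this example.

```python
# Toy stemmer: an assumption covering only the words in this example.
STEMS = {"eats": "eat", "information": "inform", "deposition": "deposit"}

text = "A worm eats information near a deposition"
tokens = text.lower().split()

# (token, position) rows for the unstemmed and stemmed versions of the field.
exact = [(tok, pos) for pos, tok in enumerate(tokens, start=1)]
stemmed = [(STEMS.get(tok, tok), pos) for tok, pos in exact]

# Layout 1: each version of the field is indexed in full.
separate = stemmed + exact

# Layout 2: rows are shared wherever the stemmed and unstemmed tokens agree.
merged = sorted(set(stemmed) | set(exact), key=lambda row: (row[1], row[0]))

print(len(separate), "rows when both versions are indexed in full")
print(len(merged), "rows when identical rows are shared")
```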

albertisfu commented 1 year ago

Does it make sense to only do this on the full text field (or do we even have a full text field still)? I don't think most fields need this, but the full text kind of does.

I did some tests related to this question. When searching in ES, we use the "text" field for displaying snippets and highlights, and it is also searchable in the "query_string" query.

So it would make sense to only add the exact version to the "text" field if it were the only field we intend to search in.

Currently in addition to the query_string we use a multi_match that looks like this:

Q(
    "multi_match",
    query=value,
    fields=[
        "caseName",
        "docketNumber",
        "court",
        "judge",
        "sha1",
    ],
    type="phrase",
    operator="AND",
    tie_breaker=0.3,
)

This query allows us to return better results and mimic the Solr search behavior after applying boosting, since the multi_match returns the best score from the matched fields.

Let's consider the document:

Doc1: "A worm eats information near a deposition"
Field indexes:
  caseName: "a worm eats inform near a deposit"
  text: "... a worm eats inform near a deposit ..."
  text.exact: "... a worm eats information near a deposition ..."

Doc2: "A worm eats information near a deposit"
Field indexes:
  caseName: "a worm eats inform near a deposit"
  text: "... a worm eats inform near a deposit ..."
  text.exact: "... a worm eats information near a deposit ..."

If we search for "deposition", both the multi_match and the query_string (which also looks across multiple fields) will match both documents, because they look in "caseName" for the word root "deposit".
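
The effect can be reproduced with a toy in-memory model of the two documents. The `STEMS` map and the `search` helper are hypothetical simplifications, not Elasticsearch behavior: a field ending in `.exact` is matched on the raw term, and any other field is matched on the stemmed term.

```python
# Toy stemmer: an assumption covering only the words in this example.
STEMS = {"eats": "eat", "information": "inform", "deposition": "deposit"}


def stem(token: str) -> str:
    return STEMS.get(token, token)


docs = {
    "Doc1": "A worm eats information near a deposition",
    "Doc2": "A worm eats information near a deposit",
}

# Each document gets a stemmed caseName and text, plus an unstemmed text.exact.
index = {}
for name, content in docs.items():
    tokens = content.lower().split()
    stemmed = {stem(t) for t in tokens}
    index[name] = {"caseName": stemmed, "text": stemmed, "text.exact": set(tokens)}


def search(term: str, fields: list) -> list:
    """Return the documents where any of the given fields contains the term."""
    hits = []
    for name, doc_fields in index.items():
        for field in fields:
            needle = term.lower() if field.endswith(".exact") else stem(term.lower())
            if needle in doc_fields[field]:
                hits.append(name)
                break
    return hits


print(search("deposition", ["caseName", "text.exact"]))
print(search("deposition", ["text.exact"]))
```

As long as a stemmed field such as `caseName` is part of the search, both documents come back. Only a search restricted to `text.exact` separates them.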

So it would be possible to add the exact version only to the "text" field if we searched only in this field and excluded all the others. However, this approach would likely affect search quality, as the score would be calculated solely from the "text" field. Further investigation would be needed to measure the real impact.

An alternative is to add the exact version to all the "Text" fields.

So that the stemmed and unstemmed versions are available for every field.

On the query_string we could use the quote_field_suffix so the search is performed on the "exact" fields if the user uses quotes "" in the query.

quote_field_suffix is not available for the multi_match query, but we can get similar behavior by detecting quotes in a query and pointing the multi_match at the "exact" versions of the fields:

[
    "caseName.exact",
    "docketNumber.exact",
    "court.exact",
    "judge.exact",
    "sha1.exact",
]
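
A minimal sketch of that quote detection, assuming a hypothetical helper `multi_match_fields` and a simple rule that any pair of double quotes switches the whole field list:

```python
BASE_FIELDS = ["caseName", "docketNumber", "court", "judge", "sha1"]


def multi_match_fields(query: str) -> list[str]:
    """Pick the field list for multi_match based on whether the query has quotes."""
    # Assumption: at least two double quotes means the user quoted something.
    has_quoted_part = query.count('"') >= 2
    if has_quoted_part:
        return [f"{field}.exact" for field in BASE_FIELDS]
    return list(BASE_FIELDS)


print(multi_match_fields('"deposition"'))
print(multi_match_fields("deposition"))
```

Note that this switch is all-or-nothing: a query that mixes quoted and unquoted terms would be searched entirely in the "exact" fields, unlike quote_field_suffix, which applies per quoted term.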

I guess there are a couple parts to the index. One is the inverted word index and the other is the actual terms. For this, I'd expect we'd just have to add more tokens to the inverted word index, but my question is, if document 1 has the phrase "A worm eats information near a deposition" does the duplicated field make an index like this (bad approach; 14 rows):

Yes, I did some research on this. The actual terms stored in the _source are not affected; they remain in their original form.

As for the inverted index, it's not clear how it works. The documentation just says that if a field has two versions, two documents are indexed.

And here it says: multi-field in order to have the same content indexed in two different ways.

An Elasticsearch webinar mentions that an inverted index is generated for each version of a field.

So they would look something like this:

Document case_name: information near a deposition

For the "caseName" field:

"inform" -> doc1 (caseName)
"near" -> doc1 (caseName)
"a" -> doc1 (caseName)
"deposit" -> doc1 (caseName)

For the "caseName.exact" field:

"information" -> doc1 (caseName.exact)
"near" -> doc1 (caseName.exact)
"a" -> doc1 (caseName.exact)
"deposition" -> doc1 (caseName.exact)

However, these might just be didactic representations for a better understanding. It seems that there is no way to inspect the inverted indexes in Elasticsearch as they are optimized and not human-readable. So internally, it's possible that they are further optimized to prevent the duplication of terms.

mlissner commented 1 year ago

So it sounds like the index will definitely be larger if we do this, but I think we should give it a try anyway, because it's a killer feature people want.

I'm not sure I understand everything about query_string vs multi_match. Let's chat tomorrow or if you get to this and want to provide some examples or further explanation, that'd be great. If we can, I want to use quote_field_suffix, since it would allow a query like...

"information" letter

...to have exact matching on "information" and stemmed matching on letter. That's cool. I'm afraid if we do our own term detection, we won't be able to do that.

albertisfu commented 1 year ago

Thanks. I'll work on implementing the exact field for the TextField and using quote_field_suffix. I'll also gather some examples where the multi_match is required. Right now some tests fail if it's removed, but it might be possible to remove it with some tweaks to the query_string.