elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

“totalTermFreq must be at least docFreq” error after upgrading to 8.4.2 #90275

Closed alexwlchan closed 2 years ago

alexwlchan commented 2 years ago

Elasticsearch Version

8.4.2 (in Docker)

Installed Plugins

Whatever comes in the Elasticsearch Docker image

Java Version

bundled (java -version doesn't work inside the Docker image)

OS Version

Whatever's in the Elasticsearch Docker image:

$ uname -a
Linux c8dae8d214e8 5.10.76-linuxkit #1 SMP Mon Nov 8 10:21:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

We're seeing the following error from queries in Elasticsearch 8.4.2:

totalTermFreq must be at least docFreq, totalTermFreq: 413958, docFreq: 413959

These are queries that worked successfully in Elasticsearch 8.4.1.

The errors are reproducible, but don't affect all queries.

The numbers in the error message seem to be stable.

Steps to Reproduce

The index mapping:

```json
{
  "dynamic": "strict",
  "properties": {
    "edition": { "type": "text" },
    "notes": { "type": "text" },
    "physicalDescription": { "type": "text" }
  }
}
```

An example document:

```json
{
  "notes": [
    "This material has been provided by University of Bristol Library. The original may be consulted at University of Bristol Library."
  ],
  "physicalDescription": "xv, 276 pages ; 19 cm",
  "edition": "New [3rd] ed."
}
```

The query:

```json
{
  "bool": {
    "should": [
      {
        "multi_match": {
          "query": "1",
          "fields": [
            "physicalDescription",
            "edition",
            "notes"
          ],
          "type": "cross_fields",
          "operator": "And",
          "_name": "data"
        }
      }
    ]
  }
}
```

The error:

```json
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "totalTermFreq must be at least docFreq, totalTermFreq: 413958, docFreq: 413959"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "works-indexed-2022-08-24",
        "node": "ZTz1Ump1THGC5lMd9VL3XQ",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "totalTermFreq must be at least docFreq, totalTermFreq: 413958, docFreq: 413959"
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "totalTermFreq must be at least docFreq, totalTermFreq: 413958, docFreq: 413959",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "totalTermFreq must be at least docFreq, totalTermFreq: 413958, docFreq: 413959"
      }
    }
  },
  "status": 400
}
```

Steps to reproduce

  1. Start an instance of the Elasticsearch Docker container:

    docker run \
      --env xpack.security.enabled=false \
      --env discovery.type=single-node \
      --publish 9200:9200 \
      -it docker.elastic.co/elasticsearch/elasticsearch:8.4.2

    (These settings may not all be required, but they're how I usually run the local Docker image, and they avoid having to do the password/CA certs dance. I'm running Docker on macOS, although I don't think that makes a difference.) An optional pre-flight check for this setup is sketched after the script in step 3.

  2. Download the following data set from S3: https://wellcomecollection-data-public-delta.s3.eu-west-1.amazonaws.com/elasticsearch-issue-files/works.json.gz (40MB)

    This contains ~900k documents with a minimal set of fields to cause this error – if I reduce the number of documents or fields, the error goes away.

    (The corpus is CC-BY 4.0 licensed Wellcome Collection data, similar to https://developers.wellcomecollection.org/docs/datasets)

  3. Run the attached Python script, which will:

    • Create a new index with our index mapping
    • Load the contents of works.json.gz into the index
    • Make a minimal query that induces the error


    repro.py:

```python
import gzip
import json
import random

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

client = Elasticsearch("http://localhost:9200")

index = f"example-{random.randint(0, 10000)}"

resp = client.indices.create(
    index=index,
    mappings={
        "dynamic": "strict",
        "properties": {
            "edition": {"type": "text"},
            "notes": {"type": "text"},
            "physicalDescription": {"type": "text"},
        },
    },
)

print("Create index resp:")
print(resp)
print("")


def get_actions():
    for line in gzip.open("works.json.gz"):
        yield {"_index": index, "_op_type": "index", "_source": json.loads(line)}


bulk_resp = bulk(client, get_actions())

print("Bulk resp:")
print(bulk_resp)
print("")

query_resp = client.search(
    index=index,
    query={
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": "1",
                        "fields": [
                            "physicalDescription",
                            "edition",
                            "notes",
                        ],
                        "type": "cross_fields",
                        "operator": "And",
                        "_name": "data",
                    }
                }
            ]
        }
    },
)

print("Query resp:")
print(json.dumps(query_resp.raw, indent=2, sort_keys=True))
```
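
Two optional pre-flight checks before running repro.py (not part of the original steps, just a convenience): confirm the node from step 1 is reachable and reports version 8.4.2, and confirm the download from step 2 contains roughly 900k newline-delimited documents, which is the layout the script above assumes. A minimal sketch:

```python
# Optional pre-flight checks before running repro.py. Assumes the `docker run`
# command from step 1 (security disabled, port 9200) and works.json.gz from
# step 2 in the current directory.
import gzip

from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")
print("Elasticsearch version:", client.info()["version"]["number"])  # expect 8.4.2

with gzip.open("works.json.gz") as f:
    print("Documents in works.json.gz:", sum(1 for _ in f))  # roughly 900k
```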

Expected behaviour

The search request returns a list of results.

Actual behaviour

We get an error from the Elasticsearch Python library:

elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'totalTermFreq must be at least docFreq, totalTermFreq: 287039, docFreq: 287040')
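
If it helps anyone else debugging this, the full shard-level failure (the same JSON shown under "The error" above) can be recovered from the exception raised by the 8.x Python client. A minimal sketch, with a placeholder index name:

```python
# Minimal sketch, assuming the 8.x elasticsearch-py client: catch the error and
# print the full response body, which includes the per-shard failure reasons.
import json

from elasticsearch import BadRequestError, Elasticsearch

client = Elasticsearch("http://localhost:9200")

query = {
    "multi_match": {
        "query": "1",
        "fields": ["physicalDescription", "edition", "notes"],
        "type": "cross_fields",
        "operator": "And",
    }
}

try:
    client.search(index="example-1234", query=query)  # placeholder index name
except BadRequestError as err:
    print(err.meta.status)                 # 400
    print(json.dumps(err.body, indent=2))  # search_phase_execution_exception details
```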

Notes

Logs (if relevant)

No response

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

nik9000 commented 2 years ago

"type": "cross_fields",

I believe the bug is with cross_fields. @jtibshirani might be able to confirm. I'll see if I can reproduce and get a stack trace.

alexwlchan commented 2 years ago

(Warning: guesswork based on minimal understanding of the Elasticsearch codebase)

Going off cross_fields as a possible clue, I had a look at the commits between 8.4.1 and 8.4.2.

59f96a2 stood out to me, in particular this line in the commit message:

It makes sense that the term document frequency should never exceed the number of documents containing the field.

If that's true, and I've understood it correctly, then combined with this error message I think the code now assumes the term document frequency and the number of documents containing the field are always the same? And maybe that's not always the case?
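
For anyone who wants to see the two statistics named in the error, the term vectors API reports per-term doc_freq and ttf (total term frequency) when term statistics are requested. A minimal sketch with the Python client; the index name and document ID are placeholders, and the numbers come from a single shard, so they're approximate:

```python
# Minimal sketch: inspect doc_freq and ttf (total term frequency) for the terms
# of one stored document via the term vectors API. Index name and document ID
# are placeholders; term statistics are per-shard, so treat them as approximate.
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

resp = client.termvectors(
    index="example-1234",           # placeholder index name
    id="SOME_DOCUMENT_ID",          # placeholder document ID
    fields=["physicalDescription"],
    term_statistics=True,
)

for term, stats in resp["term_vectors"]["physicalDescription"]["terms"].items():
    print(term, "doc_freq:", stats["doc_freq"], "ttf:", stats["ttf"])
```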

nik9000 commented 2 years ago

Thanks for the wonderful reproduction. I believe you've tracked it down properly. I'll grab a stack trace from your wonderful reproduction and we'll have a smoking gun. I mean cross_fields is pretty much a smoking gun, but, yeah, one moment!

nik9000 commented 2 years ago

Ok! Indeed. I can reproduce it against the head of the 8.4 branch. With assertions enabled it crashes here:

[2022-09-22T16:00:15,749][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [runTask-0] fatal error in thread [elasticsearch[runTask-0][search][T#1]], exiting
java.lang.AssertionError
        at org.apache.lucene.core@9.3.0/org.apache.lucene.index.TermStates.accumulateStatistics(TermStates.java:156)
        at org.apache.lucene.core@9.3.0/org.apache.lucene.index.TermStates.register(TermStates.java:137)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.lucene.queries.BlendedTermQuery.adjustTTF(BlendedTermQuery.java:185)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.lucene.queries.BlendedTermQuery.blend(BlendedTermQuery.java:164)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.lucene.queries.BlendedTermQuery.rewrite(BlendedTermQuery.java:84)
        at org.apache.lucene.core@9.3.0/org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:766)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:135)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.search.internal.SearchContext.rewrittenQuery(SearchContext.java:271)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:106)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:80)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:464)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:627)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:489)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:769)
        at org.elasticsearch.server@8.4.3-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

nik9000 commented 2 years ago

That's right there next to the suspected commit. I'll confirm, but, that's looking likely.

So! We'll fix it. But in the meantime, I expect you can work around this by removing cross_fields. That's not great from a search perspective, though.
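
For reference, this is what the repro query looks like with cross_fields removed, falling back to the default best_fields behaviour. A minimal sketch of the workaround (the index name is a placeholder), with the caveat above that relevance scoring will differ:

```python
# Workaround sketch: the same multi_match query with "type": "cross_fields"
# removed, so it falls back to the default best_fields behaviour and should not
# hit the BlendedTermQuery path in the stack trace above. Scoring will differ.
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")
index = "example-1234"  # placeholder: the index created by repro.py

query_resp = client.search(
    index=index,
    query={
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": "1",
                        "fields": ["physicalDescription", "edition", "notes"],
                        "operator": "And",
                        "_name": "data",
                    }
                }
            ]
        }
    },
)
print(query_resp["hits"]["total"])
```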

nik9000 commented 2 years ago

I'll confirm

Indeed, before the suspected commit there aren't any problems.

jtibshirani commented 2 years ago

@alexwlchan thank you for reporting this and for the clear reproduction! I'm working on a fix now.

jtibshirani commented 2 years ago

I uploaded a fix here: https://github.com/elastic/elasticsearch/pull/90278. I tried running the reproduction script, and the query now completes successfully.

The fix will be available in 8.4.3. The bug was also present in 7.17.7 and 8.5.0, but we caught it early enough that it'll be fixed before those are released.

alexwlchan commented 2 years ago

Thanks for the quick replies everyone! We've been able to reindex into a cluster running Elasticsearch 8.3 for now, and we'll wait for 8.4.3 before upgrading.

leovalen commented 2 years ago

I uploaded a fix here: #90278. I tried running the reproduction script, and the query now completes successfully.

The fix will be available in 8.4.3. The bug was also present in 7.17.7 and 8.5.0, but we caught it early enough that it'll be fixed before those are released.

Does 8.4.3 have a scheduled release date?

nik9000 commented 2 years ago

Does 8.4.3 have a scheduled release date?

We have a policy against giving dates but we're actively working on the release now. We're aware that for folks using cross_fields this is a big deal and we're moving.

alexwlchan commented 2 years ago

FYI @leovalen (and anybody else looking at this issue) – 8.4.3 was released a few hours ago, and I've confirmed my repro is no longer failing with that version. 🎉