elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Significant Text Aggregation - Getting inconsistent results with same background & foreground set #88904

Open foboni opened 2 years ago

foboni commented 2 years ago

Elasticsearch Version

8.2.2

Installed Plugins

No response

Java Version

11.0.15

OS Version

Ubuntu 20.04

Problem Description

I'm running a significant text aggregation query on an index with 20 shards. The goal is to get the top 100 significant terms from a field, using a foreground query that matches documents from the current month and a background query that matches all documents from the current and previous months. There are no active writes to this index.

But when I run the significant text aggregation query, I get inconsistent results. I have noticed the below differences between consecutive queries,

Doc stats from the stats API:

{
  "count": 6975043,
  "deleted": 628082
}

Query:

GET my-index/_search
{
  "_source": false,
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "field1:100 AND field2:2000"
          }
        },
        {
          "range": {
            "date": {
              "gte": "2022-07-01",
              "lte": "2022-07-28"
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "google": {
      "significant_text": {
        "size": 100,
        "field": "content",
        "gnd": {"background_is_superset": true},
        "filter_duplicate_text": true,
        "background_filter": {
          "filter": [
            {
              "query_string": {
                "query": "field1:100 AND field2:2000"
              }
            },
            {
              "range": {
                "date": {
                  "gte": "2022-06-01",
                  "lte": "2022-07-28"
                }
              }
            }
          ]
        }
      }
    }
  }
}
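For anyone trying to reproduce this, here is a minimal sketch that builds the same request body programmatically so the foreground and background filters cannot drift apart. Two assumptions that are not in the original report: the helper names `date_scoped_filter` and `build_request` are made up for illustration, and the `background_filter` is wrapped in a `bool` query, because `background_filter` expects a regular query and a bare `filter` array is not one as far as I know.

```python
import json


def date_scoped_filter(gte: str, lte: str) -> list:
    """Filter clauses shared by the foreground and background sets.

    Hypothetical helper: only the date range differs between the two sets.
    """
    return [
        {"query_string": {"query": "field1:100 AND field2:2000"}},
        {"range": {"date": {"gte": gte, "lte": lte}}},
    ]


def build_request(agg_name: str = "google") -> dict:
    """Build the search body from the report.

    Assumption: the background filter is expressed as a bool query.
    """
    foreground = date_scoped_filter("2022-07-01", "2022-07-28")
    background = date_scoped_filter("2022-06-01", "2022-07-28")
    return {
        "_source": False,
        "query": {"bool": {"filter": foreground}},
        "aggregations": {
            agg_name: {
                "significant_text": {
                    "size": 100,
                    "field": "content",
                    "gnd": {"background_is_superset": True},
                    "filter_duplicate_text": True,
                    "background_filter": {"bool": {"filter": background}},
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into Console or sent with any client.
    print(json.dumps(build_request(), indent=2))
```

Because both sets come from one helper, the background set stays a superset of the foreground set, which is what `background_is_superset` asserts.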

Steps to Reproduce

--

Logs (if relevant)

No response

foboni commented 2 years ago

I don't have steps to reproduce this issue.

I tried to reproduce this issue by creating a new index and re-indexing the documents matched by my background set into it using the _reindex API. When I ran the same significant text aggregation query on the new index, I got consistent results.

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

foboni commented 2 years ago

@markharwood Could you please take a look at this? Is there any way to get consistent results?

wchaparro commented 1 year ago

Hello @foboni, are you able to provide an example of the inconsistencies you are experiencing, please? Are you hitting different shard replicas when running this particular aggregation?
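One way to produce such an example is to save the JSON responses of two consecutive runs and diff their buckets. The sketch below is only an illustration: `bucket_index` and `diff_runs` are hypothetical helper names, the aggregation name `google` is taken from the query in the report, and the code assumes the usual response shape where each bucket carries `key`, `doc_count`, `bg_count` and `score`.

```python
def bucket_index(response: dict, agg_name: str = "google") -> dict:
    """Map each bucket key (the term) to its bucket for easy lookup."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket for bucket in buckets}


def diff_runs(first: dict, second: dict, agg_name: str = "google") -> dict:
    """Summarize how two runs of the same aggregation differ.

    Reports terms present in only one run, plus per-field
    (first, second) value pairs for terms present in both.
    """
    a = bucket_index(first, agg_name)
    b = bucket_index(second, agg_name)
    changed = {}
    for key in a.keys() & b.keys():
        fields = {
            name: (a[key].get(name), b[key].get(name))
            for name in ("doc_count", "bg_count", "score")
            if a[key].get(name) != b[key].get(name)
        }
        if fields:
            changed[key] = fields
    return {
        "only_in_first": sorted(a.keys() - b.keys()),
        "only_in_second": sorted(b.keys() - a.keys()),
        "changed": changed,
    }
```

Terms that appear in only one run point at the top-N cut, while changed `doc_count` or `bg_count` values for the same term point at the underlying counts differing between runs.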

foboni commented 1 year ago

@wchaparro

> are you able to provide an example of the inconsistencies you are experiencing please?

No

> Are you hitting different shard replicas when running this particular aggregation?

To verify this, I even tried re-indexing the query results into a single-shard index. Even then, the significant text aggregation results are inconsistent (though better than when running against the original index).
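A more direct check of the replica question is to pin every run to the same shard copies with the search API's `preference` parameter and see whether the results still vary. This is a rough sketch, not a confirmed fix: `search_url` and `run_pinned` are hypothetical helpers, the preference string `sigtext-debug` is arbitrary, and the code assumes a cluster reachable over plain HTTP without authentication (a default 8.x install would need TLS and credentials added).

```python
import json
import urllib.parse
import urllib.request


def search_url(base_url: str, index: str, preference: str) -> str:
    """Build a _search URL that carries a custom preference string."""
    query = urllib.parse.urlencode({"preference": preference})
    return f"{base_url.rstrip('/')}/{index}/_search?{query}"


def run_pinned(base_url: str, index: str, body: dict,
               preference: str = "sigtext-debug") -> dict:
    """Send the search body with a fixed preference and return the parsed JSON.

    Assumption: no authentication or TLS is required to reach the cluster.
    """
    request = urllib.request.Request(
        search_url(base_url, index, preference),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

If results become stable with a fixed preference string, the variation likely comes from differences between shard copies; if they still vary, the replicas are probably not the cause.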