elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Background Count (bg_count) Remains Zero in Nested and Filtered significant_terms Aggregation #101163

Open Emporea opened 11 months ago

Emporea commented 11 months ago

Elasticsearch Version

8.10.2

Installed Plugins

No response

Java Version

bundled

OS Version

Debian 6.1

Problem Description

Hi everyone,

I've recently started using the significant_terms aggregation with a nested field in my index, and I've noticed that the results are very similar to those of a standard terms aggregation. This leads me to believe that the background calculations for significance might not be working as expected with nested fields. Every bucket reports a bg_count of 0, even though the aggregation-level bg_count is 1445178, as shown in this excerpt of the response:


    "aggregations": {
      "significant_terms_nested": {
        "doc_count": 3823,
        "pos_filter": {
          "doc_count": 1522,
          "significant_terms": {
            "doc_count": 1522,
            "bg_count": 1445178,
            "buckets": [
              {
                "key": "chatgpt",
                "doc_count": 222,
                "score": 30746.516992131186,
                "bg_count": 0
              },
              {
                "key": "ai",
                "doc_count": 93,
                "score": 5395.764864337504,
                "bg_count": 0
              },
              {
                "key": "chatbot",
                "doc_count": 23,
                "score": 330.01054874542626,
                "bg_count": 0
              },
              {
                "key": "openai",
                "doc_count": 21,
                "score": 275.1115639046071,
                "bg_count": 0
              },
              {
                "key": "google",
                "doc_count": 19,
                "score": 225.20351532753946,
                "bg_count": 0
              },
              {
                "key": "rival",
                "doc_count": 15,
                "score": 140.3602269646585,
                "bg_count": 0
              }, ...

Here's a simplified version of my index mapping:

{
  "properties": {
    "my_field": {
      "type": "nested",
      "properties": {
        "txt": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        },
        "pos": {
          "type": "keyword"
        }
      }
    }
  }
}

To provide more clarity, I'm using the significant_terms aggregation as follows, where I'm filtering based on the pos field before performing the aggregation:

{
  "significant_terms_nested": {
    "nested": {
      "path": "my_field"
    },
    "aggs": {
      "pos_filter": {
        "filter": {
          "terms": {
            "my_field.pos": ["noun", "verb", "adj"]
          }
        },
        "aggs": {
          "significant_terms": {
            "significant_terms": {
              "field": "my_field.txt.keyword",
              "size": 50
            }
          }
        }
      }
    }
  }
}

My primary questions are:

  1. When using significant_terms on a nested field, especially after filtering by a nested field's value (like pos in my case), do I need to specify to Elasticsearch which field to use for the background search? I'd expect the background scan to consider the entire index without any filters applied. If so, how do I ensure this?
  2. Is it mandatory for a field to be mapped as text for the significant_terms aggregation to work properly? Or is it sufficient if a field is only mapped as a keyword?

Initially, I mapped the txt field with only the keyword type. After running the significant_terms aggregation, I noticed that the terms returned were not as "significant" as I had expected, and I wondered whether this was because the field was not also mapped as text. Hoping to get more relevant results, I changed the mapping to text with a keyword sub-field (as shown above). However, this change made no notable difference to the aggregation results.

Adding the following background_filter also has no effect; bg_count is still 0:

"background_filter": {
  "match_all": {}
}
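For reference, background_filter is a parameter of the significant_terms aggregation itself, placed next to field and size. The following is a minimal Python sketch of the full request body with the filter in that position. It assumes that this is where the filter was applied; the helper name build_aggs is hypothetical and only used for illustration.

```python
import json


def build_aggs(background_filter=None):
    """Build the nested + filter + significant_terms aggregation from this issue.

    build_aggs is a hypothetical helper; the field names come from the mapping above.
    """
    sig_terms = {"field": "my_field.txt.keyword", "size": 50}
    if background_filter is not None:
        # background_filter is a sibling of "field" and "size"
        sig_terms["background_filter"] = background_filter
    return {
        "significant_terms_nested": {
            "nested": {"path": "my_field"},
            "aggs": {
                "pos_filter": {
                    "filter": {
                        "terms": {"my_field.pos": ["noun", "verb", "adj"]}
                    },
                    "aggs": {
                        "significant_terms": {"significant_terms": sig_terms}
                    },
                }
            },
        }
    }


if __name__ == "__main__":
    # "size": 0 suppresses the hits so only the aggregations are returned.
    body = {"size": 0, "aggs": build_aggs({"match_all": {}})}
    print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against the index.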

Any insights or guidance on this would be greatly appreciated. Thanks in advance!

Steps to Reproduce

[
  {
    "txt": "word1",
    "pos": "POS_TYPE"
  },
  {
    "txt": "word2",
    "pos": "POS_TYPE"
  },
  ...
  {
    "txt": "wordN",
    "pos": "POS_TYPE"
  }
]

  1. Index documents with the my_field field structured as shown above.
  2. Run the significant_terms aggregation on my_field.txt.keyword, wrapped in the nested and filter aggregations shown in the problem description.
  3. Observe the bg_count in the aggregation results.
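
The steps above can be sketched in Python as follows. This is only a rough outline, not a verified reproduction: it assumes the array shown above is the value of my_field in a single document, the index name sigterms-repro is a placeholder, and the helper names are hypothetical. It only builds the request bodies and inspects a response; sending them to a cluster is left out.

```python
import json

INDEX = "sigterms-repro"  # placeholder index name, not from the original report

# Body for creating the index, using the mapping from the problem description.
CREATE_INDEX_BODY = {
    "mappings": {
        "properties": {
            "my_field": {
                "type": "nested",
                "properties": {
                    "txt": {
                        "type": "text",
                        "fields": {"keyword": {"type": "keyword"}},
                    },
                    "pos": {"type": "keyword"},
                },
            }
        }
    }
}


def make_doc(tokens):
    """Turn a list of (txt, pos) pairs into one document with nested my_field entries."""
    return {"my_field": [{"txt": txt, "pos": pos} for txt, pos in tokens]}


def bulk_lines(docs):
    """Build the newline-delimited body for the _bulk API (one action line per document)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": INDEX}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def zero_bg_buckets(response):
    """Return the keys of all buckets whose bg_count is 0 in the search response."""
    sig = response["aggregations"]["significant_terms_nested"]["pos_filter"][
        "significant_terms"
    ]
    return [b["key"] for b in sig["buckets"] if b["bg_count"] == 0]


if __name__ == "__main__":
    docs = [
        make_doc([("chatgpt", "noun"), ("rival", "noun")]),
        make_doc([("google", "noun"), ("ai", "noun")]),
    ]
    print(json.dumps(CREATE_INDEX_BODY, indent=2))
    print(bulk_lines(docs), end="")
```

Passing the parsed search response to zero_bg_buckets lists the affected terms for step 3.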

Logs (if relevant)

No response

elasticsearchmachine commented 10 months ago

Pinging @elastic/es-analytics-geo (Team:Analytics)