elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Cardinality on `_score` returns incorrect results #25337

Open ohadravid opened 7 years ago

ohadravid commented 7 years ago

Elasticsearch version: 5.4.2 (Using the latest official docker - docker.elastic.co/elasticsearch/elasticsearch:5.4.2)

Plugins installed: []

JVM version (java -version): openjdk version "1.8.0_131"

OS version (uname -a if on a Unix-like system): Linux 3f86925d6644 4.9.27-moby #1 SMP Thu May 11 04:01:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

When running a cardinality aggregation on the document score, an incorrect value of 1 is returned regardless of the actual data. This would be useful to us in the case of a complex function_score query, which returns a small number of discrete (integer, in our case) values.

Steps to reproduce:

The following query:

{
   "aggs": {
      "score_dc": {
         "cardinality": {
            "script": {
               "inline": "_score",
               "lang": "expression"
            }
         }
      },
      "score_min": {
         "min": {
            "script": {
               "inline": "_score",
               "lang": "expression"
            }
         }
      },
      "score_max": {
         "max": {
            "script": {
               "inline": "_score",
               "lang": "expression"
            }
         }
      }
   },
   "query": {
      "function_score": {
         "functions": [
            {
              "random_score": { 
                "seed":  0
              }
            }
         ]
      }
   },
   "size": 0
}

The cardinality result should reflect the random scores; however, this:

{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 3,
      "successful": 3,
      "failed": 0
   },
   "hits": {
      "total": 11,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "score_dc": {
         "value": 1
      },
      "score_max": {
         "value": 0.7933780550956726
      },
      "score_min": {
         "value": 0.0467032790184021
      }
   }
}

is returned from ES. Note that min and max work as expected, but cardinality doesn't.

Switching the script lang to `painless` produces:

{
   "took": 14,
   "timed_out": false,
   "_shards": {
      "total": 6,
      "successful": 1,
      "failed": 5,
      "failures": [
         {
            "shard": 0,
            "index": ".monitoring-alerts-2",
            "node": "wr5-w_JRT9eeSPUicopRJw",
            "reason": {
               "type": "null_pointer_exception",
               "reason": null
            }
         }
      ]
   },
   "hits": {
      "total": 0,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "score_dc": {
         "value": 0
      }
   }
}

Steps to reproduce:

  1. Launch the official Docker image.
  2. Run the above query (you do need to wait for a couple of docs to be indexed before running it).

tlrx commented 7 years ago

This looks like a bug, as if the cardinality aggregation isn't aware of the score when executing the script. I suspect that the script always returns 0.0 for all collected documents, so it counts only 1 distinct value.
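
One way to probe that suspicion (a rough sketch, not verified against 5.4.2) is to run a terms aggregation on the same script next to the cardinality; if terms returns several score buckets while cardinality still reports 1, the problem is specific to the cardinality code path rather than to script scoring in general:

{
   "query": {
      "function_score": {
         "functions": [
            { "random_score": { "seed": 0 } }
         ]
      }
   },
   "aggs": {
      "score_values": {
         "terms": {
            "script": {
               "inline": "_score",
               "lang": "expression"
            },
            "size": 20
         }
      },
      "score_dc": {
         "cardinality": {
            "script": {
               "inline": "_score",
               "lang": "expression"
            }
         }
      }
   },
   "size": 0
}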

markharwood commented 7 years ago

Reproduced on master, apart from the null pointer exception when using painless (master just returns 1).

eskibars commented 7 years ago

@ohadravid we had a discussion about this issue this morning. I've only rarely seen scores used for bucketing in this fashion, and we want to make sure we go about any solution in the best way. A bit more detail on your use case would help us there: can you explain what your use case is, what sort of calculations go into your scores, and what you intend to use the buckets for?

ohadravid commented 7 years ago

Consider the following data set:

data = [
    {'os': 'Windows Server 2012', 'hostname': 'DC01'},
    {'os': 'Windows Server 2012', 'hostname': 'MAIL'},
    {'os': 'Windows 10', 'hostname': 'PC'}
]

We want to split our data by role, which is defined as:

  1. {'prefix': {'hostname.keyword': 'DC'}} are DCs (Score 2)
  2. {'prefix': {'hostname.keyword': 'PC'}} are Personal Computers (Score 3)
  3. Computers not matching any of them, which are Other (Score 4)

The basic query is this:

{
   "query": {
      "function_score": {
         "score_mode": "min",
         "min_score": 2,
         "functions": [
            {
               "filter": {
                  "prefix": {
                     "hostname.keyword": "DC"
                  }
               },
               "weight": 2
            },
            {
               "filter": {
                  "prefix": {
                     "hostname.keyword": "PC"
                  }
               },
               "weight": 3
            },
            {
               "filter": {
                  "match_all": {}
               },
               "weight": 4
            }
         ]
      }
   }
}

Also, the roles are ordered (they behave like a switch-case: stop after the first match), so if we had a Development group:

  1. {'prefix': {'hostname.keyword': 'DC'}} are DCs (Score 2)
  2. {'prefix': {'hostname.keyword': 'PC'}} are Personal Computers (Score 3)
  3. {'prefix': {'hostname.keyword': 'D'}} are Development (Score 4)
  4. Computers not matching any of them, which are Other (Score 5)

We would expect DC01 to be a DC and DWEB to be in the Development group.
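
For illustration, the ordered variant would look roughly like this, extending the basic query above with the Development prefix before the catch-all (a rough sketch):

{
   "query": {
      "function_score": {
         "score_mode": "min",
         "min_score": 2,
         "functions": [
            {
               "filter": { "prefix": { "hostname.keyword": "DC" } },
               "weight": 2
            },
            {
               "filter": { "prefix": { "hostname.keyword": "PC" } },
               "weight": 3
            },
            {
               "filter": { "prefix": { "hostname.keyword": "D" } },
               "weight": 4
            },
            {
               "filter": { "match_all": {} },
               "weight": 5
            }
         ]
      }
   }
}

Because score_mode is min, a document keeps the lowest weight among the functions it matches: DC01 matches the DC, D and match_all filters and scores 2, while DWEB matches only D and match_all and scores 4, which is what gives the earlier clauses priority.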

We sometimes want to filter only some of the groups (for example, give me DCs and Other), and sometimes we want to aggregate by role with terms (which would be a terms aggregation on _score):

{
   "query": {
   ... # Same as above.
   },
   "aggs": {
      "operating_system": {
         "terms": {
            "field": "os.keyword",
            "size": 10
         },
         "aggs": {
            "role": {
               "terms": {
                  "script": {
                      "inline": "_score",
                      "lang": "expression"
                  },
                  "size": 10
               },
               "meta": {
                   "value_map": {
                       "2.0": "DCs",
                       "3.0": "Personal Computers",
                       "4.0": "Other"
                   }
               }
            }
         }
      }
   },
   "track_scores": true,
   "size": 0
}

(We use meta a bit differently in the actual query. BTW, we use it extensively, not only for this. Awesome feature.)

I am aware that the basic query can be implemented using named queries, and that the terms can be implemented with a filters aggregation, but that requires writing more complex queries with many bool queries carrying must_not clauses on the other groups, and it would also mean rewriting the query entirely. (This is a simplified use case; we generate a lot of these queries and they can include many groups.)
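
For reference, the filters-aggregation variant would look roughly like this (a sketch; note how every later group has to repeat and negate the earlier groups' clauses, which is the duplication described above):

{
   "aggs": {
      "role": {
         "filters": {
            "filters": {
               "DCs": {
                  "prefix": { "hostname.keyword": "DC" }
               },
               "Personal Computers": {
                  "bool": {
                     "filter": { "prefix": { "hostname.keyword": "PC" } },
                     "must_not": { "prefix": { "hostname.keyword": "DC" } }
                  }
               },
               "Other": {
                  "bool": {
                     "must_not": [
                        { "prefix": { "hostname.keyword": "DC" } },
                        { "prefix": { "hostname.keyword": "PC" } }
                     ]
                  }
               }
            }
         }
      }
   },
   "size": 0
}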

In this case, we want to use terms to bucket the data by the os field and then know how many roles we have under each os (I am not aware of any other way to do this):

{
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "score_mode": "min",
            "min_score": 2,
            "functions": [
              {
                "filter": {
                  "prefix": {
                    "hostname.keyword": "DC"
                  }
                },
                "weight": 2
              },
              {
                "filter": {
                  "prefix": {
                    "hostname.keyword": "PC"
                  }
                },
                "weight": 3
              },
              {
                "filter": {
                  "match_all": {}
                },
                "weight": 4
              }
            ]
          }
        }
      ]
    }
  },
   "aggs": {
      "operating_system": {
         "terms": {
            "field": "os.keyword",
            "size": 10
         },
         "aggs": {
            "role": {
               "cardinality": {
                  "script": {
                     "inline": "_score",
                     "lang": "expression"
                  }
               }
            }
         }
      }
   },
   "track_scores": true,
   "size": 0
}
eskibars commented 7 years ago

It seems to me that you're fundamentally trying to work around a data modeling problem using scoring. For example, I could imagine a world where you used something at ingest time (perhaps ingest node or your own data loading) to add another, more structured field to indicate whether the given object is a DC, a PC, etc. That way you could avoid your prefix queries. You could even add an arbitrary sort order if that's a fixed order. It'd be more complicated, but I think you could also do this all with Painless scripts. I'm curious whether you've considered that, or if I'm missing something?
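
Roughly along these lines with an ingest pipeline, for example (only a sketch of the idea; the pipeline name and script are illustrative, and on newer versions the script processor takes source instead of inline):

PUT _ingest/pipeline/derive-role
{
  "description": "Derive a structured role field at index time",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "inline": "if (ctx.hostname.startsWith('DC')) { ctx.role = 'DC'; } else if (ctx.hostname.startsWith('PC')) { ctx.role = 'Personal Computer'; } else { ctx.role = 'Other'; }"
      }
    }
  ]
}

The derived role then behaves like any other field, so terms, filters and cardinality can all use it directly.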

ohadravid commented 7 years ago

We model the data in this way on purpose, as we usually define the groups in an ad-hoc manner. (I used prefix queries in the original example for simplicity.)

Let's consider the following data set:

data = [
    {'os': 'Windows Server 2012 R2', 'hostname': 'DC01', 'role': 'DC'},
    {'os': 'Windows Server 2012', 'hostname': 'DC02', 'role': 'DC'},
    {'os': 'Windows Server 2012 R2', 'hostname': 'MAIL01', 'role': 'Exchange'},
    {'os': 'Windows Server 2008', 'hostname': 'MAIL02', 'role': 'Exchange'},
    {'os': 'Windows 10', 'hostname': 'PC', 'role': 'Personal Computer'}
]

For example, we might want to:

  1. Split the computers into Production ({ 'terms': { 'role': ['DC', 'Exchange'] } }) and Non Production (all else).
  2. Get all the operating systems under each group.

There can be a lot of role groups: All Core Servers (DC, Backup DC, ESX, ...), Mail System (Exchange, CAS, OWA, ...). The actual group definition is context-dependent, so computing it at ingest time is not possible.
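
The first split above (Production vs Non Production), for example, is expressed at query time with the same function_score pattern (a rough sketch; whether the field is role or role.keyword depends on the mapping):

{
   "query": {
      "function_score": {
         "score_mode": "min",
         "min_score": 2,
         "functions": [
            {
               "filter": { "terms": { "role": ["DC", "Exchange"] } },
               "weight": 2
            },
            {
               "filter": { "match_all": {} },
               "weight": 3
            }
         ]
      }
   },
   "aggs": {
      "group": {
         "terms": {
            "script": {
               "inline": "_score",
               "lang": "expression"
            },
            "size": 10
         },
         "aggs": {
            "operating_system": {
               "terms": { "field": "os.keyword", "size": 10 }
            }
         }
      }
   },
   "track_scores": true,
   "size": 0
}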

As another example, we might want to:

  1. Split the computers into Old ({ 'terms': { 'os': ['Windows XP', 'Windows Server 2008'] } }) and New ({ 'terms': { 'os': ['Windows 7', 'Windows 10', 'Windows Server 2016'] } }).
  2. Get all the roles under each group.

Since operating systems are deprecated every now and then (for example, we might deprecate Server 2012, making it part of the Old group), setting this information at ingest time is not possible, and it makes more sense to derive the grouping from the actual data on the computer.

(Note that we do some pre-computing, since role can be computed from other data on the server such as running processes, installed services, open ports, etc).

ohadravid commented 7 years ago

Hi, just wanted to check if and when this might be fixed.

markharwood commented 7 years ago

Personally, I'd prefer to see this requirement implemented as a feature to allow matching query clauses to attach some form of user-supplied tag/metadata to the docs they match which could then be picked up by aggregations as if they were a property of the original doc. These derived doc properties would be transient and only live for the duration of the aggregation request.

rjernst commented 7 years ago

@markharwood How is that different than an aggregation on a script? The script can produce whatever value it wants, derived from the fields of the document.

markharwood commented 7 years ago

How is that different than an aggregation on a script

The question was about searching for A, B OR C and not having to repeat whatever the A, B and C clauses are in the aggregation part of the request.

Example: if A was (solr OR lucidworks) and B was (elasticsearch OR logstash OR kibana) then an agg script could not easily recreate these free-text queries without resorting to fielddata.
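
In other words, without some form of clause tagging the clauses have to be repeated verbatim as filters in the agg tree, roughly like this (a sketch; body is just a placeholder field name):

{
   "query": {
      "bool": {
         "should": [
            { "match": { "body": "solr lucidworks" } },
            { "match": { "body": "elasticsearch logstash kibana" } }
         ]
      }
   },
   "aggs": {
      "clause": {
         "filters": {
            "filters": {
               "A": { "match": { "body": "solr lucidworks" } },
               "B": { "match": { "body": "elasticsearch logstash kibana" } }
            }
         }
      }
   },
   "size": 0
}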

ohadravid commented 7 years ago

allow matching query clauses to attach some form of user-supplied tag/metadata to the docs

That's actually what we wanted originally, and we found out that the only way to attach data to the docs is via _score (which is a bit of an abuse, I admit).

While having it as a first-class feature has many benefits (for example, we sometimes define scripted_fields, only to repeat them multiple times in each query as scripted_metric or scripted_bucket), (a) it would take a lot of time (so we won't see it in the next release) and (b) this is a bug! It's true that this is a less traveled road, but I do expect cardinality to behave like min or max, so as a bug it should be fixed, even if the use case is a bit dubious.

markharwood commented 7 years ago

the only way to attach data to the docs is using _score (which is a bit of an abuse, I admit).

An abuse I've also had to resort to in other Lucene-related projects :) I've never felt happy about score being the only means of smuggling any match-metadata out of the Lucene querying process. I can think of a couple of alternative approaches as a way to implement it:

The former would allow richer metadata but the latter would be more efficient.

as bug it should be fixed, even if the use case is a bit dubious.

There are added dangers in supporting this use case. Encouraging people to run cardinality aggs on low-cardinality fields (e.g. controlled scores that only take the values 1, 2 or 3) will lead people into this doozie.

ohadravid commented 7 years ago

Encouraging people to run cardinality aggs on low-cardinality fields (e.g. controlled scores that only take the values 1, 2 or 3) will lead people into this doozie.

You allow someone to do cardinality on any field (and you trust them not to do it on fields where it doesn't make any sense, like a message's @timestamp). You should allow someone to do it on _score (and trust them in the same way).

You provide sharp tools, people cut themselves and learn from it. That's the way it should be 🙂
(And getting an incorrect answer is much worse than getting a slow answer)

markharwood commented 7 years ago

Re tagged queries - I had a chat with a member of the Lucene team, who thought that adding support would impede the ability to optimize queries via rewrites or skip-aheads. The feeling is that the better trade-off is to repeat the clause expressions as filters in the agg tree, which should be efficient to execute.

ohadravid commented 7 years ago

So it seems like we might need to keep hacking _score for some time :) Do you think this would be fixed before the next release?

eskibars commented 7 years ago

You provide sharp tools, people cut themselves and learn from it. That's the way it should be

Since we're exploring analogies, you have to also understand that there's an organization behind Elasticsearch, and when people "cut themselves" (we use the term "shoot themselves in the foot" more often here) they come to our support organization complaining about how they were able to shoot themselves in the foot and how they need emergency services. While some people want the sharpest tools, there's a support-resource and financial burden that comes with this type of encouragement. There are some experienced, long-time users of Elasticsearch (some of whom may rely on quirks of the software), but the growing popularity with newer/less experienced users means we want to avoid situations where people can do bad things.

(And getting an incorrect answer is much worse than getting a slow answer)

That's something we may have fundamental differences of opinion on. When given the choice between speed and potentially losing some accuracy, Elasticsearch has chosen speed.

ohadravid commented 7 years ago

I still don't see how cardinality on _score is any different from cardinality on any other field. You need to know the approximate cardinality when you design the query, and be careful. That's true for @timestamp and true for _score.

I agree that losing some accuracy for performance is quite alright, but when you return 0 for every query and every data set, I'd say you might have taken this a bit too far :)

To be more accurate, if a feature is not implemented, failing silently and returning incorrect data (while very fast) is much worse than getting an exception.

ohadravid commented 6 years ago

Hi!

This still happens in 6.2. What are the odds that this can be addressed? We really want to run these kinds of queries.

Thanks, Ohad

colings86 commented 6 years ago

@elastic/es-search-aggs

imotov commented 4 years ago

This is basically a slightly different version of #26294. In both cases the collector doesn't get the scorer set, which results in _score always being 0. This one has a slightly different code path compared to #26294, so I am going to leave it open to make sure this path is covered as well.

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)