ohadravid opened 7 years ago
This looks like a bug, as if the cardinality aggregation isn't aware of the score when executing the script. I suspect that the script always returns 0.0 for all collected documents, so it counts only 1 distinct value.
Reproduced on master, apart from the null pointer exception when using painless (master just returns 1).
@ohadravid we had a discussion about this issue this morning. I've only rarely seen scores used for bucketing in this fashion and we want to make sure we go about any solution in the best way. I think a bit more detail on your use case would help us there. Can you give a bit more detail about what your use case is, what sort of calculations go into your scores, and what do you intend to use the buckets for?
Consider the following data set:

```python
data = [
    {'os': 'Windows Server 2012', 'hostname': 'DC01'},
    {'os': 'Windows Server 2012', 'hostname': 'MAIL'},
    {'os': 'Windows 10', 'hostname': 'PC'}
]
```
We want to split our data by role, which is defined as:

- `{'prefix': {'hostname.keyword': 'DC'}}` are DCs (score 2)
- `{'prefix': {'hostname.keyword': 'PC'}}` are Personal Computers (score 3)

The basic query is this:
```json
{
  "query": {
    "function_score": {
      "score_mode": "min",
      "min_score": 2,
      "functions": [
        {
          "filter": {
            "prefix": {
              "hostname.keyword": "DC"
            }
          },
          "weight": 2
        },
        {
          "filter": {
            "prefix": {
              "hostname.keyword": "PC"
            }
          },
          "weight": 3
        },
        {
          "filter": {
            "match_all": {}
          },
          "weight": 4
        }
      ]
    }
  }
}
```
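As a minimal sketch (plain Python, not Elasticsearch), the scoring above can be simulated directly: every matching filter contributes its weight, and `score_mode: "min"` keeps the smallest one, with `match_all` acting as the "Other" fallback.

```python
# Sketch of how the function_score query above scores each document.
data = [
    {'os': 'Windows Server 2012', 'hostname': 'DC01'},
    {'os': 'Windows Server 2012', 'hostname': 'MAIL'},
    {'os': 'Windows 10', 'hostname': 'PC'},
]

# (filter predicate, weight) pairs mirroring the JSON query
functions = [
    (lambda doc: doc['hostname'].startswith('DC'), 2),
    (lambda doc: doc['hostname'].startswith('PC'), 3),
    (lambda doc: True, 4),  # match_all fallback
]

def score(doc):
    # score_mode "min": minimum weight over all matching filters
    return min(w for matches, w in functions if matches(doc))

scores = {doc['hostname']: score(doc) for doc in data}
print(scores)  # {'DC01': 2, 'MAIL': 4, 'PC': 3}
```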
Also, the roles are ordered (they behave like a switch-case, stopping after the first match), so if we had a Development group:

- `{'prefix': {'hostname.keyword': 'DC'}}` are DCs (score 2)
- `{'prefix': {'hostname.keyword': 'PC'}}` are Personal Computers (score 3)
- `{'prefix': {'hostname.keyword': 'D'}}` are Development (score 4)

We would expect `DC01` to be a DC and `DWEB` to be in the Development group.
We sometimes want to filter only some of the groups (for example, give me DCs and Other), and sometimes we want to use `terms` by role (which would be a `terms` aggregation on `_score`):
```json
{
  "query": {
    ... # Same as above.
  },
  "aggs": {
    "operating_system": {
      "terms": {
        "field": "os.keyword",
        "size": 10
      },
      "aggs": {
        "role": {
          "terms": {
            "script": {
              "inline": "_score",
              "lang": "expression"
            },
            "size": 10
          },
          "meta": {
            "value_map": {
              "2.0": "DCs",
              "3.0": "Personal Computers",
              "4.0": "Other"
            }
          }
        }
      }
    }
  },
  "track_scores": true,
  "size": 0
}
```
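As an illustration (not actual Elasticsearch output), the intended bucketing can be sketched in Python, reusing the scoring logic from the example data above; the `value_map` from `meta` then maps each score bucket back to a role name:

```python
from collections import Counter

data = [
    {'os': 'Windows Server 2012', 'hostname': 'DC01'},
    {'os': 'Windows Server 2012', 'hostname': 'MAIL'},
    {'os': 'Windows 10', 'hostname': 'PC'},
]

def role_score(doc):
    # same logic as the function_score query: min weight of matching filters
    weights = [4]  # match_all fallback ("Other")
    if doc['hostname'].startswith('DC'):
        weights.append(2)
    if doc['hostname'].startswith('PC'):
        weights.append(3)
    return float(min(weights))

value_map = {'2.0': 'DCs', '3.0': 'Personal Computers', '4.0': 'Other'}

# the inner "role" terms aggregation: bucket documents by _score,
# then translate the bucket keys via the value_map
buckets = Counter(str(role_score(doc)) for doc in data)
named = {value_map[k]: count for k, count in buckets.items()}
print(named)  # {'DCs': 1, 'Other': 1, 'Personal Computers': 1}
```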
(We use `meta` a bit differently in the actual query. BTW, we use it extensively, not only for this. Awesome feature.)
I am aware that the basic query can be implemented using named queries, and the `terms` can be implemented using a `filters` aggregation, but that requires writing more complex queries with many `bool` with `must_not` clauses on the other terms. It would also require rewriting the query entirely. (This is a simplified use case; we generate a lot of these queries and they can include many groups.)
In this case, we want to use `terms` to bucket the data by the `os` field, and then know how many roles we have under each `os` (I am not aware of any other way this can be done):
```json
{
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "score_mode": "min",
            "min_score": 2,
            "functions": [
              {
                "filter": {
                  "prefix": {
                    "hostname.keyword": "DC"
                  }
                },
                "weight": 2
              },
              {
                "filter": {
                  "prefix": {
                    "hostname.keyword": "PC"
                  }
                },
                "weight": 3
              },
              {
                "filter": {
                  "match_all": {}
                },
                "weight": 4
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "operating_system": {
      "terms": {
        "field": "os.keyword",
        "size": 10
      },
      "aggs": {
        "role": {
          "cardinality": {
            "script": {
              "inline": "_score",
              "lang": "expression"
            }
          }
        }
      }
    }
  },
  "track_scores": true,
  "size": 0
}
```
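A sketch of the expected result (illustrative Python, not Elasticsearch output): grouping by `os` and counting distinct scores per bucket should yield 2 roles for 'Windows Server 2012' (a DC and an Other) and 1 for 'Windows 10', rather than the constant 1 the bug produces.

```python
from collections import defaultdict

data = [
    {'os': 'Windows Server 2012', 'hostname': 'DC01'},
    {'os': 'Windows Server 2012', 'hostname': 'MAIL'},
    {'os': 'Windows 10', 'hostname': 'PC'},
]

def role_score(doc):
    # same logic as the function_score query above
    weights = [4]  # match_all fallback
    if doc['hostname'].startswith('DC'):
        weights.append(2)
    if doc['hostname'].startswith('PC'):
        weights.append(3)
    return min(weights)

# outer "operating_system" terms bucket -> inner cardinality on _score
scores_by_os = defaultdict(set)
for doc in data:
    scores_by_os[doc['os']].add(role_score(doc))

cardinality = {os_: len(scores) for os_, scores in scores_by_os.items()}
print(cardinality)  # {'Windows Server 2012': 2, 'Windows 10': 1}
```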
It seems to me that you're fundamentally trying to work around a data modeling problem using scoring. For example, I could imagine a world where you used something at ingest time (perhaps ingest node or your own data loading) to add another, more structured field to indicate whether the given object is a DC, a PC, etc. That way you could avoid your prefix queries. You could even add an arbitrary sort order if that's a fixed order. It'd be more complicated, but I think you could also do this all with Painless scripts. I'm curious if you considered that/if I'm missing something?
We model the data this way on purpose, as we usually define the groups in an ad-hoc manner. (I used the `prefix` query in the original example for simplicity.)
Let's consider the following data set:

```python
data = [
    {'os': 'Windows Server 2012 R2', 'hostname': 'DC01', 'role': 'DC'},
    {'os': 'Windows Server 2012', 'hostname': 'DC02', 'role': 'DC'},
    {'os': 'Windows Server 2012 R2', 'hostname': 'MAIL01', 'role': 'Exchange'},
    {'os': 'Windows Server 2008', 'hostname': 'MAIL02', 'role': 'Exchange'},
    {'os': 'Windows 10', 'hostname': 'PC', 'role': 'Personal Computer'}
]
```
For example, we might want to split the servers into Production (`{ 'terms': { 'role': ['DC', 'Exchange'] } }`) and Non Production (all else). There can be a lot of role groups: All Core Servers (DC, Backup DC, ESX, ...), Mail System (Exchange, CAS, OWA, ...). The actual group definition is context-dependent, so ingest time is not possible.
As another example, we might want to split the operating systems into Old (`{ 'terms': { 'os': ['Windows XP', 'Windows Server 2008'] } }`) and New (`{ 'terms': { 'os': ['Windows 7', 'Windows 10', 'Windows Server 2016'] } }`). Since operating systems are deprecated every now and then (for example, we might deprecate Server 2012, making it part of the Old group), setting this information when ingesting the data is not possible, and it makes more sense to model the data with the actual data from the computer.
(Note that we do some pre-computing, since `role` can be computed from other data on the server, such as running processes, installed services, open ports, etc.)
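The ad-hoc grouping can be sketched in plain Python (the group definitions here are illustrative, mirroring the `terms` filters above):

```python
data = [
    {'os': 'Windows Server 2012 R2', 'hostname': 'DC01', 'role': 'DC'},
    {'os': 'Windows Server 2012', 'hostname': 'DC02', 'role': 'DC'},
    {'os': 'Windows Server 2012 R2', 'hostname': 'MAIL01', 'role': 'Exchange'},
    {'os': 'Windows Server 2008', 'hostname': 'MAIL02', 'role': 'Exchange'},
    {'os': 'Windows 10', 'hostname': 'PC', 'role': 'Personal Computer'},
]

# ad-hoc, context-dependent group: the equivalent of
# { 'terms': { 'role': ['DC', 'Exchange'] } } vs. "all else"
production_roles = {'DC', 'Exchange'}

groups = {
    'Production': [d['hostname'] for d in data if d['role'] in production_roles],
    'Non Production': [d['hostname'] for d in data if d['role'] not in production_roles],
}
print(groups)
# {'Production': ['DC01', 'DC02', 'MAIL01', 'MAIL02'], 'Non Production': ['PC']}
```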
Hi, just wanted to check if and when this might be fixed?
Personally, I'd prefer to see this requirement implemented as a feature to allow matching query clauses to attach some form of user-supplied tag/metadata to the docs they match which could then be picked up by aggregations as if they were a property of the original doc. These derived doc properties would be transient and only live for the duration of the aggregation request.
@markharwood How is that different than an aggregation on a script? The script can produce whatever value it wants, derived from the fields of the document.
> How is that different than an aggregation on a script

The question was about searching for A, B OR C and not having to repeat whatever the A, B, and C clauses are in the aggregation part of the request.
Example: if A was (solr OR lucidworks) and B was (elasticsearch OR logstash OR kibana), then an agg script could not easily recreate these free-text queries without resorting to fielddata.
> allow matching query clauses to attach some form of user-supplied tag/metadata to the docs

That's actually what we wanted originally, and we found out that the only way to attach data to the docs is using `_score` (which is a bit of an abuse, I admit).
While having it as a first-class feature has many benefits (for example, we sometimes define `scripted_fields`, only to repeat them multiple times in each query as a `scripted_metric` or `scripted_bucket`), (a) it will take a lot of time (so we won't see it in the next release) and (b) it's a bug! It's true that this is a less traveled road, but I do expect `cardinality` to behave like `min` or `max`, so as a bug it should be fixed, even if the use case is a bit dubious.
> the only way to attach data to the docs is using _score (which is a bit of an abuse, I admit).

An abuse I've also had to resort to in other Lucene-related projects :) I've never felt happy about the score being the only means of smuggling any match-metadata out of the Lucene querying process. I can think of a couple of alternative approaches as a way to implement it: a `sampler` agg or `breadth_first` collection. The former would allow richer metadata, but the latter would be more efficient.
> as a bug it should be fixed, even if the use case is a bit dubious.

There are added dangers in supporting this use case. Encouraging people to do cardinality aggs on low-cardinality fields (e.g. controlled scores that only ever take the values 1, 2, or 3) will lead people into this doozie.
> Encouraging people to do cardinality aggs on low-cardinality fields (e.g. controlled scores that only ever take the values 1, 2, or 3) will lead people into this doozie.
You allow someone to do `cardinality` on any field (and you trust them not to do it on fields where it doesn't make any sense, like a message's `@timestamp`). You should allow someone to do it on `_score` (and trust them in the same way).
You provide sharp tools, people cut themselves and learn from it. That's the way it should be 🙂
(And getting an incorrect answer is much worse than getting a slow answer)
Re tagged queries - I had a chat with a member of the Lucene team, who thought that adding support would impede the ability to optimize queries via rewrites or skip-aheads. The feeling is that the better trade-off is to repeat the clause expressions in `filters` in the agg tree, which should be efficient to execute.
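For reference, a sketch of that alternative for the role example earlier in the thread: a `filters` aggregation that repeats the clause expressions, with an explicit `must_not` bucket for the fallback (the bucket names here are illustrative):

```json
{
  "aggs": {
    "role": {
      "filters": {
        "filters": {
          "DCs": { "prefix": { "hostname.keyword": "DC" } },
          "Personal Computers": { "prefix": { "hostname.keyword": "PC" } },
          "Other": {
            "bool": {
              "must_not": [
                { "prefix": { "hostname.keyword": "DC" } },
                { "prefix": { "hostname.keyword": "PC" } }
              ]
            }
          }
        }
      }
    }
  }
}
```

This is exactly the repetition (and `must_not` bookkeeping) that the original poster wanted to avoid, but it does not depend on `_score` at all.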
So it seems like we might need to keep hacking `_score` for some time :)
Do you think this would be fixed before the next release?
> You provide sharp tools, people cut themselves and learn from it. That's the way it should be

Since we're exploring analogies: you also have to understand that there's an organization behind Elasticsearch, and when people "cut themselves" (we use the term "shoot themselves in the foot" more often here), they come to our support organization complaining about how they were able to shoot themselves in the foot and how they need emergency services. While some people want the sharpest tools, there's a support and financial burden that comes with this type of encouragement. There are some experienced, long-time users of Elasticsearch (some of whom may rely on quirks of the software), but the growing popularity with newer/less experienced users means we want to avoid situations where people can do bad things.
> (And getting an incorrect answer is much worse than getting a slow answer)
That's something we may have fundamental differences of opinion on. When given the choice between speed and potentially losing some accuracy, Elasticsearch has chosen speed.
I still don't see how cardinality on `_score` is any different from cardinality on any other field. You need to know the approximate cardinality when you design the query, and be careful. That's true for `@timestamp` and true for `_score`.
I agree that losing some accuracy for performance is quite alright, but when you return 0 for every query/data set, I'd say you might have taken this a bit too far :)
To be more accurate: if a feature is not implemented, failing silently and returning incorrect data (while very fast) is much worse than getting an exception.
Hi!
This still happens in 6.2. What are the odds that this can be addressed? We really want to do those kind of queries..
Thanks, Ohad
@elastic/es-search-aggs
This is basically a slightly different version of #26294. In both cases the collector doesn't get a scorer set, which results in `_score` always being 0. This one has a slightly different code path compared to #26294, so I am going to leave it open to make sure this path is covered as well.
Pinging @elastic/es-analytical-engine (Team:Analytics)
**Elasticsearch version**: 5.4.2 (using the latest official Docker image - docker.elastic.co/elasticsearch/elasticsearch:5.4.2)
**Plugins installed**: []
**JVM version** (`java -version`): openjdk version "1.8.0_131"
**OS version** (`uname -a` if on a Unix-like system): Linux 3f86925d6644 4.9.27-moby #1 SMP Thu May 11 04:01:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
**Description of the problem including expected versus actual behavior**:
When doing a cardinality aggregation on the document score, an incorrect value of 1 is returned regardless of the actual data. This is useful to us in the case of a complex `function_score` query, which returns a number of discrete (integer, in our case) values.
**Steps to reproduce**:
The following query:
The result for the cardinality should reflect the random scores; however, this is returned from ES. Note that `min` and `max` work as expected, but `cardinality` doesn't. Switching to the `painless` lang produces: