Open mbarretta opened 3 years ago
Pinging @elastic/es-analytics-geo (Team:Analytics)
I'm not sure how longs its been like this. I can look into it. If we decide this is the right behavior we should at least document it. Folks don't always expect it.
Elasticsearch 1.7's sum
aggregation returns 0
when aggregating on empty fields. I think we probably shouldn't change it now. But we should make sure its documented.
With that being the case, is there any way to have a Kibana line or bar type visualization aggregate data without plotting a point at 0 when all of the values being aggregated over the bucket are null?
@nik9000 I agree that it doesn't make sense to change it now if it's been this way for so long. That said, per @kspurrier, can we have a setting (or value for the existing missing
) setting) that gives null
vs 0
for missing values?
From the PR associated, Postgres is referenced which is a darn good example. Postgres can be reasonable in getting away with something like this because one can easily just use COALESCE. If there's an equivalent workaround (in the other direction) with ES / Kibana, that'd be awesome in the docs.
Example: create table es_is_not_a_db (id integer, name varchar(100), rank varchar(100), failures integer); insert into es_is_not_a_db (id, name, rank) values (1, 'Joan', 'Boss'); insert into es_is_not_a_db (id, name, rank) values (1, 'Joe', 'Worker Bee'); insert into es_is_not_a_db (id, name, rank) values (1, 'Bob', 'Worker Bee'); select COUNT(failures), SUM(failures), COALESCE(sum(failures), 0) from es_is_not_a_db;
Returns:
count | sum | coalesce |
---|---|---|
0 | 0 |
@nik9000 I agree that it doesn't make sense to change it now if it's been this way for so long. That said, per @kspurrier, can we have a setting (or value for the existing
missing
) setting) that givesnull
vs0
for missing values?
I recall that being trickier to do than it sounded the last time folks talked about it. Could you link to an issue this is trying to solve?
If there's an equivalent workaround (in the other direction) with ES / Kibana, that'd be awesome in the docs.
There are kind of three cases where we do a different thing than postgresql with sum. Probably more, but three I can think of. But here goes:
doc_count
in the bucket the sum
is inside - if its 0 you can do null
style stuff.null
or an empty arrays. You can use "missing": "NaN"
to get NaN
. It ain't great, but its a thing.You only ever sum
null
or an empty arrays. You can use"missing": "NaN"
to getNaN
. It ain't great, but its a thing.
Wait, that isn't right. That'll get you NaN
when any documents are missing the value.
Wait, that isn't right. That'll get you
NaN
when any documents are missing the value.
You'd need to run another aggregation to count docs that have missing values. It'd pretty terrible frankly. It looks like:
curl -uelastic:password -HContent-Type:application/json -XDELETE localhost:9200/test1
curl -uelastic:password -HContent-Type:application/json -XDELETE localhost:9200/test2
curl -uelastic:password -HContent-Type:application/json -XPUT localhost:9200/test1/_doc/1?refresh -d'{"a": "a", "some_missing": 1.0}'
curl -uelastic:password -HContent-Type:application/json -XPUT localhost:9200/test2/_doc/2?refresh -d'{"a": "a"}'
curl -uelastic:password -HContent-type:application/json -XPOST localhost:9200/test*/_search?pretty -d'{
"aggs": {
"a": {
"terms": {
"field": "a.keyword"
},
"aggs": {
"min": {
"min": {
"field": "some_missing"
}
},
"missing": {
"missing": {
"field": "some_missing"
}
},
"sum": {
"sum": {
"field": "some_missing"
}
}
}
}
}
}'
I suspect folks don't typically encounter this because they treat our fields like they are non-nullable and single valued. Do you bump into this sort of thing frequently? documents with uneven fields that encode meaning in the absence of that field?
Thank you. That explanation definitely helps me, and I think it may give me a way around some of the issues that may more specifically just be limitations around how we're using Kibana out of the box for more general visualizations of data that aren't the classic use case.
For us lately, the absence of data tends to be real and we don't want to see it plotted in Kibana visualizations when doing things like applying filters with Sum aggregations.
I've recently been working with visualizing some agile metrics from tasking data where we want to see things like burn down with projections where we ideally want the line to start beyond x-axis item 0+ where we have data. If I handle all aggregations prior to Kibana, it isn't too tough to make the visualization look nice, but as soon as users ask about multiple filters using sums, I don't currently have a good answer for them and they dislike the line starting from 0 and ramping up to the true starting point every time. Something like the Jira visualization here isn't a bad example
It seems like there's some interest in changing this to null
like you asked. Check out the PR. Its a common case and just coalescing nulls to 0 if you want 0 is a good solution. If we'd realized it when we first made it we'd have returned null. We just disagree some on whether its worth building a whole migration path. If we had all the time in the world I'd be for it. But we'll talk more and see where we land.
Elasticsearch version (
bin/elasticsearch --version
): 7.12.0OS version (
uname -a
if on a Unix-like system): OSXDescription of the problem including expected versus actual behavior: The sum aggregation by default should ignore documents that are missing values for the field being summed. Instead, it looks like it defaults to a
0
value. The min and max aggregations have the same default behavior, but correctly ignore missing fields and returnnull
Steps to reproduce:
Results showing correct handling of the missing
baz
field by min (i.e. anull
), but incorrect handling by sum (0
)Provide logs (if relevant):