elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
876 stars 24.81k forks source link

Sum aggregation not ignoring missing values #71582

Open mbarretta opened 3 years ago

mbarretta commented 3 years ago

Elasticsearch version (bin/elasticsearch --version): 7.12.0

OS version (uname -a if on a Unix-like system): OSX

Description of the problem including expected versus actual behavior: The sum aggregation by default should ignore documents that are missing values for the field being summed. Instead, it looks like it defaults to a 0 value. The min and max aggregations have the same default behavior, but correctly ignore missing fields and return null

Steps to reproduce:

#create index w/ doc
POST sum_test/_doc?refresh
{
  "foo": "bar"
}

#run min and sum aggs on non-existent "baz" field
GET sum_test/_search?size=0
{
  "aggs": {
    "min": {
      "min": {
        "field": "baz"
      }
    },
    "sum": {
      "sum": {
        "field": "baz"
      }
    }
  }
}

Results showing correct handling of the missing baz field by min (i.e. a null), but incorrect handling by sum (0)

...
  "aggregations" : {
    "min" : {
      "value" : null
    },
    "sum" : {
      "value" : 0.0
    }
  }
...

Provide logs (if relevant):

elasticmachine commented 3 years ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

nik9000 commented 3 years ago

I'm not sure how longs its been like this. I can look into it. If we decide this is the right behavior we should at least document it. Folks don't always expect it.

nik9000 commented 3 years ago

Elasticsearch 1.7's sum aggregation returns 0 when aggregating on empty fields. I think we probably shouldn't change it now. But we should make sure its documented.

kspurrier commented 3 years ago

With that being the case, is there any way to have a Kibana line or bar type visualization aggregate data without plotting a point at 0 when all of the values being aggregated over the bucket are null?

mbarretta commented 3 years ago

@nik9000 I agree that it doesn't make sense to change it now if it's been this way for so long. That said, per @kspurrier, can we have a setting (or value for the existing missing) setting) that gives null vs 0 for missing values?

kspurrier commented 3 years ago

From the PR associated, Postgres is referenced which is a darn good example. Postgres can be reasonable in getting away with something like this because one can easily just use COALESCE. If there's an equivalent workaround (in the other direction) with ES / Kibana, that'd be awesome in the docs.

Example: create table es_is_not_a_db (id integer, name varchar(100), rank varchar(100), failures integer); insert into es_is_not_a_db (id, name, rank) values (1, 'Joan', 'Boss'); insert into es_is_not_a_db (id, name, rank) values (1, 'Joe', 'Worker Bee'); insert into es_is_not_a_db (id, name, rank) values (1, 'Bob', 'Worker Bee'); select COUNT(failures), SUM(failures), COALESCE(sum(failures), 0) from es_is_not_a_db;

Returns:

count sum coalesce
0   0
nik9000 commented 3 years ago

@nik9000 I agree that it doesn't make sense to change it now if it's been this way for so long. That said, per @kspurrier, can we have a setting (or value for the existing missing) setting) that gives null vs 0 for missing values?

I recall that being trickier to do than it sounded the last time folks talked about it. Could you link to an issue this is trying to solve?

If there's an equivalent workaround (in the other direction) with ES / Kibana, that'd be awesome in the docs.

There are kind of three cases where we do a different thing than postgresql with sum. Probably more, but three I can think of. But here goes:

  1. If the field is unmapped we return 0. PostgresQL returns an error. I expect kibana won't want to make these sorts of requests because it won't see the field in the field_caps API.
  2. If no documents match the request we'll give 0 and PostgresQL will return null. You'd have to check the doc_count in the bucket the sum is inside - if its 0 you can do null style stuff.
  3. You only ever sum null or an empty arrays. You can use "missing": "NaN" to get NaN. It ain't great, but its a thing.
nik9000 commented 3 years ago

You only ever sum null or an empty arrays. You can use "missing": "NaN" to get NaN. It ain't great, but its a thing.

Wait, that isn't right. That'll get you NaN when any documents are missing the value.

nik9000 commented 3 years ago

Wait, that isn't right. That'll get you NaN when any documents are missing the value.

You'd need to run another aggregation to count docs that have missing values. It'd pretty terrible frankly. It looks like:

curl -uelastic:password -HContent-Type:application/json -XDELETE localhost:9200/test1
curl -uelastic:password -HContent-Type:application/json -XDELETE localhost:9200/test2
curl -uelastic:password -HContent-Type:application/json -XPUT localhost:9200/test1/_doc/1?refresh -d'{"a": "a", "some_missing": 1.0}'
curl -uelastic:password -HContent-Type:application/json -XPUT localhost:9200/test2/_doc/2?refresh -d'{"a": "a"}'
curl -uelastic:password -HContent-type:application/json -XPOST localhost:9200/test*/_search?pretty -d'{
  "aggs": {
    "a": {
      "terms": {
        "field": "a.keyword"
      },
      "aggs": {
        "min": {
          "min": {
            "field": "some_missing"
          }
        },
        "missing": {
          "missing": {
            "field": "some_missing"
          }
        },
        "sum": {
          "sum": {
            "field": "some_missing"
          }
        }
      }
    }
  }
}'

I suspect folks don't typically encounter this because they treat our fields like they are non-nullable and single valued. Do you bump into this sort of thing frequently? documents with uneven fields that encode meaning in the absence of that field?

kspurrier commented 3 years ago

Thank you. That explanation definitely helps me, and I think it may give me a way around some of the issues that may more specifically just be limitations around how we're using Kibana out of the box for more general visualizations of data that aren't the classic use case.

For us lately, the absence of data tends to be real and we don't want to see it plotted in Kibana visualizations when doing things like applying filters with Sum aggregations.

I've recently been working with visualizing some agile metrics from tasking data where we want to see things like burn down with projections where we ideally want the line to start beyond x-axis item 0+ where we have data. If I handle all aggregations prior to Kibana, it isn't too tough to make the visualization look nice, but as soon as users ask about multiple filters using sums, I don't currently have a good answer for them and they dislike the line starting from 0 and ramping up to the true starting point every time. Something like the Jira visualization here isn't a bad example

nik9000 commented 3 years ago

It seems like there's some interest in changing this to null like you asked. Check out the PR. Its a common case and just coalescing nulls to 0 if you want 0 is a good solution. If we'd realized it when we first made it we'd have returned null. We just disagree some on whether its worth building a whole migration path. If we had all the time in the world I'd be for it. But we'll talk more and see where we land.