elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.93k stars 24.73k forks source link

Terms Aggregation delivers wrong results when field is SCALED_FLOAT with different scaling factors #72965

Open josefschiefer27 opened 3 years ago

josefschiefer27 commented 3 years ago

Elasticsearch version (bin/elasticsearch --version): 7.12.1

Plugins installed: []

JVM version (java -version): java version "16" 2021-03-16

OS version (uname -a if on a Unix-like system): MacOS Catalina, Version 10.15.7

Description of the problem including expected versus actual behavior: I am using Terms aggregation across multiple indices which contain a field "price" of type SCALED_FLOAT which may have different scaling factors (older indices are using scaling factor 10, newer indices the scaling factor 100).

The terms aggregation seems to ALWAYS lose precision when combining different scaling factors (see example).

Steps to reproduce:

  1. Create index my-index-000001 with a field price with scaling factor 10:

    PUT my-index-000001
    {
    "mappings": {
    "properties": {
      "price": {
        "type": "scaled_float",
        "scaling_factor": 10
      }
    }
    }
    }
  2. Insert document with price 7.3 into the index my-index-000001.

    PUT my-index-000001/_doc/1
    {
    "price" : 7.3
    }
  3. Create index my-index-000002 with a field price with scaling factor 100:

    PUT my-index-000002
    {
    "mappings": {
    "properties": {
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100
      }
    }
    }
    }
  4. Insert document with price 7.3 into the index my-index-000002.

    PUT my-index-000002/_doc/1
    {
    "price" : 7.3
    }
  5. Create a terms aggregation across both indices

    GET my-index-*/_search
    {
    "size":0,
    "aggs": {
    "myaggs": {
      "terms": {
        "field": "price"
      }
    }
    }
    }

The above query delivers the following result:

{
  ...
  "aggregations" : {
    "myaggs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 7.3,
          "doc_count" : 1
        },
        {
          "key" : 7.300000000000001,
          "doc_count" : 1
        }
      ]
    }
  }
}

As you can see, the result contains two keys for the value 7.3. It seems there is a precision issue which generates the two keys. Please note that the value 7.3 is compatible with the data types for the field "price" in both the old and new index.

I am assuming that the values are internally converted to be compatible, I don't understand the loss in precision for this specific example.

elasticmachine commented 3 years ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

josefschiefer27 commented 3 years ago

Same precision problem seems to happen when price is a runtime field in my-index-000002.

PUT my-index-000002
{
  "mappings": {
    "runtime": {
      "price": {
        "type": "double"
      }
    }
  }
}

PUT my-index-000002/_doc/1
{
  "price":7.3
}
josefschiefer27 commented 3 years ago

Any updates about the root cause or possible fixing options for the precision problem with SCALED_FLOATS?

not-napoleon commented 2 years ago

I did some investigating here, and can say with confidence that this problem is not related to mixed scaling factors. Just running the terms aggregation on my-index-000001 shows that the key for that index is always 7.300000000000001. Unfortunately everything that eventually becomes a double is subject to the imprecisions of floating point math, and 73 * 0.1 = 7.300000000000001 in floating point world.

Running a Terms aggregation on things that look like floats is dangerous. Fundamentally, terms is an equality operation, and floating point equality is not well supported. Standard practice is to check if two floating point values were within some small epsilon of each other rather than directly checking equality, for exactly this reason. In terms of aggregations, instead of a terms, I would suggest using a numeric histogram, like this:

GET /my-index-*/_search
{
  "size":0,
  "aggs": {
    "myaggs": {
      "histogram": {
        "field": "price",
        "interval": 0.1
      }
    }
  }
}

Which correctly puts both documents in the same bucket:

{
  ...
  "aggregations": {
    "myaggs": {
      "buckets": [
        {
          "key": 7.300000000000001,
          "doc_count": 2
        }
      ]
    }
  }
}
josefschiefer27 commented 2 years ago

Thank you for the investigation and suggestions. It is a workaround however it complicates things when there are nested aggregations or pipeline aggregations at play. Also when sorting values, this will lead to weird results (since 7.3 and 7.300000000000001 will show up as separate values). The precision issues of the SCALED_FLOAT data type seem to be trappy which its current implementation - some aggs and searches (e.g. requiring equality operations, sorting) simply don't work as expected.

imotov commented 2 years ago

As the documentation says scaled_float allows you to trade accuracy for less index space, otherwise it is behaves the same way as a less precise double with all caveats such as rounding errors. If you want a guaranteed precision, it is probably a better idea to just use long. I opened an issue #84064 to see if we can improve handling of scaled_floats in aggregation framework in general.