apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Weird query results when using the DataSketches Quantiles Sketch #8659

Open QiuMM opened 4 years ago

QiuMM commented 4 years ago

We used the DataSketches quantiles sketch to compute quantiles and got very weird query results.

Affected Version

0.12.2

Description

Metrics spec at ingestion time:

"metricsSpec": [
      {
        "type": "count",
        "name": "count"
      },
      {
        "type": "doubleSum",
        "name": "cm_value",
        "fieldName": "cm_value",
        "expression": null
      },
      {
        "type": "quantilesDoublesSketch",
        "name": "cm_value_sketch",
        "fieldName": "cm_value",
        "k": 128
      }
]

My query:

 "aggregations": [
    {
      "type": "quantilesDoublesSketch",
      "name": "custom_value_sketch",
      "fieldName": "cm_value"
    },
    {
      "type": "doubleSum",
      "name": "count",
      "fieldName": "count"
    },
    {
      "type": "doubleSum",
      "name": "cm_value_sum",
      "fieldName": "cm_value"
    }
  ],
  "postAggregations": [
    {
      "type": "quantilesDoublesSketchToQuantiles",
      "name": "quantiles",
      "fractions": [
        0.1,
        0.2,
        0.3,
        0.4,
        0.5,
        0.6,
        0.7,
        0.8,
        0.9,
        1
      ],
      "field": {
        "type": "fieldAccess",
        "fieldName": "custom_value_sketch"
      }
    }
  ]

The query result:

"result" : {
    "count" : 4223.0,
    "cm_value_sum" : 667109.0,
    "quantiles" : [ 52.0, 179.0, 515.0, 929.0, 1185.0, 1426.0, 1680.0, 2047.0, 2601.0, 6000.0 ],
    "custom_value_sketch" : 529
  }

As we can see, the 0.5-quantile is 1185.0, so nearly half of the cm_value values must be greater than or equal to 1185.0. However, multiplying 1185 by 2111 (half of the count) gives 2501535, which is much greater than the cm_value sum of 667109. That is impossible; this should not happen. We loaded the same data into Hive, queried it there, and got this result:

"result" : {
    "count" : 4223.0,
    "cm_value_sum" : 667109.0,
    "quantiles" : [ 70.0, 82.0, 96.0, 112.0, 136.0, 160.0, 189.0, 229.0, 274.8000000000002, 3368.0 ]
  }
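
The sanity check described above can be written out as a short Python sketch. It assumes every cm_value is non-negative (the thread does not state this) and treats the reported quantile as exact, although sketch quantiles are only approximate. The function name `min_sum_implied_by_quantile` is made up for illustration and is not part of Druid or DataSketches.

```python
import math


def min_sum_implied_by_quantile(count: int, fraction: float, quantile_value: float) -> float:
    """Lower bound on the sum of `count` non-negative values, given that
    the `fraction`-quantile of those values equals `quantile_value`.

    Assumption: all values are >= 0, so the rows below the quantile
    contribute at least 0 and the rows at or above it contribute at
    least `quantile_value` each.
    """
    rows_at_or_above = math.floor(count * (1.0 - fraction))
    return rows_at_or_above * quantile_value


def is_consistent(count: int, total: float, fraction: float, quantile_value: float) -> bool:
    """True if the reported sum is at least the implied lower bound."""
    return total >= min_sum_implied_by_quantile(count, fraction, quantile_value)


# Numbers taken from the two results above.
COUNT = 4223
CM_VALUE_SUM = 667109.0

druid_median = 1185.0
hive_median = 136.0

print(min_sum_implied_by_quantile(COUNT, 0.5, druid_median))  # 2501535.0
print(is_consistent(COUNT, CM_VALUE_SUM, 0.5, druid_median))  # False
print(is_consistent(COUNT, CM_VALUE_SUM, 0.5, hive_median))   # True
```

Under these assumptions the Druid median implies a minimum sum well above the reported 667109, while the Hive median passes the same check.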

@AlexanderSaydakov is there a bug in the DataSketches Quantiles Sketch, or am I using it the wrong way?

AlexanderSaydakov commented 4 years ago

Yes, there were some bugs in previous versions of the quantiles aggregator. I don't have a list of GitHub issues and pull requests, and it is a bit difficult now to point out exactly which versions had which bug. One example is https://github.com/apache/incubator-druid/pull/7320. Unfortunately, when the quantiles sketch was fixed, a bug in the Theta sketch was introduced. So I would recommend upgrading to the latest version of Druid (0.16.0-incubating).

QiuMM commented 4 years ago

@AlexanderSaydakov thanks. Is it enough to upgrade only the DataSketches extension rather than the whole of Druid?

AlexanderSaydakov commented 4 years ago

I am not sure which version of the extension would be compatible with which version of Druid. It is always built as part of the whole Druid package.

QiuMM commented 4 years ago

Okay, I'll give it a try. Thanks, @AlexanderSaydakov.