apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Error bounds / probabilities / skewness as first-class Druid query results #7160

Open leventov opened 5 years ago

leventov commented 5 years ago

When describing Online Aggregation, I suggested that when the Broker sends partial results back to the client, it also sends a flag indicating that the partial aggregation results may be skewed. It may also send estimated errors / confidence intervals for the partial aggregation values, if it is able to compute them for the given aggregation function and if the user opts in to receive such data.

I think this idea shouldn't be confined to partial query results during online aggregation and could equally apply to "final" query results (equivalent to "offline" query results).
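To make the proposal concrete, here is a rough sketch of what such a result could look like for a simple sum. Everything in it is hypothetical: `AggregateResult`, `estimate_total` and their fields are illustrative names, not Druid APIs. The interval assumes that the segments processed so far are a simple random sample of all segments, which is exactly the assumption that may not hold in practice, hence the skew flag.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


@dataclass
class AggregateResult:
    """Hypothetical result envelope; not an existing Druid response format."""
    value: float
    partial: bool          # True if not all segments contributed
    may_be_skewed: bool    # the flag proposed in this issue
    confidence: Optional[float] = None
    interval: Optional[Tuple[float, float]] = None


def estimate_total(per_segment_sums: Sequence[float],
                   total_segments: int,
                   with_error: bool = True) -> AggregateResult:
    """Scale the sums seen so far up to an estimate of the full total.

    Assumption: the processed segments are a simple random sample of all
    segments. If they are not, the interval is too optimistic.
    """
    n = len(per_segment_sums)
    if n == 0 or n > total_segments:
        raise ValueError("need between 1 and total_segments partial sums")

    mean = sum(per_segment_sums) / n
    value = mean * total_segments
    partial = n < total_segments
    result = AggregateResult(value=value, partial=partial, may_be_skewed=partial)

    # An interval needs at least two observations to estimate the variance.
    if with_error and n >= 2:
        sample_var = sum((x - mean) ** 2 for x in per_segment_sums) / (n - 1)
        # Finite population correction: the error shrinks to zero once
        # every segment has been processed.
        fpc = 1.0 - n / total_segments
        std_err = total_segments * math.sqrt(sample_var / n * fpc)
        result.confidence = 0.95
        result.interval = (value - Z_95 * std_err, value + Z_95 * std_err)
    return result


# 4 of 8 segments processed so far
print(estimate_total([10, 12, 8, 10], total_segments=8))
```

Once all segments are in, the same function returns a zero-width interval and clears both flags, so "final" results fit the same shape as partial ones.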

Some of the sources of inconsistencies / error / variance:

As with Online Aggregation, work is needed on both the backend (Druid itself) and the frontend (the UIs that query Druid) to support this and bring value to users.

In terms of antifragility, Druid's current error-oblivious approach to query results may be classified as fragile. An approach that makes errors first-class query results might be classified as resilient, or perhaps even antifragile, because it might help users learn something new about their data during abrupt events.

FYI @gianm @mistercrunch @vogievetsky @julianhyde @leerho @weijietong

leerho commented 5 years ago

I strongly support the concept that any aggregation that returns approximate results should also return a means for the user to establish the likely bounds on the error along with the corresponding confidence interval.

Please note that all of the sketches in the DataSketches library provide both a-priori and a-posteriori error estimation methods.
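To illustrate the difference between the two kinds of estimate, here is a toy k-minimum-values (KMV) distinct-count sketch. It is a stand-in written for this comment, not the DataSketches implementation, and `KMVSketch` and its methods are hypothetical names. The a-priori error depends only on the configured size `k` (for the KMV estimator the relative standard error is approximately `1 / sqrt(k - 2)`), while the a-posteriori bounds also look at the sketch's state after the data has been seen. The bounds here use a crude normal approximation.

```python
import hashlib
import heapq
import math


class KMVSketch:
    """Toy KMV distinct-count sketch; illustrative only."""

    def __init__(self, k: int):
        if k < 3:
            raise ValueError("k must be at least 3")
        self.k = k
        self._heap = []        # negated hashes: a max-heap of the k smallest
        self._members = set()  # hashes currently retained

    @staticmethod
    def _hash01(item) -> float:
        # Map the item to a pseudo-uniform value in (0, 1].
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return (int.from_bytes(digest[:8], "big") + 1) / 2 ** 64

    def update(self, item) -> None:
        h = self._hash01(item)
        if h in self._members:
            return  # duplicate of a retained value
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest retained hash with the smaller new one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct: exact
        return (self.k - 1) / (-self._heap[0])

    def a_priori_rse(self) -> float:
        """Approximate relative standard error, known before any data."""
        return 1.0 / math.sqrt(self.k - 2)

    def bounds(self, num_std_dev: int = 2):
        """A-posteriori bounds based on the current sketch state."""
        est = self.estimate()
        if len(self._heap) < self.k:
            return est, est  # still exact, so no error at all
        half_width = num_std_dev * self.a_priori_rse() * est
        return est - half_width, est + half_width


sketch = KMVSketch(k=1024)
for i in range(10_000):
    sketch.update(f"user-{i}")
print(sketch.a_priori_rse(), sketch.estimate(), sketch.bounds(2))
```

A query result could carry the estimate together with the bounds and the number of standard deviations they correspond to, which is the kind of first-class error information this issue asks for.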

Also, please do not confuse the built-in Druid Approximate Histogram with the DataSketches Quantiles sketch, which can also produce an approximate histogram. The built-in Druid Approximate Histogram is very data sensitive and cannot provide any error guarantees. Largely because of these issues, it does not qualify as a "sketch"; it is a purely empirical algorithm. Please see this comparative study.

leerho commented 5 years ago

The discussion in #6099 confuses these two algorithms by associating the size and error table of the DataSketches Quantiles DoublesSketch with the Druid built-in Approximate Histogram.