leventov opened this issue 5 years ago
I don't know what algorithm is being used to establish the "TopN", but I hope it isn't the old heuristic of computing a local top-N on each node and then merging the partial lists.
That approach can produce unknown and significant errors, depending on how the data is distributed across the nodes. It gives no information about which items in the final list actually belong there, what the error in each item's position might be, or which items might just be noise. The technique is only a heuristic, and it is not hard to show how it can return very misleading results.
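To make that failure mode concrete, here is a toy Java illustration (the per-node counts are invented, not from any real dataset) of how merging per-node top-2 lists can drop the true top item:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveTopNFailure {
  public static void main(String[] args) {
    // Invented per-node counts: "C" never makes a local top-2 cut,
    // yet its global total (160) is the largest of all items.
    Map<String, Long> node1 = Map.of("A", 100L, "B", 90L, "C", 80L);
    Map<String, Long> node2 = Map.of("D", 100L, "E", 90L, "C", 80L);

    // Naive algorithm: take the local top-2 from each node, then merge and re-rank.
    Map<String, Long> merged = new HashMap<>();
    for (Map<String, Long> node : List.of(node1, node2)) {
      node.entrySet().stream()
          .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
          .limit(2)
          .forEach(e -> merged.merge(e.getKey(), e.getValue(), Long::sum));
    }
    System.out.println(merged); // {A=100, B=90, D=100, E=90}: the true #1 item "C" is missing
  }
}
```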
This problem can be solved much more efficiently, accurately, and confidently using the DataSketches Frequent Items sketch (FIS).
The FIS will tell you the upper and lower bounds of the actual weight of each item, which establish the range within which the true weight must lie. When those ranges overlap, the actual order of the overlapping items is effectively a tie; nonetheless, the bounds can be used to identify which items may be tied, for example, for the 3rd position. These bounds, by the way, are hard bounds, so the confidence is 100%.
Also, you can query the sketch using either the NO_FALSE_POSITIVES (more strict) or the NO_FALSE_NEGATIVES (less strict) error type. But under no circumstances will the sketch return items that fall below the noise floor.
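For concreteness, a minimal Java sketch of feeding weighted items into the FIS and reading back estimates with their hard bounds (the items, weights, and map size below are made up; in pre-Apache releases the package was com.yahoo.sketches.frequencies rather than org.apache.datasketches.frequencies):

```java
import org.apache.datasketches.frequencies.ErrorType;
import org.apache.datasketches.frequencies.ItemsSketch;

public class FrequentItemsExample {
  public static void main(String[] args) {
    // maxMapSize must be a power of 2; it bounds the number of counters the sketch tracks.
    ItemsSketch<String> sketch = new ItemsSketch<>(64);

    // Hypothetical (dimension value, weight) pairs, e.g. row counts per country.
    sketch.update("US", 120);
    sketch.update("CN", 95);
    sketch.update("IN", 40);
    sketch.update("BR", 5);

    // NO_FALSE_POSITIVES (strict): every returned item is a true heavy hitter.
    // NO_FALSE_NEGATIVES (less strict): no true heavy hitter is omitted.
    for (ItemsSketch.Row<String> row : sketch.getFrequentItems(ErrorType.NO_FALSE_POSITIVES)) {
      System.out.printf("%s estimate=%d lb=%d ub=%d%n",
          row.getItem(), row.getEstimate(), row.getLowerBound(), row.getUpperBound());
    }
  }
}
```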
Be aware that the reason this sketch is called "Frequent Items" is that there may not be any items that are more frequent than the others (i.e., if they all occur with roughly the same frequency). Returning zero results is therefore possible, which is why the name "TopN" is misleading: there may not be N items to return.
This is why I prefer to call this kind of operation "Frequent Items" or "Heavy Hitters" and not "TopN".
@leerho yes, we do exactly the "old" algorithm that you described :) See #1135. Thank you for letting us know about how the algorithm can be improved! I've opened a separate issue #7187 to track your proposal.
@leventov BTW, I have checked into DataSketches/sketches-core a new sketch I developed specifically with this Druid issue in mind. It is called the Frequent Distinct Tuples (FDT) sketch. I would appreciate it if someone could experiment with it. So there are now two sketches available to help Druid deal with "TopN" issues: the FDT sketch plus the FIS. I can make myself available for a short tutorial on how to get started, if anyone is interested.
@leerho I support your stance in #7187. Unfortunately, though, I won't be able to work on these ideas myself in the near future.
@leerho Do you have any documentation on what the FDT sketch is capable of? I'm curious how it can be mapped onto two important things: use cases and SQL.
@gianm Yes, the initial docs are the Javadocs (duplicated on the website). I am working on more extensive docs, including some accuracy characterization studies, which are more complicated to do. Meanwhile, see the FdtSketch Javadoc.
Also see the unit tests for some really simple examples to get started. But don't use FdtSketch(lgK) to configure the sketch; use FdtSketch(threshold, rse) instead. You can start out by trying numbers like FdtSketch(0.01, 0.05), which means you will capture distincts that make up at least about 1% of the overall population of distincts, with a Relative Standard Error (at 68% confidence) of 5% or better.
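A minimal configuration sketch in Java, assuming the FdtSketch(threshold, rse) constructor described above and an update(String[] tuple) call as in the Javadoc examples; the tuple contents are invented, the query side is covered in the unit tests, and in pre-Apache releases the package was com.yahoo.sketches.fdt:

```java
import org.apache.datasketches.fdt.FdtSketch;

public class FdtConfigExample {
  public static void main(String[] args) {
    // Configure by accuracy goals rather than by lgK:
    //   threshold = 0.01 -> capture tuples whose distinct population is at least ~1% of the total
    //   rse       = 0.05 -> Relative Standard Error of 5% or better (at 68% confidence)
    FdtSketch sketch = new FdtSketch(0.01, 0.05);

    // Each input is a tuple of dimension values; here a hypothetical (country, userId) pair.
    sketch.update(new String[] {"US", "user-123"});
    sketch.update(new String[] {"US", "user-456"});
    sketch.update(new String[] {"CN", "user-789"});
  }
}
```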
@gianm see #7187
This issue elaborates on one of the aspects mentioned in the discussion about Error bounds / probabilities / skewness as first-class Druid query results: time trends for single-valued query types such as topN and groupBy. Results may differ across time intervals, for example:
...
-- China goes to bed, Americas wake up.
When we aggregate just those 5 hours, there is a relative trend for each dimension value (country). We may detect this when aggregating results on the Broker and send an enum along with the value for each dimension value or grouping key:
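For example, something like the following (the names and categories are just illustrative, not an existing Druid type):

```java
// Hypothetical per-key trend marker attached to each grouping key in Broker-merged results.
public enum TrendMarker {
  UPWARD,        // value consistently grows across the aggregated intervals
  DOWNWARD,      // value consistently falls across the aggregated intervals
  HIGH_VARIANCE, // large swings with no consistent direction
  FLAT           // no meaningful change
}
```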
Then in query UIs, consistent upward and downward trends may be represented as small green and red arrows (like those used in stock market interfaces :)
High variance without a particular trend may be indicated by something like an asterisk, a different tone for the row, or an exclamation mark in a circle, indicating that the total aggregate value probably has low statistical significance, or that there is some problem with the data (see #7160).
FYI @mistercrunch @julianhyde @vogievetsky