leventov opened this issue 5 years ago
I don't know what algorithm is being used to establish the "TopN", but I hope it isn't the old heuristic of computing a local top-N on each node and then merging the partial lists.
That approach can produce unknown and significant errors, depending on how the data is distributed across the nodes. It gives no information about which items in the final list actually belong there, what the error in each item's position might be, or which items might just be noise. The technique is only a heuristic, and it is not hard to show how it can return very misleading results.
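To make that failure mode concrete, here is a toy Java illustration (the per-node counts are invented, not from any real dataset) of how merging per-node top-2 lists can drop the true top item:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveTopNFailure {
  public static void main(String[] args) {
    // Invented per-node counts: "C" never makes a local top-2 cut,
    // yet its global total (160) is the largest of all items.
    Map<String, Long> node1 = Map.of("A", 100L, "B", 90L, "C", 80L);
    Map<String, Long> node2 = Map.of("D", 100L, "E", 90L, "C", 80L);

    // Naive algorithm: take the local top-2 from each node, then merge and re-rank.
    Map<String, Long> merged = new HashMap<>();
    for (Map<String, Long> node : List.of(node1, node2)) {
      node.entrySet().stream()
          .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
          .limit(2)
          .forEach(e -> merged.merge(e.getKey(), e.getValue(), Long::sum));
    }
    System.out.println(merged); // {A=100, B=90, D=100, E=90}: the true #1 item "C" is missing
  }
}
```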
This problem can be solved much more efficiently, accurately, and confidently using the DataSketches Frequent Items sketch (FIS).
The FIS will tell you the upper and lower bounds of the actual weight of each item, which establish the range within which the true weight must lie. When those ranges overlap, the actual order of the overlapping items is effectively a tie; nonetheless, the bounds can be used to identify which items may be tied, for example, for the 3rd position. These bounds, by the way, are hard bounds, so the confidence is 100%.
Also, you can query the sketch using either the NO_FALSE_POSITIVES (more strict) or the NO_FALSE_NEGATIVES (less strict) error type. But under no circumstances will the sketch return items that fall below the noise floor.
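For concreteness, a minimal Java sketch of feeding weighted items into the FIS and reading back estimates with their hard bounds (the items, weights, and map size below are made up; in pre-Apache releases the package was com.yahoo.sketches.frequencies rather than org.apache.datasketches.frequencies):

```java
import org.apache.datasketches.frequencies.ErrorType;
import org.apache.datasketches.frequencies.ItemsSketch;

public class FrequentItemsExample {
  public static void main(String[] args) {
    // maxMapSize must be a power of 2; it bounds the number of counters the sketch tracks.
    ItemsSketch<String> sketch = new ItemsSketch<>(64);

    // Hypothetical (dimension value, weight) pairs, e.g. row counts per country.
    sketch.update("US", 120);
    sketch.update("CN", 95);
    sketch.update("IN", 40);
    sketch.update("BR", 5);

    // NO_FALSE_POSITIVES (strict): every returned item is a true heavy hitter.
    // NO_FALSE_NEGATIVES (less strict): no true heavy hitter is omitted.
    for (ItemsSketch.Row<String> row : sketch.getFrequentItems(ErrorType.NO_FALSE_POSITIVES)) {
      System.out.printf("%s estimate=%d lb=%d ub=%d%n",
          row.getItem(), row.getEstimate(), row.getLowerBound(), row.getUpperBound());
    }
  }
}
```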
Be aware that the reason this sketch is called "Frequent Items" is that there may not be any items that are more frequent than the others (i.e., if they all occur with roughly the same frequency). Returning zero results is therefore possible, which is why the name "TopN" is misleading: there may not be N items to return.
This is why I prefer to call this kind of operation "Frequent Items" or "Heavy Hitters" and not "TopN".
@leerho yes, we do exactly the "old" algorithm that you described :) See #1135. Thank you for letting us know about how the algorithm can be improved! I've opened a separate issue #7187 to track your proposal.
@leventov BTW, I have checked into DataSketches/sketches-core a new sketch I developed specifically with this Druid issue in mind. It is called the Frequent Distinct Tuples (FDT) sketch. I would appreciate it if someone could experiment with it. So there are now two sketches available to help Druid deal with "TopN" issues: the FDT sketch plus the FIS. I can make myself available for a short tutorial on how to get started, if anyone is interested.
@leerho I support your stance in #7187. Unfortunately, though, I won't be able to work on these ideas myself in the near future.
@leerho Do you have any documentation on what the FDT sketch is capable of? I'm curious how it can be mapped onto two important things: use cases and SQL.
@gianm Yes, the initial docs are the Javadocs (duplicated on the website). I am working on more extensive docs, including some accuracy characterization studies, which are more complicated to do. Meanwhile, see the FdtSketch Javadoc.
Also see the unit tests for some really simple examples to get started. But don't use FdtSketch(lgK) to configure the sketch; use FdtSketch(threshold, rse) instead. You can start out by trying numbers like FdtSketch(0.01, 0.05), which means you will capture distincts that make up at least about 1% of the overall population of distincts, with a Relative Standard Error (at 68% confidence) of 5% or better.
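A minimal configuration sketch in Java, assuming the FdtSketch(threshold, rse) constructor described above and an update(String[] tuple) call as in the Javadoc examples; the tuple contents are invented, the query side is covered in the unit tests, and in pre-Apache releases the package was com.yahoo.sketches.fdt:

```java
import org.apache.datasketches.fdt.FdtSketch;

public class FdtConfigExample {
  public static void main(String[] args) {
    // Configure by accuracy goals rather than by lgK:
    //   threshold = 0.01 -> capture tuples whose distinct population is at least ~1% of the total
    //   rse       = 0.05 -> Relative Standard Error of 5% or better (at 68% confidence)
    FdtSketch sketch = new FdtSketch(0.01, 0.05);

    // Each input is a tuple of dimension values; here a hypothetical (country, userId) pair.
    sketch.update(new String[] {"US", "user-123"});
    sketch.update(new String[] {"US", "user-456"});
    sketch.update(new String[] {"CN", "user-789"});
  }
}
```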
@gianm see #7187
This issue elaborates on one of the aspects mentioned in the discussion about Error bounds / probabilities / skewness as first-class Druid query results: time trends for single-valued query types such as topN and groupBy. Results may differ across time intervals, for example:
...
-- China goes to bed, Americas wake up.
When we aggregate just those 5 hours, there is a relative trend for each dimension value (country). We may detect this when aggregating results on the Broker and send an enum along with the value for each dimension value or grouping key:
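For example, something like the following (the names and categories are just illustrative, not an existing Druid type):

```java
// Hypothetical per-key trend marker attached to each grouping key in Broker-merged results.
public enum TrendMarker {
  UPWARD,        // value consistently grows across the aggregated intervals
  DOWNWARD,      // value consistently falls across the aggregated intervals
  HIGH_VARIANCE, // large swings with no consistent direction
  FLAT           // no meaningful change
}
```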
Then in query UIs, consistent upward and downward trends may be represented as small green and red arrows (like those used in stock market interfaces :)
High variance without a particular trend may be indicated by something like an asterisk, a different tone for the row, or an exclamation mark in a circle, indicating that the total aggregate value probably has low statistical significance, or that there is some problem with the data (see #7160).
FYI @mistercrunch @julianhyde @vogievetsky