Incorrect bar ordering for unique count with terms sub aggregation

antoinebaudoux commented 9 years ago

[screenshot: 2015-03-10 at 17 45 15]

stormpython commented 9 years ago

Adding notes to the above issue which was brought up at Elastic{ON}. Essentially, the issue is with the ordering of values in the bar chart for sub aggregations on unique count. The order should be descending by value, but due to the split, the bars are unordered by unique count.

I need to dive into the issue to debug.

stormpython commented 9 years ago

So this seems to be a bug in the vislib. Just reproduced. The response from elasticsearch seems to return the results in the correct order, however, the chart displays the data out of order.

zaakiy commented 9 years ago

+1. I have reproduced this also.

Screenshots from the original report:
[screen shot 2015-03-10 at 17 45 15] https://cloud.githubusercontent.com/assets/5154448/6588348/a74418d8-c74d-11e4-8ca2-5e7283a67845.png
[screen shot 2015-03-10 at 17 45 58] https://cloud.githubusercontent.com/assets/5154448/6588347/a730c67a-c74d-11e4-9d9e-933dd8a4e6eb.png


antoinebaudoux commented 9 years ago

If you look at both screenshots, you can see that the ordering seems to be correct with the split, since it is identical to the ordering without the split. It's more the bar heights that are messed up.

stormpython commented 9 years ago

@ab-taktik yes, that is what I was referring to when I titled it ordering. By default, the bars should be ordered on the x axis in descending fashion.

blop commented 9 years ago

+1

ajrasch commented 9 years ago

+1

antoinebaudoux commented 9 years ago

Hello, any news on this? Do you have any idea what the root cause is?

antoinebaudoux commented 9 years ago

Maybe this has to do with the approximate nature of count/cardinality aggregations, and also with the fact that we take only the top X terms rather than all terms.

stormpython commented 9 years ago

@ab-taktik I think you may be right. By default, Elasticsearch returns the buckets in descending order by doc_count. Therefore, we have been rendering bar charts with this assumption. However, the displayed values do not always follow that order.

Take for example this dataset and this chart:

[screenshot: 2015-03-25 at 4 38 00 pm]

As you can see, the second set of stacked bars in this example should go first. It is not returned first because the total doc_count is higher in the first bar. But when you subtract the sum_other_doc_count from the doc_count to get the value that is actually displayed, it's clear why the first set of stacked bars is smaller than the second set.

Best solution: Re-order the buckets returned from elasticsearch based on doc_count - sum_other_doc_count. I will add the appropriate time table for a fix.
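A minimal sketch of the re-ordering proposed above, assuming each parent bucket carries its sub-aggregation under a key (here the hypothetical name "split") and that the metric is a plain document count. The sample buckets are invented for illustration and are not taken from the dataset in the screenshot.

```python
def reorder_by_displayed_count(buckets, sub_agg_key):
    """Sort parent buckets by the document count that is actually drawn."""
    def displayed(bucket):
        # doc_count covers every document in the parent bucket, while
        # sum_other_doc_count covers the ones left out of the top-N sub-buckets.
        return bucket["doc_count"] - bucket[sub_agg_key]["sum_other_doc_count"]

    return sorted(buckets, key=displayed, reverse=True)


# Hypothetical response fragment; "split" is an assumed sub-aggregation name.
buckets = [
    {"key": "A", "doc_count": 100, "split": {"sum_other_doc_count": 70}},
    {"key": "B", "doc_count": 80, "split": {"sum_other_doc_count": 10}},
]
ordered = reorder_by_displayed_count(buckets, "split")
print([bucket["key"] for bucket in ordered])
```

Bucket A has the larger doc_count but only 30 documents are drawn, so B (70 drawn) moves to the front.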

spalger commented 9 years ago

@stormpython @ab-taktik this is just the way that aggregations work. Here is a hypothetical step-by-step of what's happening in elasticsearch:

1. The x-axis agg defines that the following happens:
   i. the entire result set is split into buckets based on scheduleFull.raw
   ii. the "unique count of user.ids" is calculated for each bucket
   iii. the buckets are sorted in descending order based on the "unique count of user.ids"
   iv. the first 50 buckets are considered the source for the next phase
2. A copy of the split-bars agg begins to execute inside each bucket from step 1(i) individually:
   i. the bucket is split up into sub-buckets based on language.raw
   ii. each sub-bucket calculates its "unique count of user.ids"
   iii. the sub-buckets are sorted in descending order based on the "unique count of user.ids"
   iv. the first 10 sub-buckets are selected and returned in the elasticsearch response

This process is precisely what we are visualizing in the second screenshot, and why we can't just subtract the sum_other_doc_count.

In the outlined steps, "unique count of user.ids" can be replaced with any metric, even "99.99th percentile", and therefore the sum_other_doc_count would not have any relevance.
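The steps above can be imitated on a toy dataset to see where the mismatch comes from. This is only a sketch under stated assumptions: it uses exact sets instead of Elasticsearch's approximate cardinality, the field names "schedule", "language" and "user" are simplified stand-ins, and the documents are made up.

```python
from collections import defaultdict


def unique_users(docs):
    """Exact unique count of users (Elasticsearch's cardinality is approximate)."""
    return len({doc["user"] for doc in docs})


def group_by(docs, field):
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc)
    return groups


def simulate(docs, outer_field, inner_field, outer_size, inner_size):
    bars = []
    for key, bucket_docs in group_by(docs, outer_field).items():
        # Inner phase: split, compute the metric, sort descending, keep top N.
        inner = sorted(
            ((k, unique_users(v)) for k, v in group_by(bucket_docs, inner_field).items()),
            key=lambda item: item[1],
            reverse=True,
        )[:inner_size]
        bars.append({
            "key": key,
            "sort_value": unique_users(bucket_docs),  # what the x-axis is sorted by
            "drawn_height": sum(value for _, value in inner),  # what is stacked
        })
    # Outer phase: sort by the bucket-level metric and keep the top buckets.
    bars.sort(key=lambda bar: bar["sort_value"], reverse=True)
    return bars[:outer_size]


# X: six users, each with a different language. Y: four users, all "en".
docs = [{"schedule": "X", "language": f"lang{i}", "user": i} for i in range(6)]
docs += [{"schedule": "Y", "language": "en", "user": u} for u in range(6, 10)]

bars = simulate(docs, "schedule", "language", outer_size=50, inner_size=2)
for bar in bars:
    print(bar)
```

With only the top 2 languages kept, X is sorted first (6 unique users) yet is drawn with a height of 2, while Y (4 unique users) is drawn with a height of 4.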

@ab-taktik I think what you really want is for step 1(ii) to happen in a third phase, and for it to go more like "the sum of the 'unique count of user.ids' from the selected child buckets is calculated for each bucket", and then for 1(iii) and 1(iv) to use this new metric in order to sort and select the top 50 buckets. This is something that the elasticsearch 2.0 feature "bucket reducers" is aiming to solve. Until it is available, I don't think this is a feature Kibana 4 will support.
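As a rough illustration of that "third phase", the sketch below re-sorts already-returned bars by the sum of their selected child values. The bar structure and the function name are hypothetical, and this is not how Kibana or Elasticsearch implement it. Done on the client, it can only reorder buckets that were returned; it cannot bring back buckets that were cut off when the top 50 were selected.

```python
def resort_by_child_sum(bars, size):
    """Sort bars by the sum of their selected child values and keep the top ones."""
    def child_sum(bar):
        return sum(value for _, value in bar["segments"])

    return sorted(bars, key=child_sum, reverse=True)[:size]


# Hypothetical bars: each segment is (sub-bucket key, metric value).
bars = [
    {"key": "X", "segments": [("lang0", 1), ("lang1", 1)]},
    {"key": "Y", "segments": [("en", 4)]},
]
resorted = resort_by_child_sum(bars, size=50)
print([bar["key"] for bar in resorted])
```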

spalger commented 9 years ago

Another way to think of this problem is that the buckets that create the bars are sorted based on the ordering parameters in the x-axis aggregation:

and the value used for that sorting includes documents that are excluded by the sub aggregation (grey area added to illustrate the excluded documents).

bradvido commented 9 years ago

FWIW I've reproduced this issue without using unique count metrics in https://github.com/elastic/kibana/issues/3734

driskell commented 9 years ago

Reading what @spalger says, it seems to me that the ordering is actually correct. The problem is rather that the Terms Sub Aggregation for Split Bars is incorrectly excluding data, creating what is unarguably a misleading representation of the data. _Sorry about the "what @spalger is saying" - it was rude and badly phrased - I've rephrased! :+1: _

I just did a graph like this with the top 5 browsers across operating systems, and all of a sudden it looked like iOS was the top operating system, but it wasn't... Windows was; it just had so many variations of browser that the chart only showed the top 5.

There should be a part of the bar, which @spalger showed in grey, to show "Other" - this would fix both the ordering (which in my opinion is correct actually) and would fix the misleading representation of data. In my case Windows would jump up with a huge "Other" area, and the iOS would still be there at the end but much much tinier.

Summary: Ordering is fine, but what's happening is "Split Bar" + "Terms" is actually doing a "Filtered Split Bar" and filtering data, taking away all meaning from the original X-Axis aggregation. I can't see why somebody would only want to compare bars containing only the Top 5 entries...
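A small sketch of the "Other" idea, assuming the metric is a plain document count and each document falls into exactly one sub-bucket, so the shown segments plus "Other" add up to the bar total. The function name and the numbers are invented for illustration.

```python
def with_other_segment(bar_total, segments):
    """Append an "Other" segment so the bar reaches its full document count."""
    shown = sum(count for _, count in segments)
    other = bar_total - shown
    if other > 0:
        return list(segments) + [("Other", other)]
    return list(segments)


# Hypothetical counts: Windows has many browsers, so most documents fall outside the top entries.
windows = with_other_segment(1000, [("Chrome", 200), ("Firefox", 150)])
ios = with_other_segment(400, [("Safari", 380), ("Chrome", 20)])
print(windows)
print(ios)
```

Without the extra segment, Windows would be drawn at 350 and iOS at 400. With it, Windows is drawn at its full 1000.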

spalger commented 9 years ago

@driskell I totally agree that we should be able to produce "other" buckets, but the feature must be implemented in elasticsearch first (see https://github.com/elastic/elasticsearch/issues/5324 for progress). Once that is implemented this will be a far less confusing experience. For now, I recommend setting the size of the aggregation to something that makes the most sense for your data.
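For reference, here is a sketch of what a request with adjustable sizes might look like, built as a Python dict. The field names scheduleFull.raw and language.raw come from this thread; the field "user.id" and the aggregation names "schedules", "languages" and "unique_users" are assumptions, and the request Kibana actually generates will differ.

```python
import json


def build_request(outer_size, inner_size):
    """Nested terms aggregations, each ordered by a unique-count metric."""
    unique_users = {"cardinality": {"field": "user.id"}}  # assumed field name
    return {
        "size": 0,  # only aggregation results are needed, not the hits
        "aggs": {
            "schedules": {
                "terms": {
                    "field": "scheduleFull.raw",
                    "size": outer_size,
                    "order": {"unique_users": "desc"},
                },
                "aggs": {
                    "unique_users": unique_users,
                    "languages": {
                        "terms": {
                            "field": "language.raw",
                            "size": inner_size,
                            "order": {"unique_users": "desc"},
                        },
                        "aggs": {"unique_users": unique_users},
                    },
                },
            }
        },
    }


body = build_request(outer_size=50, inner_size=10)
print(json.dumps(body, indent=2))
```

Raising inner_size reduces the number of documents hidden from each bar, at the cost of more segments per bar.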

spalger commented 9 years ago

Looks like https://github.com/elastic/elasticsearch/pull/11042, so we can move forward with #1961.

elastic / kibana

Incorrect bar ordering for unique count with terms sub aggregation #3314