Percentage discrepancy when creating "quick values" pie chart with large data table.

casepie commented 8 years ago

When analyzing the flow logs from my firewall and building a graph of IDS alerts centered around "source_address" (source IP), I get a pie graph and a data table. The problem is that a query often returns 100 or more unique values for "source_address".

When you create a "Quick Values" chart, the pie graph is built from the rows shown in the data table (maximum of 50 IPs). But the percentages in the data table are based on the entire query. So your top IP can show up as 18% in the data table while taking up roughly 70% of the pie graph.

Expected Behavior

One would expect the percentage shown for a given value (in my case, source_address) to be the same in the pie graph as in the data table below it.

Current Behavior

If you have more than 50 unique values for the query in the field used to create your pie graph, then you'll have a discrepancy between the pie graph and the data table on the dashboard widget. The data table appears to still build its percentages based on the entire query results (all 100+ IP addresses).

However, Graylog only shows 50 results for source_address in the data table, and the pie graph appears to calculate the percentage for each value based only on those 50 displayed source_addresses (and not on the full query results).
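To make the discrepancy concrete, here is a small Python sketch with made-up counts. The numbers and the `percentage` helper are hypothetical and are not taken from Graylog; they only illustrate how the same top value gets a different share depending on whether the denominator is the full query result or only the 50 displayed rows.

```python
def percentage(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Hypothetical data: one outlier, 49 other displayed values, and a long tail.
top_count = 1800                        # hits for the top source_address
shown_counts = [top_count] + [16] * 49  # the 50 rows displayed in the data table
tail_counts = [8] * 927                 # unique values that are not displayed

total_all = sum(shown_counts) + sum(tail_counts)  # every hit matched by the query
total_shown = sum(shown_counts)                   # only the displayed rows

table_pct = percentage(top_count, total_all)    # what the data table reports
chart_pct = percentage(top_count, total_shown)  # what the pie slice looks like

print(f"data table: {table_pct}%  pie graph: {chart_pct}%")
```

With these assumed counts the data table reports 18.0% for the top address while its pie slice covers 69.7% of the chart, which matches the kind of gap described above.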

Possible Solution

I would suggest that the pie graph also be calculated and drawn based on the percentages from the full query results, so that it visually matches what is displayed in the data table (i.e., if the data table says that IP 10.10.16.1 accounted for 18% of the results, then that slice should take up about 18% of the pie graph).

Steps to Reproduce (for bugs)

1. This will vary from system to system, but build a query that returns hundreds or thousands of results, with a key field (the one you're going to graph on) that has more than 50 unique values. Ideally, one or two of those values will be outliers, with many more appearances than the others. An ideal type of query for this is "top talkers" on a busy network.
2. Build a "Quick Values" graph based on that key field (in my example, source_address or destination_address).
3. Compare the percentage in the data table to the visual percentage of the pie graph displayed.

Context

Our use case is based on Juniper SRX firewall logs. We capture Intrusion Detection (IDS) logs and then build a dashboard item for "IDS alerts by Source IP". This is a "Quick Values" chart based on "source_address". It usually results in many hundreds of unique values for "source_address", with only a few that are statistically significant (above 3-5%). However, the pie graph looks very skewed when compared to the data table.

Your Environment

Graylog Version: 2.0.3
Elasticsearch Version: 2.3.5-1
MongoDB Version: 2.6.11-1.el7
Operating System: CentOS 7
Browser version: Chrome 51.0.2704

(Attached screenshot: graph_discrepancy)

kroepke commented 8 years ago

Good point: the result apparently ignores the sum of the long-tail terms in the result set. I'll see if we can quickly fix that.

Thanks for your comprehensive report!

kroepke commented 8 years ago

For reference, we need to take into account the sum_other_doc_count return value of the aggregation. The "other" pie chart entry should then be the overflow (the values shown in the "other" table) plus the sum_other_doc_count. Ideally we'd also show the sum_other_doc_count in the table somehow, e.g.:

Others (${sum_other_doc_count} values not shown)
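The idea above can be sketched in Python roughly as follows. This is an illustration only, not Graylog's actual code: `build_pie_slices`, `max_slices`, and the sample response are hypothetical, and the sketch assumes the usual shape of an Elasticsearch terms aggregation result (a `buckets` list with `key` and `doc_count`, plus `sum_other_doc_count`).

```python
def build_pie_slices(terms_agg, max_slices=5):
    """Turn a terms-aggregation result into (label, count, percent) pie slices.

    Buckets beyond max_slices and sum_other_doc_count are folded into a single
    "Others" slice, so every percentage refers to all counted documents.
    """
    buckets = sorted(terms_agg["buckets"], key=lambda b: b["doc_count"], reverse=True)
    shown, overflow = buckets[:max_slices], buckets[max_slices:]

    # Documents in buckets that the aggregation did not return at all.
    not_returned = terms_agg.get("sum_other_doc_count", 0)
    other_count = sum(b["doc_count"] for b in overflow) + not_returned
    total = sum(b["doc_count"] for b in shown) + other_count
    if total == 0:
        return []

    slices = [(b["key"], b["doc_count"]) for b in shown]
    if other_count:
        slices.append(("Others", other_count))
    return [(label, count, round(100.0 * count / total, 1)) for label, count in slices]


# Hand-written sample in the assumed response shape (not real log data).
sample = {
    "buckets": [
        {"key": "10.10.16.1", "doc_count": 1800},
        {"key": "10.10.16.2", "doc_count": 500},
        {"key": "10.10.16.3", "doc_count": 200},
    ],
    "sum_other_doc_count": 7500,
}

pie = build_pie_slices(sample, max_slices=2)
for label, count, percent in pie:
    print(f"{label}: {count} ({percent}%)")
```

In this sample the top address keeps an 18.0% slice because the "Others" slice carries both the overflow bucket and the documents counted in `sum_other_doc_count`.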

kroepke commented 8 years ago

Turns out this was purely a display bug in the pie chart; the data table was OK. We were already using the correct numbers in the data table, but rendered them incorrectly in the chart.

Graylog2 / graylog2-server