Feature: identify statistically significant changes

It would be useful to compare all of the benchmarks run for two different apm-server builds and identify statistically significant changes, similar to what benchstat does.

benchstat works by:

1. Removing outliers using the interquartile range (IQR) method.
2. Using a Mann-Whitney U test or a two-sample Welch t-test to calculate a p-value, which indicates whether the difference between the two samples is statistically significant.
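
For reference, the two steps can be sketched locally in Python using only the standard library. This is a rough approximation, not benchstat's actual implementation: the quartile method, the 1.5 × IQR fences, and the normal approximation used for the Mann-Whitney p-value are all assumptions made for illustration, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, quantiles


def remove_outliers(values):
    """Drop values more than 1.5 * IQR below Q1 or above Q3."""
    if len(values) < 4:
        # Too few points to estimate quartiles meaningfully; keep everything.
        return list(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low <= v <= high]


def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U p-value using the normal approximation.

    Ties count as half a win. No tie correction is applied to the
    variance, so treat the result as approximate for small or tied samples.
    """
    n1, n2 = len(a), len(b)
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


def significant_change(old, new, alpha=0.05):
    """Return (p_value, is_significant) after outlier removal on both samples."""
    p = mann_whitney_p(remove_outliers(old), remove_outliers(new))
    return p, p < alpha
```

A call such as `significant_change(old_samples, new_samples)` would then flag a benchmark/metric pair whenever the p-value falls below the chosen threshold.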

We could do this by creating two transforms, each grouping on apm-server build and benchmark name:

1. The first computes the interquartile range (e.g. with a percentiles or boxplot agg) for each metric (e.g. events_indexed, allocations).
2. The second produces the non-outlier values for each metric, using scripted_metric aggs and the output of the IQR transform.
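
As a rough illustration of the outlier step, the quartiles returned by a boxplot aggregation (`q1` and `q3`) can be turned into fences for a metric. The sketch below builds a plain `range` query clause on the client side instead of the scripted_metric approach proposed above, purely to show the arithmetic; the helper name and the 1.5 × IQR multiplier are assumptions.

```python
def non_outlier_range(boxplot, field, k=1.5):
    """Build a range clause keeping values within k * IQR of the quartiles.

    `boxplot` is the result object of a boxplot aggregation for one
    build/benchmark group; only its "q1" and "q3" keys are used.
    """
    q1, q3 = boxplot["q1"], boxplot["q3"]
    iqr = q3 - q1
    return {"range": {field: {"gte": q1 - k * iqr, "lte": q3 + k * iqr}}}
```

For example, `non_outlier_range({"q1": 10.0, "q3": 20.0}, "events_indexed")` keeps values of events_indexed between -5.0 and 35.0.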

Then, given two apm-server builds, we can run the t_test aggregation on the non-outlier values for each benchmark/metric combination to obtain a p-value.
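
A minimal sketch of what such a search request body might look like, written as a Python helper. The field names `build` and `benchmark`, the aggregation name `p_value`, and the helper itself are hypothetical; the actual names depend on the mapping of the transform's destination index. The `heteroscedastic` type selects the unequal-variance (Welch) variant of the test.

```python
def t_test_request(metric_field, benchmark, build_a, build_b):
    """Build a search body comparing one metric between two builds."""

    def population(build):
        # Each population reads the same metric field, restricted to one build.
        return {"field": metric_field, "filter": {"term": {"build": build}}}

    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"term": {"benchmark": benchmark}},
        "aggs": {
            "p_value": {
                "t_test": {
                    "a": population(build_a),
                    "b": population(build_b),
                    "type": "heteroscedastic",
                }
            }
        },
    }
```

The p-value would then be read from `aggregations.p_value.value` in the response, and the request repeated for each benchmark/metric combination.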

elastic/hey-apm #186