benchflow / analysers

Spark scripts utilised to analyse data and compute performance metrics

Experiment Level Metrics and Statistics #22

Open VincenzoFerme opened 8 years ago

VincenzoFerme commented 8 years ago

In the following I describe the experiment level metrics and statistics we should implement as Spark scripts. They are open for discussion and extension in this thread.

Some background on the type of data we have

We perform multiple trials of the same experiment, making sure that the environment in which we execute the experiment is stable across the trials and that the initial conditions are always the same. This means the behaviour is fairly stable across the different runs, and hence the performance measures are quite similar.

Metrics and Statistics

ToDos

  1. Update the Cassandra schema to accommodate the metrics and statistics defined above
  2. Implement the metrics and statistics defined above. Use Spark wherever possible, or rely on a solid statistics library for the rest (e.g., http://pandas.pydata.org, http://www.scipy.org, http://www.numpy.org). It is important to refactor the current code before proceeding with the implementation.
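To illustrate the kind of per-trial and experiment-level aggregation discussed here, below is a minimal sketch using pandas (one of the fallback libraries named above). The column names, the sample data, and the choice of statistics are illustrative assumptions, not the actual Cassandra schema or the final metric set:

```python
import pandas as pd

# Hypothetical per-trial measurements: a trial id and a response-time sample.
# Column names are illustrative, not the real schema.
samples = pd.DataFrame({
    "trial": ["t1", "t1", "t2", "t2", "t3", "t3"],
    "response_time_ms": [100.0, 110.0, 105.0, 95.0, 102.0, 108.0],
})

# Per-trial statistics (mean, median, 95th percentile).
per_trial = samples.groupby("trial")["response_time_ms"].agg(
    mean="mean",
    median="median",
    p95=lambda s: s.quantile(0.95),
)

# Experiment-level statistics aggregate across the trials. Averaging the
# per-trial means is reasonable here because the trials run under the same
# initial conditions, so their distributions are comparable.
experiment = {
    "mean_of_means": per_trial["mean"].mean(),
    "std_of_means": per_trial["mean"].std(),
    "min": samples["response_time_ms"].min(),
    "max": samples["response_time_ms"].max(),
}
```

The same groupby-then-aggregate shape maps directly onto Spark (`groupBy(...).agg(...)` on a DataFrame) when the data volume requires it.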
Cerfoglg commented 8 years ago

@VincenzoFerme What's described here is all implemented, except for https://github.com/benchflow/analysers/issues/83

VincenzoFerme commented 8 years ago

@Cerfoglg please document here the final set of metrics. For example, the integral and the efficiency are missing, as well as the following ones:

Start from your thesis.

@ivanchikj how and why did we define the aggregate metrics at experiment level for the efficiency?

ivanchikj commented 8 years ago

For the CPU efficiency at experiment level we have defined the aggregate metrics as follows (T1, T2, and T3 are the trials):

We apply the weighted average for CPU and RAM.
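A weighted average across the trials can be sketched as follows. The thread does not show the actual weights, so weighting each trial's mean by its number of samples is an assumption here, as are the example values:

```python
import numpy as np

# Hypothetical per-trial aggregates for trials T1, T2, T3.
# Weighting by sample count is an assumption; the thread does not
# state which weights are actually used.
cpu_means = np.array([0.72, 0.75, 0.70])  # per-trial mean CPU efficiency
weights = np.array([120, 100, 130])       # samples collected per trial

# Experiment-level weighted average: sum(w_i * x_i) / sum(w_i)
weighted_cpu = np.average(cpu_means, weights=weights)
```

The same computation applies to RAM by swapping in the per-trial RAM means.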