dashaasienga / Statistics-Senior-Honors-Thesis


Week 6 Summary and Questions -- QSA (Experimentation) #11

Closed dashaasienga closed 2 months ago

dashaasienga commented 8 months ago

@katcorr

Overview

After completing the tutorial, I started looking into some of the experimentation the researchers performed in order to better understand the Seldonian algorithm in the context of this simple regression problem. We may want to repeatedly run the QSA algorithm using different amounts of data to analyze three main questions (a rough sketch of the experiment loop follows the list):

  1. How much performance (mean square error minimization) is lost due to the behavioral constraints?
  2. How much data does it take for the algorithm to frequently return solutions?
  3. How often does the algorithm exhibit undesirable behavior?
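
For my own organization, here is a rough sketch of the experiment loop I have in mind. This is my own sketch, not the tutorial's code; `generate_data`, `run_qsa`, `run_least_squares`, and `evaluate_mse` are hypothetical stand-ins for the corresponding routines in the notebook.

```python
# Rough sketch of the experiment loop, assuming hypothetical helpers:
#   generate_data(m)        -> synthetic (X, Y) of size m
#   run_qsa(X, Y)           -> fitted solution, or None if "No Solution Found"
#   run_least_squares(X, Y) -> fitted least-squares solution
#   evaluate_mse(solution)  -> ground-truth MSE on a large held-out synthetic set
import numpy as np

def run_experiment(data_sizes, n_trials, generate_data, run_qsa,
                   run_least_squares, evaluate_mse):
    results = []
    for m in data_sizes:                       # amount of training data
        for trial in range(n_trials):          # repeated trials at this data size
            X, Y = generate_data(m)
            qsa_solution = run_qsa(X, Y)       # may be None (no solution returned)
            ls_solution = run_least_squares(X, Y)
            results.append({
                "m": m,
                "trial": trial,
                "qsa_found": qsa_solution is not None,
                "qsa_mse": evaluate_mse(qsa_solution) if qsa_solution is not None else np.nan,
                "ls_mse": evaluate_mse(ls_solution),
            })
    return results
```

Each of the three questions can then be answered by aggregating `results` in a different way.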

After understanding what the researchers had done, I started to reproduce some of their work in the Jupyter notebook, though I ran into a slight roadblock.

Performance Loss

An algorithm that guarantees that it is safe and/or fair will typically have slightly worse performance than an algorithm that focuses purely on optimizing performance.

We first plot performance (MSE) for different amounts of training data.

The MSE plot below is shown with standard error bars. The thin dotted lines indicate the desired MSE range of [1.25, 2]. In this scenario, the behavioral constraints force the MSE to be higher than it would be under pure MSE minimization.

Also notice that the MSE of the solutions returned by QSA starts near the accepted range and tends towards the lower boundary, because the primary objective function encourages solutions with lower MSE.

[Screenshot: MSE vs. amount of training data (Screen Shot 2023-10-16 at 16 44 04)]
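
A minimal sketch of how I might generate a plot like this from the `results` list above (mean MSE per data size with standard error bars; this is my own plotting code, not the tutorial's):

```python
import numpy as np
import matplotlib.pyplot as plt

def mean_and_se(values):
    # Mean and standard error of a list of MSE values (NaN if no trials).
    values = np.asarray(values, dtype=float)
    if len(values) == 0:
        return np.nan, np.nan
    return values.mean(), values.std(ddof=1) / np.sqrt(len(values))

def plot_mse(results, data_sizes):
    qsa_stats = [mean_and_se([r["qsa_mse"] for r in results if r["m"] == m and r["qsa_found"]])
                 for m in data_sizes]
    ls_stats = [mean_and_se([r["ls_mse"] for r in results if r["m"] == m])
                for m in data_sizes]
    plt.errorbar(data_sizes, [s[0] for s in qsa_stats], yerr=[s[1] for s in qsa_stats], label="QSA")
    plt.errorbar(data_sizes, [s[0] for s in ls_stats], yerr=[s[1] for s in ls_stats], label="LS")
    plt.axhline(1.25, linestyle=":", color="gray")   # boundaries of the desired MSE range
    plt.axhline(2.0, linestyle=":", color="gray")
    plt.xscale("log")
    plt.xlabel("Amount of training data (m)")
    plt.ylabel("MSE")
    plt.legend()
    plt.show()
```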

Probability of a Solution

LS always returns a solution. QSA doesn't always return a solution, especially when there is little data, because there may not be sufficient confidence that any solution will satisfy the behavioral constraints. How much data does QSA require to return a solution to this problem?

[Screenshot: probability of returning a solution vs. amount of training data (Screen Shot 2023-10-16 at 16 48 30)]

Notice that the probability stabilizes at 80%, which means we can never be 100% confident that a solution will be returned.
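
A minimal sketch of how the probability of returning a solution could be estimated from the `results` list above (simply the fraction of trials at each data size where QSA returned a solution):

```python
import numpy as np

def solution_rate(results, data_sizes):
    """Fraction of trials at each data size where QSA returned a solution."""
    rates = []
    for m in data_sizes:
        found = [r["qsa_found"] for r in results if r["m"] == m]
        rates.append(np.mean(found))
    return rates
```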

Probability of Undesirable Behavior

Next, we plot the probability that each algorithm produced undesirable behavior. Since LS does not take behavioral constraints into account, it will frequently produce what we have defined to be undesirable behavior.

[Screenshot: probability of undesirable behavior vs. amount of training data (Screen Shot 2023-10-16 at 16 57 44)]

Notice that LS frequently violates the second behavioral constraint, which requires the MSE to be greater than 1.25, because that constraint is in conflict with the objective of minimizing MSE.
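
A minimal sketch of how the violation rate could be estimated for this problem (MSE outside the desired range [1.25, 2.0], judged on the ground-truth MSEs). For QSA, my understanding is that trials where no solution is returned are not counted as violations, though I should double-check that against the tutorial.

```python
import numpy as np

def violation_rate(mses, low=1.25, high=2.0):
    """Fraction of trials whose ground-truth MSE falls outside [low, high].

    For QSA, pass only the trials where a solution was actually returned;
    a "No Solution Found" trial has nothing that can violate the constraints.
    """
    mses = np.asarray(mses, dtype=float)
    violated = (mses < low) | (mses > high)
    return violated.mean()
```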

Running Experiments and Reproducing Results

In the Jupyter notebook, I began to set up the process for running the experiment as described above. The experiment is set up to use parallel computing (I worked with some of this over the summer, so it's not too unfamiliar).

I was able to follow most of the code, but I wasn't yet able to run it to completion. I'm hoping to either: 1) reduce the number of trials to see if I can get it to run to completion in a reasonable amount of time, or 2) move to the cluster for high-performance computing in case the workload is too computationally intensive.
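
As a fallback, one simple way to parallelize the trials on my own machine would be something like the sketch below, using the standard library's `multiprocessing`. Here `single_trial` is a hypothetical top-level function that runs one (data size, trial index) combination and returns its results; this mirrors the general idea of the notebook's parallel setup, not its exact code.

```python
from itertools import product
from multiprocessing import Pool

def run_parallel(data_sizes, n_trials, single_trial, n_workers=4):
    # All (data size, trial index) combinations; single_trial must be a
    # module-level function so it can be pickled and sent to worker processes.
    jobs = list(product(data_sizes, range(n_trials)))
    with Pool(n_workers) as pool:
        results = pool.map(single_trial, jobs)   # distribute trials across workers
    return results
```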

P.S. I haven't worked with the HPC cluster, but this is something Professor Spector suggested would be helpful, just to have in my toolkit. It may be especially helpful for running intensive experiments.

Questions

  1. The authors mention that the MSE of the estimators is computed by generating more data. Here is the quote: "This raises the question: how should we compute the mean squared errors of estimators (both the estimators produced by our algorithm, and the estimators produced by least squares linear regression)? Typically, we won't have an analytic expression for this. However, for synthetic problems we can usually generate more data. We will evaluate the performance and safety/fairness of our algorithm by generating significantly more data than we used to train the algorithm. Specifically, in our implementation we train with up to 65,536 observations, and use 100 times as many points (around 6,500,000 samples) to evaluate the generated solutions." Do you understand what they mean by this? (I've included a small sketch of my current interpretation after the questions.)
  2. We set our deltas to 0.1, which specifies that the probabilities of undesirable behavior should both be below 0.1 for the QSA (or at least approximately 0.1, due to the reliance on the normality assumption when using Student's t-statistic). What does this second part mean? How does the normality assumption affect what we'd expect to see?
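
For Question 1, here is a small sketch of how I'm currently reading the "generate more data" point: for a synthetic problem, the true MSE of any returned estimator can be approximated by scoring it on a much larger, freshly generated dataset (about 100x the training size). `generate_data` and `predict` are hypothetical stand-ins, not the tutorial's function names.

```python
import numpy as np

def ground_truth_mse(solution, generate_data, predict, n_train, factor=100):
    # Generate a large, fresh synthetic evaluation set (e.g., 100x the
    # training size) and use it to approximate the estimator's true MSE.
    X_eval, Y_eval = generate_data(factor * n_train)
    predictions = predict(solution, X_eval)
    return np.mean((predictions - Y_eval) ** 2)
```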

Supplementary

This is a screenshot of a statement the authors made about the probability of returning a solution. Not sure if we'll look into this, but I decided to include it here anyway, just in case :)

[Screenshot: authors' statement on the probability of returning a solution (Screen Shot 2023-10-16 at 16 51 36)]
dashaasienga commented 8 months ago

https://aisafety.cs.umass.edu/tutorial7py.html