dashaasienga / Statistics-Senior-Honors-Thesis


Week 3 Summary and Questions -- The Seldonian Algorithm #8

Closed dashaasienga closed 2 months ago

dashaasienga commented 9 months ago

@katcorr

Summary

This week, I read the Seldonian paper as well as a few supporting resources.

At a high level, the work is premised on the notion that if 'unfair' or 'unsafe' outcomes or behaviors can be defined mathematically, then it should be possible to create algorithms that learn from data how to avoid these unwanted results with high confidence. In other words, a Seldonian algorithm a returns, with high probability, a solution that does not break any of the pre-specified behavioral constraints.
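Written out, the guarantee at the heart of the framework (using the paper's notation, where a is the algorithm, D the data, g_i the i-th behavioral constraint, and delta_i its tolerance) is:

```latex
\Pr\big(g_i(a(D)) \le 0\big) \ge 1 - \delta_i, \qquad \text{for each } i \in \{1, \dots, n\}
```

Here g_i(a(D)) <= 0 means "the solution returned on data D does not exhibit the i-th undesirable behavior," and each constraint must hold with probability at least 1 - delta_i.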

The first step is to define mathematically what the algorithm should do. At an abstract level, this is the same across all ML.

Screen Shot 2023-09-25 at 16 42 19

One problem with the standard ML approach is that the user must encode constraints on the algorithm’s behavior in the feasible set or objective function:

It's important to shift this burden from the user to the designer of the algorithm, because ML algorithms are often used in critical applications by people who are experts in their own field but may not be experts in machine learning and statistics.

Screen Shot 2023-09-25 at 16 44 09

Note that D is a random variable and the only source of randomness in the statements regarding probability. Using this framework for designing ML algorithms involves three steps:

  1. Define the goal of the algorithm design process, i.e., the properties that the designer wants the resulting algorithm a to have. This is also known as a Seldonian optimization problem. It contrasts with the standard ML approach, which defines the goal as providing a solution, rather than an algorithm, with a given set of properties.
Screen Shot 2023-09-25 at 16 47 30
  2. Define the interface that the user will use. The user should have the freedom to specify one or more behavioral constraints that capture their own definition of undesirable behavior, so a should be compatible with many different definitions of undesirable behavior without requiring the user to have knowledge of the distribution of D.

  3. Create the algorithm. The designer creates an algorithm, a, that is a (possibly approximate) solution to the equation in step 1 and that allows for the class of constraints chosen in step 2.

The Seldonian framework requires a to satisfy only the probabilistic constraints while attempting to optimize f.
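As I understand it, implementations of this framework typically split the data, optimize a candidate solution on one split, and then certify the behavioral constraint on a held-out split before returning anything. Here is a minimal sketch of that two-phase structure, using a one-sided Hoeffding upper confidence bound for the safety check. This is my own toy illustration, not the toolkit's actual API; the function names (`seldonian_sketch`, `g_hat`, etc.) are hypothetical.

```python
import numpy as np


def hoeffding_upper_bound(samples, delta, lo, hi):
    """One-sided Hoeffding upper confidence bound on the true mean of
    i.i.d. samples known to lie in [lo, hi]; holds with prob. >= 1 - delta."""
    n = len(samples)
    return np.mean(samples) + (hi - lo) * np.sqrt(np.log(1.0 / delta) / (2.0 * n))


def seldonian_sketch(data, candidate_selection, g_hat, delta, lo, hi, split=0.6):
    """Toy two-phase Seldonian structure: pick a candidate on one data
    split, then run a safety test on the held-out split.

    g_hat(theta, d) must return per-example estimates of the constraint
    g, each bounded in [lo, hi]; g <= 0 means 'no undesirable behavior'.
    """
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(data))
    n_cand = int(split * len(data))
    d_cand, d_safe = data[idx[:n_cand]], data[idx[n_cand:]]

    # Phase 1: candidate selection, using only the candidate split.
    theta = candidate_selection(d_cand)

    # Phase 2: safety test. Return the candidate only if the
    # high-confidence upper bound certifies g(theta) <= 0.
    ub = hoeffding_upper_bound(g_hat(theta, d_safe), delta, lo, hi)
    return theta if ub <= 0.0 else None  # None = "No Solution Found"
```

Note the asymmetry: when the safety test cannot certify the constraint, the algorithm refuses to return a solution rather than returning an unsafe one.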

Screen Shot 2023-09-25 at 16 53 47

Using the Seldonian algorithm greatly reduced the difference in GPA prediction error between male and female students, as seen below.

Screen Shot 2023-09-25 at 16 54 28

To ensure that the algorithm is compatible with a variety of definitions of fairness, they presented a Seldonian classification algorithm. State-of-the-art fair algorithms produced unfair behavior under at least one definition of fairness. However, the Seldonian algorithm properly limited the specific form of unfair behavior across all trials. Unlike standard ML approaches, this framework provides probabilistic guarantees that the resulting classifier is acceptably fair when applied to unseen data.
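To make "a definition of fairness" concrete for myself: each definition becomes a constraint function g that is positive when the behavior is unacceptable and non-positive otherwise. Here is one hypothetical example of such a function (a disparate-impact-style gap in positive-prediction rates); the name and the threshold `eps` are my own, not from the paper.

```python
import numpy as np


def disparate_impact_g(y_pred, group, eps=0.2):
    """One possible behavioral-constraint function g: positive when the
    gap in positive-prediction rates between two groups exceeds eps,
    non-positive when the classifier is 'acceptably fair' by this definition."""
    rate0 = np.mean(y_pred[group == 0])
    rate1 = np.mean(y_pred[group == 1])
    return abs(rate0 - rate1) - eps
```

Swapping in a different g (equalized odds, predictive equality, etc.) changes the definition of fairness being enforced without changing the rest of the algorithm, which is how the interface in step 2 stays flexible.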

Screen Shot 2023-09-25 at 16 55 40

The Seldonian toolkit has an Experiments library that we can play around with. The library helps the developer understand the tradeoffs between this model and standard ML models and supports major Python ML libraries as well as Microsoft Fairlearn’s fairness-aware classification algorithms for easier integration and comparisons. It produces visuals like these:

Screen Shot 2023-09-25 at 16 57 57

We can use this toolkit to run some simulations with synthetic data we create and also assess impact!
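For those simulations, one thing we'll need is synthetic data where the amount of unfairness is a knob we control. A minimal sketch of what I have in mind (entirely hypothetical, not toolkit code): a binary protected attribute whose base label rate is shifted by a `bias` parameter.

```python
import numpy as np


def make_synthetic(n=20000, bias=0.4, seed=0):
    """Toy synthetic-data generator: `group` is a binary protected
    attribute, and the label rate differs between groups by `bias`,
    so the unfairness baked into the data is tunable."""
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n)                  # protected attribute
    x = rng.normal(loc=0.5 * group, scale=1.0, size=n)  # feature correlated with group
    p = 0.5 + bias * (group - 0.5)                      # label rate: 0.5 +/- bias/2
    y = (rng.random(n) < p).astype(int)
    return x, group, y
```

Sweeping `bias` would let us measure how standard and Seldonian classifiers respond to increasingly unfair training data.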

Finally, Seldonian algorithms have been developed for contextual bandits, for settings where the training data and deployment data come from different distributions, and to enforce measures of long-term fairness.

I read about how the algorithm works in much more detail, but this is an overview of the framework. Definitely looks like we'll be working with Python!

Questions

  1. The Safety Test routine uses standard statistical tools such as Student's t-test and Hoeffding’s inequality. Are you familiar with Hoeffding’s inequality?
  2. The researchers write: "We call an algorithm quasiSeldonian if it relies on reasonable but false assumptions, such as appeals to the central limit theorem." How would one define a "reasonable but false assumption"?
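On question 1: Hoeffding's inequality bounds how far a sample mean of bounded i.i.d. variables can stray from the true mean. A small numerical sanity check of the two-sided form, P(|mean - E| >= t) <= 2 exp(-2 n t^2 / (hi - lo)^2), just to convince myself (the bound is loose but requires no distributional assumptions, which is why it suits the Safety Test):

```python
import numpy as np


def hoeffding_bound(n, t, lo=0.0, hi=1.0):
    """Two-sided Hoeffding bound for the mean of n i.i.d. samples in [lo, hi]:
    P(|sample mean - true mean| >= t) <= 2 exp(-2 n t^2 / (hi - lo)^2)."""
    return 2.0 * np.exp(-2.0 * n * t**2 / (hi - lo) ** 2)


# Empirical check: uniform [0, 1] samples, true mean 0.5.
rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 5000
samples = rng.random((trials, n))
viol = np.mean(np.abs(samples.mean(axis=1) - 0.5) >= t)
# viol should never exceed hoeffding_bound(n, t) ~= 0.27
```

This contrasts with the quasi-Seldonian case in question 2: a CLT-based t-interval assumes the sample mean is normally distributed, which is only approximately true at finite n, hence "reasonable but false."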

Next Steps

  1. Play around with the Seldonian toolkit and gain familiarity with the experiments library -- comparisons with standard ML algorithms can allow us to also measure the difference in bias, which can allow us to model long-term impact.
  2. Read the referenced paper that uses Seldonian algorithms to enforce measures of long-term fairness (to get an idea of how we can assess impact).
  3. Look into resources shared by Phil Thomas through email.

Supplementary

Other key questions that pop up are:

  1. Which fairness definitions are possible to enforce simultaneously?
  2. What unexpected behavior might result from a particular definition of fairness?
  3. How much or how little might different definitions impact outcomes?