dashaasienga / Statistics-Senior-Honors-Thesis


Simulation Study Ideas #27

Closed dashaasienga closed 6 months ago

dashaasienga commented 7 months ago

@katcorr

I read through the simulation article, and it was really helpful for carving out ideas for the simulation section of my thesis. We'll talk about this tomorrow, but here are my ideas in case you'd like to review them earlier.

IDEAS

Following the ADEMP structure:

Aim: Proof of Concept.

I'd like this simulation study to serve as a proof of concept for the Seldonian framework and an evaluation of the method, rather than a test of when it fails. We already know it can fail, especially with insufficient data, and that the solutions returned are probabilistic. Given these limitations, I'd like to explore its performance specifically within the classification context.

Data-Generation Mechanism:

I'd like to follow a realistic mechanism that we'd expect in practice for the data generation.

Thinking about the characteristics of the COMPAS data set, I imagine a set of variables $X = (X_1, X_2, ..., X_n)$.

Some, or all, of the $X_i$'s have moderate to strong correlations with $A$, the sensitive variable. This can test performance when proxies are present in the model, even when the sensitive variable itself is not. My thinking is that many societal factors, such as the ones included in the COMPAS data set, have some correlation with race, gender, etc.

$A$ has two levels, akin to Black vs. White or male vs. female.

Finally, the $X_i$'s and $A$ are correlated with $Y$, the target variable, highlighting the case where the target variable lacks "objectivity." This speaks to the fact that the data being fed into some of these algorithms is already biased, and this bias can be further perpetuated. Let's talk more about this.

We could vary the prevalence of $Y$ across the levels of $A$ to (hopefully) show how optimizing performance for one group can disadvantage the other, and to examine how the results change as the group-specific prevalences of $Y$ become more or less different.

We could also vary the sample size (though we did a bit of this in Chapter 2 in the regression setting) and/or the covariate correlations (to test the extent to which proxy relationships affect the performance of Seldonian algorithms).

I wonder if this setup is statistically sound. I'd also like to talk about how I'd generate this in practice.
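
In the meantime, here is a minimal sketch of one way this mechanism could be generated in R. All coefficients, correlation strengths, and prevalences below are placeholder values, not proposals:

library(dplyr)

set.seed(93)
n <- 5000

# binary sensitive variable A (e.g., two racial groups)
a <- rbinom(n, 1, 0.5)

# proxies: x1 and x2 shift with A (moderate-to-strong correlation); x3 is independent of A
x1 <- rnorm(n, mean = ifelse(a == 1, 1.0, 0), sd = 1)
x2 <- rnorm(n, mean = ifelse(a == 1, 0.5, 0), sd = 1)
x3 <- rnorm(n)

# Y depends on the X's and directly on A; the intercept and the coefficient on a
# control the group-specific prevalence of Y
lin_comb <- -0.5 + 0.8 * x1 + 0.4 * x2 + 0.3 * x3 + 0.6 * a
y <- rbinom(n, 1, plogis(lin_comb))   # plogis() is the inverse logit

sim_data <- tibble(a, x1, x2, x3, y)

# sanity check: prevalence of Y within each level of A
sim_data %>% group_by(a) %>% summarize(prev = mean(y))

Varying the ifelse() shifts would change the proxy strength, and varying the coefficient on a would change how different the prevalences of Y are across the two groups.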

Estimands/Targets of Study:

I'm a bit confused about what the 'target $\theta$' would be in this scenario, and what its 'true' value would be, since the study's aim is to test the performance of the Seldonian algorithm in the classification setting. That sounds more like the 5th step, so I'd like to talk about the difference between these two steps.

Methods:

I'd like to compare the standard ML approach (logistic regression?) to the Seldonian approach to start.

The fairness definition I'm most interested in is separation (equal error rates). I'm less interested in whether the model, on average, predicts the same for both groups, and more in ensuring that the model is not over-predicting or under-predicting for one demographic group, so the focus would mostly be the FNR and FPR (with precision perhaps as an extension). It might also make sense to define a discrimination statistic and record its value for the models returned by both approaches; one candidate form is sketched below.
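
For concreteness, a minimal sketch of one such statistic, assuming we define it as the sum of the absolute FPR and FNR gaps between the two levels of $A$ (the exact form is up for discussion):

# candidate discrimination statistic: |FPR gap| + |FNR gap| across groups;
# y = true 0/1 labels, y_hat = 0/1 predictions, a = binary sensitive attribute
disc_stat <- function(y, y_hat, a) {
  fpr <- function(yy, hh) mean(hh[yy == 0])       # share of true 0's predicted 1
  fnr <- function(yy, hh) mean(1 - hh[yy == 1])   # share of true 1's predicted 0
  abs(fpr(y[a == 1], y_hat[a == 1]) - fpr(y[a == 0], y_hat[a == 0])) +
    abs(fnr(y[a == 1], y_hat[a == 1]) - fnr(y[a == 0], y_hat[a == 0]))
}

Recording this for the models returned by both approaches on the same data sets would give a directly comparable fairness measure.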

We'd feed the same simulated data sets (generated above) to both models and record some performance measures as well.

We can also extend this to decision trees and random forests, but it makes sense to start small and then build up if needed.

Performance Measures:

In terms of performance measures, I'd like to look at:

For the results, I'm thinking of 2-dimensional tables (methods x data-generation mechanism), one for each performance measure, as below:

[screenshot: example 2-dimensional results table]

I'm also considering graphical displays such as these:

[screenshot: example graphical displays]

QUESTIONS

  1. How do we deal with Seldonian models that don't converge? Do we average over all the trials or just the convergent ones?

  2. What would be a good place to start in terms of generating the data and thinking about the appropriate family of distributions?

  3. What's the difference between this simulation study's target and performance measures?

  4. Any guidance on how to set up the motivation for the simulation study and the write-up (before actually performing the study itself)? I'd like to work on finalizing ideas and the first draft of the write-up this week in accordance with the plan (#25) and then actually begin coding the simulation study next week.

Lots of ideas here, but I'm excited to flesh them out during our meeting tomorrow!!

dashaasienga commented 6 months ago

@katcorr

The ideas here are still generally aligned with what I'd like to explore for the simulation study, but I have some more ideas for the data generation section given what we learned from the application study. I'd love to discuss these and finalize tomorrow. I won't be on campus next Tuesday during Spring break, but I'll still do some thesis work periodically. The main thing on the agenda tomorrow: I'd love to get to a good place on the simulation plan so that I can get a head start this week and hopefully have a decent portion completed by the time we come back from break :))

dashaasienga commented 6 months ago

@katcorr

See https://github.com/dashaasienga/Statistics-Senior-Honors-Thesis/blob/main/R/Simulation-DataGen.pdf, where I tried out a few different data-generation mechanisms. I think it really comes down to Method 1 or Methods 5/6 (5 and 6 are essentially versions of the same thing).

Method 1:

This method randomly samples 1250 observations from each combination of recidivism and race (Black defendants who recidivate and those who do not, and White defendants who recidivate and those who do not) to create a data set of 5000 observations, so class balance is perfectly achieved. Using just 2 predictors, age (continuous) and prior_offense (categorical), we get 60% accuracy: 10 percentage points higher than what a random model would achieve (50%). Our previous data was only 4 percentage points higher in accuracy (70%) than what a blind majority-class model would achieve (66%), so in some sense this is better and provides more room for the Seldonian results to vary.

Discrepancy across race is also preserved, signifying that there is room to demonstrate the Seldonian algorithm's ability to satisfy fairness constraints.

[screenshot: recidivism discrepancy across race in the Method 1 data]
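
For reference, the core of Method 1 is a stratified resample, roughly like the sketch below, where compas stands in for the cleaned parent data (this is not the exact code from the PDF):

library(dplyr)

set.seed(93)

# sample 1250 observations from each race x recidivism cell (5000 rows total),
# so class balance is exact by construction
compas_sim_balanced <- compas %>%
  filter(race %in% c("African-American", "Caucasian")) %>%
  group_by(race, is_recid) %>%
  slice_sample(n = 1250, replace = TRUE) %>%
  ungroup()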

Methods 2, 3 & 4:

Method 2 used the linear-combination method to generate Y. This influenced the distribution of Y across race (because of the proxy relationships), as seen in the bar plot below. However, this model's accuracy was very high: 99.74%. In Method 3, I balanced the recidivism prevalence by race in the Method 2 data set using the same approach as in Method 1, but the accuracy was still 99.56%, so these are unusable. In Method 4, I played around with the coefficients a bunch and somehow got up to 100% accuracy. Figuring out which coefficients would get the model to produce some error seemed futile, I think because I was essentially generating Y with the same predictors I'd later use to fit the model.

[screenshot: bar plot of the distribution of Y across race under Method 2]

Methods 5 & 6:

In these methods, I add an error term to the linear combination before passing it through the inverse-logit function and rbinom(). The data set is still balanced, but this introduced some error: the accuracy is now 93.5%. There were limits to how much error I could induce, though; too much, and NAs would be produced when passing through rbinom().
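
Roughly what Methods 5/6 do, as a self-contained sketch (the stand-in predictors and coefficients are mine). My guess on the NAs: for large positive values of the noisy linear combination, exp() overflows to Inf, the hand-rolled inverse logit returns NaN, and rbinom() propagates that as NA; the built-in plogis() computes the same inverse logit stably:

set.seed(93)
n <- 5000

# stand-in predictors (the real ones come from the balanced COMPAS sample)
age <- rnorm(n, 35, 10)
prior_offense <- rbinom(n, 1, 0.4)

# noisy linear combination; the rnorm() term is the added error
lin_comb <- -0.34 - 0.1 * age + 0.5 * prior_offense + rnorm(n, 0, 2)

# exp(x) / (1 + exp(x)) returns NaN once exp(x) overflows (x > ~709);
# plogis(x) is the same quantity computed stably
probs <- plogis(lin_comb)
is_recid_sim <- rbinom(n, 1, probs)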

When I add 1 error term, the discrimination statistic is ~0.055. When I add 2 error terms, the discrimination statistic is ~0.065:

[screenshot: discrimination statistics with 1 vs. 2 error terms]

Thoughts

I wonder if I'm missing something. I'm not sure there's a way to introduce more error into Methods 5/6; if not, it seems like Method 1 is probably the better scheme? I'd love to hear your thoughts on this before I proceed.

katcorr commented 6 months ago

Hmmmm, that's strange that the predictive accuracy is that high in Method 2 - you shouldn't have to add random noise to the linear combination when setting p in the binary case (although I don't think it's wrong/hurts, either). I just took a quick skim of the code, so perhaps I'm mistaken, but it looks like the coefficient for age is set such that a one-year increase in age corresponds to an exp(24) > 26 billion-fold increase in the odds of the outcome. If the relationships are set that strong, that could be the issue / why it's basically perfectly predictive?

That being said, it sounds like Method 1 works well "enough" and I think it's reasonable to continue down that direction instead of trying to figure out the other one(s). You did a TON of work! I appreciate the clear explanations stepping through your process in the PDF.

dashaasienga commented 6 months ago

@katcorr

I tried a bunch of different values for the coefficients without the noise, all of which resulted in either a non-convergent glm or an unrealistically high accuracy. I ended up running a loop through plausible values for the coefficients, aiming for an accuracy below 80%, and the loop never found a plausible value that worked. See below:

library(dplyr)
library(broom)

set.seed(93)

# loop through values of coeff_age and coeff_po
coeff_age <- 1
coeff_po <- 0.5
coeff_po_not <- 0

while (TRUE) {
  # define a linear combination of predictors as desired
  linear_combination <- -0.34 - coeff_age * compas_sim_balanced$age +
    ifelse(compas_sim_balanced$prior_offense == 1, coeff_po, coeff_po_not)

  # pass through an inverse-logit function
  probs <- exp(linear_combination) / (1 + exp(linear_combination))

  # generate 5000 Bernoulli RVs for y
  is_recid_sim <- rbinom(5000, 1, probs)

  # for each value of coeff_age, check values of coeff_po up to 1000
  if (coeff_po > 1000) {
    coeff_age <- coeff_age + 1
    coeff_po <- 0
    print(paste0("Coeff Age: ", coeff_age))
    next
  }

  # if the number of observations in either recidivism level is too low, continue;
  # this caused sampling issues when inducing balance if not addressed
  if (sum(is_recid_sim == 1, na.rm = TRUE) < 500 |
      sum(is_recid_sim == 0, na.rm = TRUE) < 500) {
    coeff_po <- coeff_po + 5
    print(coeff_po)
    next
  }

  print("Data set produced")

  # join to original data frame
  compas_sim_balanced_7 <- cbind(compas_sim_balanced, is_recid_sim)

  # induce balance: 1250 observations per race x recidivism cell
  compas_b_y <- compas_sim_balanced_7 %>%
    filter(race == "African-American" & is_recid_sim == 1)

  compas_b_n <- compas_sim_balanced_7 %>%
    filter(race == "African-American" & is_recid_sim == 0)

  compas_w_y <- compas_sim_balanced_7 %>%
    filter(race == "Caucasian" & is_recid_sim == 1)

  compas_w_n <- compas_sim_balanced_7 %>%
    filter(race == "Caucasian" & is_recid_sim == 0)

  compas_b_y_balanced <- compas_b_y[sample(nrow(compas_b_y), 1250, replace = TRUE), ]
  compas_b_n_balanced <- compas_b_n[sample(nrow(compas_b_n), 1250, replace = TRUE), ]
  compas_w_y_balanced <- compas_w_y[sample(nrow(compas_w_y), 1250, replace = TRUE), ]
  compas_w_n_balanced <- compas_w_n[sample(nrow(compas_w_n), 1250, replace = TRUE), ]

  compas_sim_balanced_7 <- rbind(compas_b_y_balanced,
                                 compas_b_n_balanced,
                                 compas_w_y_balanced,
                                 compas_w_n_balanced)

  # shuffle the rows
  compas_sim_balanced_7 <- compas_sim_balanced_7[sample(nrow(compas_sim_balanced_7),
                                                        5000,
                                                        replace = FALSE), ]

  # fit GLM
  glm7 <- glm(is_recid_sim ~ age + prior_offense,
              data = compas_sim_balanced_7,
              family = binomial(logit))

  print("GLM converged")

  glm7augment <- glm7 %>%
    broom::augment(type.predict = "response")
  glm7augment <- mutate(glm7augment, binprediction = round(.fitted, 0))
  conf_matrix <- with(glm7augment, table(is_recid_sim, binprediction))

  # check the accuracy: conf_matrix[2] and conf_matrix[3] are the off-diagonal
  # (misclassified) counts, so >= 1000 of 5000 means accuracy <= 80%;
  # stop there, otherwise keep searching
  if (conf_matrix[3] + conf_matrix[2] >= 1000) {
    break
  } else {
    coeff_po <- coeff_po + 5
  }

  print(paste0("Coeff Age: ", coeff_age))
  print(coeff_po)
}

coeff_age
coeff_po
conf_matrix

# I want conf_matrix[3] + conf_matrix[2] >= 1000, i.e. accuracy at or below 80%

Eventually, I decided to go with Method 1, but my code is adjustable so that I can easily switch data-generation methods, since each process is independent.

As discussed, I went through a single run with the simulated data, with a special focus on finding ways to easily retrieve and save the key performance measures as objects. This makes things easier and more automated when scaling the process, rather than having to compute some things manually like I did for the application. See the results of this process here: https://github.com/dashaasienga/Statistics-Senior-Honors-Thesis/blob/main/R/Simulation_SingleRun.pdf.
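
As an illustration of that idea, here is one way a single trial could be collapsed into a one-row summary; the function and column names here are hypothetical, not the ones in the PDF:

library(dplyr)

# collapse one fitted model's predictions into a one-row summary of
# accuracy plus the FPR/FNR gaps across race
extract_metrics <- function(data, probs, trial_id) {
  data %>%
    mutate(pred = as.integer(probs >= 0.5)) %>%
    summarize(
      trial    = trial_id,
      accuracy = mean(pred == is_recid_sim),
      fpr_gap  = abs(mean(pred[is_recid_sim == 0 & race == "African-American"]) -
                     mean(pred[is_recid_sim == 0 & race == "Caucasian"])),
      fnr_gap  = abs(mean(1 - pred[is_recid_sim == 1 & race == "African-American"]) -
                     mean(1 - pred[is_recid_sim == 1 & race == "Caucasian"]))
    )
}

# usage, e.g.: probs <- predict(glm_fit, type = "response")
#              row <- extract_metrics(sim_data, probs, trial_id = 1)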

There were some interesting results, a bit different from the application, that I'd love to discuss. I think scaling the simulation (maybe to 50 or 100 trials?) will be a bit more straightforward now but we can finalize those details tomorrow during our meeting!

dashaasienga commented 6 months ago

@katcorr

The new simulation parent data set has a logistic regression accuracy of 78%, perfect class balance, and a discrimination statistic of 0.2464, which is good. I re-ran a single trial of the simulation here: https://github.com/dashaasienga/Statistics-Senior-Honors-Thesis/blob/main/R/Simulation_SingleRun.pdf.

I think these results are more expected and in line with the results from the application!

Accuracy:

[screenshot: accuracy results]

Discrimination:

[screenshot: discrimination results]

Visualization:

[screenshot: visualization of the single-trial results]

Next Steps:

I'm currently working on scaling this to see how it performs over many trials and different sample sizes.

dashaasienga commented 6 months ago

@katcorr

This R script runs a simulation that samples data sets of varying sample sizes from the parent simulation data set, saves them for later use with the Seldonian algorithms, fits a logistic regression on each of them, and saves the results: https://github.com/dashaasienga/Statistics-Senior-Honors-Thesis/blob/main/R/Simulation/LogisticRegression/Simulation_GenData_FitLR.R

The LR results are saved here for the four different sample sizes: https://github.com/dashaasienga/Statistics-Senior-Honors-Thesis/tree/main/R/Simulation/LogisticRegression/Results

And the data sets are saved here for reproducibility in Python for the Seldonian algorithms: https://github.com/dashaasienga/Statistics-Senior-Honors-Thesis/tree/main/Data%20Sets/SimulationData

Running 50 trials per data set size is pretty quick for the LR. I'll see how long that takes on the SA algorithms when I run those tomorrow (hopefully, there won't be any major hurdles). If the simulation is not too computationally burdensome, we can definitely increase the number of trials per sample size.
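
For reference, the overall shape of that script is something like the sketch below; the sample sizes and file paths are placeholders (the real values are in the linked script), and it reuses the hypothetical extract_metrics() from earlier:

library(dplyr)

set.seed(93)

sample_sizes <- c(500, 1000, 2500, 5000)   # placeholder values
n_trials <- 50
lr_results <- list()

for (size in sample_sizes) {
  for (trial in 1:n_trials) {
    # sample a data set from the parent simulation data and save it
    # for the Python/Seldonian side of the pipeline
    sim_data <- slice_sample(compas_sim_balanced, n = size)
    write.csv(sim_data, sprintf("SimulationData/sim_n%d_trial%d.csv", size, trial),
              row.names = FALSE)

    # fit the logistic regression and record its metrics
    fit <- glm(is_recid_sim ~ age + prior_offense,
               data = sim_data, family = binomial(logit))
    probs <- predict(fit, type = "response")
    lr_results[[length(lr_results) + 1]] <-
      mutate(extract_metrics(sim_data, probs, trial_id = trial), sample_size = size)
  }
}

lr_results <- bind_rows(lr_results)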

dashaasienga commented 6 months ago

@katcorr

This is the Python script that runs through a single data set, performs some data pre-processing, fits 4 Seldonian algorithms on the data set (for each $\epsilon = 0.2, 0.1, 0.05, 0.01$), extracts useful metrics, and saves the results as a single row in a data frame: https://github.com/dashaasienga/Statistics-Senior-Honors-Thesis/blob/main/Python/COMPAS%20Simulation/SeldonianSimulation/seldonian_sim.py.

This is the sbatch file that loops through the 200 data sets and runs the above Python file on each one, appending the results from each iteration to a data set: https://github.com/dashaasienga/Statistics-Senior-Honors-Thesis/blob/main/Python/COMPAS%20Simulation/SeldonianSimulation/seldonian_sim.sb. Andy suggested this method for running the simulation on the cluster, and I thought it worked pretty well for the problem at hand! It took about an hour to run through all 200 trials on 2 CPU nodes; more CPU nodes might make it faster.

This is the final results data set created at the end of the simulation: https://github.com/dashaasienga/Statistics-Senior-Honors-Thesis/blob/main/Python/COMPAS%20Simulation/SeldonianSimulation/results/seldonian_sim_results.csv. For some reason, there are rows that are exact duplicates of each other; maybe that's contributing to the long run time. It was an easy fix, though: calling distinct() in R was all that was needed to drop the file to the correct number of rows, 200.
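
For the record, the dedupe was roughly:

library(dplyr)

seldonian_results <- read.csv("results/seldonian_sim_results.csv") %>%
  distinct()   # collapses the exact-duplicate rows down to the expected 200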

Either way, I was able to complete the simulation, finally! And we can make more edits if need be after our discussion tomorrow!

dashaasienga commented 6 months ago

@katcorr

I increased the reps to 250 per sample size, so there are now 1000 total data sets. It took the better part of yesterday evening to run the simulation, but it completed successfully! The trends are still pretty consistent, but at least this way, we can be a little more confident in the results.

The results data set: https://github.com/dashaasienga/Statistics-Senior-Honors-Thesis/tree/main/Python/COMPAS%20Simulation/SeldonianSimulation/results