Chapter 2 Outline/ Draft/ Questions

dashaasienga commented 9 months ago

@katcorr

We finally completed the STAT 495 paper, so I now have more time to work on Chapter 2. (I'm sorry for the delay with this; I just didn't expect the last week of classes to be as hectic as it's been :/)

P.S. I have no problem working on this during this week and next to make up for the time I wasn't able to spend on it last week. I only have 1 exam, so splitting my time before break between thesis writing and studying for the exam isn't a problem for me :)

I'm using this issue to think through an outline for Chapter 2 just so I can be more organized in the writing process. I'd love any feedback regarding this!

The Standard ML Approach

Before diving into the mechanics of the Seldonian framework, I'm thinking of starting this section with an overview of the standard ML approach, discussing objective functions and how they are maximized or minimized to arrive at a solution.

With this introduction, I'll discuss the limitations of the approach from a technical standpoint, laying a foundation for understanding why we observe algorithmic unfairness in the first place. I imagine going into detail about how objective functions are designed to maximize overall performance, which can hurt minority demographic groups, particularly if the relationship between the features and the outcome differs across groups.
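A minimal sketch of that limitation, using only the Python standard library. The group sizes, slopes, and noise level are hypothetical values chosen for illustration, not anything from the thesis or the Science paper. It fits one least-squares line to pooled data from two groups whose feature-outcome relationships differ, then compares the error within each group.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate(n, slope, noise_sd=0.1):
    """Draw n points with y = slope * x + Gaussian noise, x uniform on [0, 1]."""
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    ys = [slope * x + random.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_ols(xs, ys):
    """Closed-form simple linear regression; minimizes overall mean squared error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mse(xs, ys, intercept, slope):
    """Mean squared error of the fitted line on one set of points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical groups: 900 majority points (slope 2.0), 100 minority points (slope 0.5).
x_maj, y_maj = simulate(900, 2.0)
x_min, y_min = simulate(100, 0.5)

# One model fit to everyone: the objective only sees the pooled error.
intercept, slope = fit_ols(x_maj + x_min, y_maj + y_min)

mse_majority = mse(x_maj, y_maj, intercept, slope)
mse_minority = mse(x_min, y_min, intercept, slope)
print(f"pooled slope: {slope:.2f}")
print(f"MSE majority: {mse_majority:.3f}, MSE minority: {mse_minority:.3f}")
```

Because the majority group contributes most of the pooled error, the fitted slope lands close to the majority's slope, and the minority group's error ends up much larger even though the overall objective is minimized.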

I could discuss potential remedies, which would be an organic segue into the next sub-section.

Overall, this section will be quite short and serve to motivate the topic of the chapter.

The Seldonian Framework

This section will heavily reference the original Science paper, the supplementary materials, and the website to provide a comprehensive overview of the algorithm, particularly the mathematics and statistics behind it. I'd also like to highlight the distinction between Seldonian and quasi-Seldonian algorithms. I will formally define the notation and walk through the key parts of the algorithm.

The aim is to serve as an introduction to the Seldonian Algorithm in theory.

Toy Example: A Seldonian Regression Algorithm

I'd end this section with the toy example we did in the Jupyter notebook. It will include the code snippets that set up the problem and the results, compared with Ordinary Least Squares Regression, followed by the experimentation results.

The aim is to serve as an introduction to the Seldonian Algorithm in practice.

This will set the stage for the following chapters which will focus on a simulation study and an application to the COMPAS data set (synthesizing the Chapter 1 fairness definitions with the concept of the Seldonian Algorithm discussed in Chapter 2).

Question:

The Statistics thesis should be entirely reproducible. I'm wondering if that means I should include all the relevant code within the body of the thesis?

Let me know if you think I'm missing something, if I need to move some things around, or if a better organizational structure would make more sense!

dashaasienga commented 9 months ago

@katcorr

I'm currently working on the Chapter 2 draft. It's still in progress, but one problem I'm facing is that my ggplot figures get cut off in the PDF.

It seems the boundaries don't appear in the pdf, even when I play around with the sizes of the plot. I can definitely look into this more, but before I spend too much time on it, I thought to ask you first in case you have a quick solution, perhaps from working with Clara or other students in the past :)

Thanks!

katcorr commented 9 months ago

Your outline looks great to me -- makes organizational sense! Regarding your question ("The Statistics thesis should be entirely reproducible. I'm wondering if that means I should include all the relevant code within the body of the thesis?"): no, it will be reproducible as long as all of the code is in the GitHub repo. The code can live in chunks in your thesis file that are not displayed in the thesis PDF; you definitely do not want to print all of the code to the PDF. You can use "echo: false" to keep a chunk's code from printing while still evaluating it, and anyone who needs to reproduce the work can look at the .qmd files.

Hmmm, in terms of the plot, I have not seen it get cut off on the top like that before. But don't spend more time trying to figure it out now -- I can take a look when reviewing later.
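A minimal sketch of what such a chunk could look like, assuming a Python chunk in a Quarto .qmd file (there the fence would be opened with `{python}` rather than `python`). The variable names and numbers are placeholders, not results from the thesis.

```python
#| echo: false
# The Quarto cell option above asks the renderer to run this chunk
# but leave its source code out of the rendered PDF.
# Placeholder values for illustration only.
squared_errors = [0.25, 0.04, 0.09]
mean_squared_error = sum(squared_errors) / len(squared_errors)
print(f"MSE: {mean_squared_error:.3f}")
```

The printed output still appears in the rendered document; only the code is hidden, and it stays available in the .qmd source for anyone reproducing the analysis.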

dashaasienga commented 9 months ago

Sounds good! That makes much more sense!

Also, the problem magically disappeared when I knit the document in RStudio locally instead of on the R HPC server, interesting.

katcorr commented 9 months ago

Oh, glad it disappeared (and interesting re HPC server vs locally . . . will keep this in mind / may need to run by Andy)

dashaasienga commented 9 months ago

Yes, I think running it by Andy if it's a consistent problem would be great, since at some point I will probably need to knit from the server directly!

dashaasienga commented 9 months ago

Also, how should I cite the web page where I got the tutorial from (https://aisafety.cs.umass.edu/index.html)?

They mostly reference the original Science paper and supplementary materials, so I'm wondering if I should use those as the citations for everything I obtain from here.

katcorr commented 9 months ago

I think you can just cite it as a website: https://www.scribbr.com/citing-sources/cite-a-website/#:~:text=Author%20last%20name%2C%20First%20name,%2C%20Day%20Month%20Year%2C%20URL.

dashaasienga commented 9 months ago

@katcorr

I've pushed a version of the Chapter 2 draft that is substantive enough for you to review. This chapter was a bit harder to write since it has a lot more technical and new content. I've tried my best to explain the concepts as clearly as I could, though there is room to iron some things out, so I'm looking forward to your feedback on that! The last section is not complete, but I've left some comments at the bottom regarding that, as well as some general notes for future reference!

Though I'm not sure how familiar he is with using Python on the RStudio HPC server, we will probably need to meet with Andy again to troubleshoot some things, since it's not as straightforward as I had imagined. We can also ask about knitting on the cluster then. Would it make sense to set that up for early next semester?

Let me know once you've had a chance to look at it and add your feedback back to the repo, but there is no rush at all since I will have plenty of time next semester to revise and edit everything!

katcorr commented 9 months ago

Excellent work, Dasha! I've provided some feedback on your latest draft and uploaded the PDF to this folder in your repo. No need to take a look until you're back from break. Yes, we can set up another meeting with Andy at the beginning of spring semester.

Safe travels, and enjoy the break!

dashaasienga commented 9 months ago

Thank you, Professor Correia!

Hope you have a wonderful break as well!

dashaasienga / Statistics-Senior-Honors-Thesis