dashaasienga / Statistics-Senior-Honors-Thesis


article by CS researchers @ UMass #1

Open katcorr opened 10 months ago

katcorr commented 10 months ago

I added this article to the repo. I have not read it, so it may not be useful at all for you / your thesis! But (1) the three authors are all from UMass, so if something does seem relevant/interesting to you, perhaps we can follow up with them; and (2) the one thing I did notice upon scanning is that it refers to three datasets that seem to be publicly available, so maybe those are options for you as well (for down the line . . .)

dashaasienga commented 10 months ago

Thanks for sharing this! I went through it and it's definitely very useful for my thesis! I think it lays a really good framework/foundation for what my thesis could look like in the end. Essentially, they created 2 new loss functions that attempt to solve or mitigate the issue of induced bias. Induced bias refers to a situation where we've dropped the protected attributes, such as race or gender, but left in other attributes that are highly correlated with those protected attributes and can serve as proxies, essentially reintroducing bias into the model. They used simulations to generate synthetic data sets on which they tested their own loss functions. They also introduced bias into real-world data sets to perform comparisons with current state-of-the-art fairness ML methods. They used the AIF360 toolkit I mentioned to you last week, which is open-source, so that should be a great resource when it comes to model comparisons!
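To make the induced-bias idea concrete, here is a toy simulation of my own (not the paper's actual setup; all variable names and parameters are made up). The protected attribute A is dropped before training, but a correlated proxy Z remains, so a logistic regression trained without A still produces very different positive-prediction rates across the two groups:

```python
# Toy illustration of induced bias: dropping the protected attribute A
# does not help when a highly correlated proxy Z stays in the model.
# This is an illustrative sketch, not the paper's simulation design.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Protected attribute A (e.g. group membership); never shown to the model.
A = rng.integers(0, 2, size=n)

# Proxy feature Z: highly correlated with A (think zip code vs. race).
Z = A + rng.normal(0, 0.3, size=n)

# A legitimate feature X, independent of A.
X = rng.normal(0, 1, size=n)

# Outcome depends directly on A (the historical bias) plus X.
logits = 2.0 * A + 1.0 * X - 1.0
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Train WITHOUT the protected attribute -- only X and the proxy Z.
features = np.column_stack([X, Z])
pred = LogisticRegression().fit(features, y).predict(features)

# Positive-prediction rate per group: the gap stays large even though
# A itself was never a feature, because Z carries the same information.
rate_0 = pred[A == 0].mean()
rate_1 = pred[A == 1].mean()
print(f"P(pred=1 | A=0) = {rate_0:.2f}, P(pred=1 | A=1) = {rate_1:.2f}")
```

Running this, the two group rates differ substantially, which is exactly the situation the paper's loss functions are meant to mitigate.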

It was also a great introduction to the notation used in this field and to areas that I could look into further. As for data sets, I'm thinking COMPAS is one I'd love to work with!

They suggested potential future work that would involve relying on realistic simulations of discrimination and testing whether a given learning method is able to recover the non-discriminatory data-generating process. They also suggested testing the proposed methods on more complex non-linear models. These are 2 areas I'd love to discuss with them, as they provide an avenue for something we could build on.

The notation was also a bit heavy for me, so I'm thinking a good place to invest my time over the next couple of weeks is understanding the current fairness metrics and their probabilistic definitions. This will allow me to understand how exactly we can assess the fairness of a method using existing statistical definitions, and it will come in handy down the line when I'm doing my own evaluations.
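As a first example of such a probabilistic definition: statistical (demographic) parity asks that P(Ŷ=1 | A=1) ≈ P(Ŷ=1 | A=0). Here is a minimal sketch of the corresponding metric; the function name is my own, not AIF360's API:

```python
# Statistical parity difference: gap in positive-prediction rates
# between two groups. 0 means the classifier satisfies statistical
# (demographic) parity. Illustrative sketch; not AIF360's own API.
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Return P(pred=1 | group=1) - P(pred=1 | group=0)."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    return y_pred[group == 1].mean() - y_pred[group == 0].mean()

# Equal positive rates in both groups -> difference of 0.0.
print(statistical_parity_difference([1, 0, 1, 0], [0, 0, 1, 1]))  # 0.0

# All positive predictions concentrated in one group -> 1.0.
print(statistical_parity_difference([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
```

Other metrics (equalized odds, equal opportunity) follow the same pattern but condition on the true label as well, which is where the probabilistic notation starts to get heavier.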

Other questions I have for them are:

  1. How do they determine that the outcome is not influenced by the protected attribute, even if in an indirect way?
  2. How did they measure the influence of features?
  3. What was the method of implementation and testing?
katcorr commented 9 months ago

@dashaasienga

Oh! I found a reference to their code: in the last sentence before section 2 Related Works, they state: "Our methods are released publicly via an easy-to-use FaX-AI Python library (https://github.com/social-info-lab/FaX-AI)."

Since that was going to be my main question for them to open our line of communication, I will hold off on sending the email, and see where we're at when we meet tomorrow.

dashaasienga commented 9 months ago

Thanks for catching that @katcorr! Quick question: what is the policy on using Python vs. R for statistics theses?

My guess is that R is probably what is expected, given that it is the language and environment we've been using, in which case I can begin to see which methods have R implementations. I imagine there may be limitations in that regard, though.

katcorr commented 9 months ago

There is no policy on which programming language you use in a statistics thesis. If you feel comfortable with Python, you can work in Python! That being said, I do not use Python, so I won't be able to help as much with Python-specific questions :) But it will be good for me to have some exposure, so don't let that hold you back . . .