compdemocracy / polis

:milky_way: Open Source AI for large scale open ended feedback
https://pol.is
GNU Affero General Public License v3.0

Investigating "fair clustering" #883

Open jucor opened 3 years ago

jucor commented 3 years ago

Problem: As discussed offline with @colinmegill and @metasoarous , under-resourced minorities have very little time to participate, so they are under-represented and/or have to rely on others relaying their voices. How can we take this representation imbalance into account in the clustering?

Suggested solution: I just stumbled upon this article on Socially fair K-means (Ghadiri et al. 2020), suggesting an algorithm to handle such issues. From the abstract:

We show that the popular k-means clustering algorithm (Lloyd's heuristic), used for a variety of scientific data, can result in outcomes that are unfavorable to subgroups of data (e.g., demographic groups). Such biased clusterings can have deleterious implications for human-centric applications such as resource allocation. We present a fair k-means objective and algorithm to choose cluster centers that provide equitable costs for different groups.

It could be worth reading it in detail and experimenting with it on existing datasets.

Alternative suggestions:

metasoarous commented 3 years ago

Thanks! This is a fantastic suggestion @jucor! I'll pull on this thread a bit.

ThenWho commented 3 years ago

Worth looking into, also Fair PCA mentioned in the article you shared. Nice catch Julien! 👌

PaulLessard commented 3 years ago

While you may have simply picked one off the shelf without too much investigation, this paper's:

• algorithm; and

• conception of fairness

are pretty solid, and I think it would indeed be a great idea for Pol.is to move towards using this algorithm in place of the more common k-means optimization algorithms.

Fairness of What?

(This section title is cribbed from "Equality of What?" (https://www.ophi.org.uk/wp-content/uploads/Sen-1979_Equality-of-What.pdf), a famous lecture given by Prof. Amartya Sen laying out the capability approach: https://en.wikipedia.org/wiki/Capability_approach)

There is a large and expanding body of work on fair clustering algorithms and, as one might suspect with anything titled "fairness," it's a sphere of ideological competition. The reason, of course, is that what one means by fairness encodes a strong moral/political/philosophical position; fairness is not a term with any specific consensus meaning. For example, there is the rather intuitive dichotomy of procedural fairness vs. outcome fairness. The interpretation of the term becomes even more contentious when one explicitly brings in considerations of race and gender.

Uniform Clustering Cost Across Categories

The machinery of fairness in the paper you cite is, in essence, machinery for a notion of outcome fairness across (usually demographic) categories. It can be applied to race, gender, and intersectional equity by careful selection of those categories. It'll be easier to be clear about this after going over precisely what the optimized functions for k-means and fair k-means are, and how they relate. (Forgive the retreading of technical expressions with which the reader may well be familiar; it is necessary to establish notation and to make the comparison clear.)

k-means

Given:

• a subset $X$ of points in an ambient space $M$ with a distance function $d$;

• a subset $C = \{c_1, \dots, c_k\}$ of centers; and

• a partition $X = X_1 \sqcup \cdots \sqcup X_k$ of $X$ (meaning: $X_i \cap X_j = \emptyset$ if $i \neq j$; and $\bigcup_{i=1}^{k} X_i = X$),

we set

$$\Delta(C, X_1, \dots, X_k) = \sum_{i=1}^{k} \sum_{x \in X_i} d(x, c_i)^2.$$

The partition is of course the clustering, $X_i$ being the set of points considered to be clustered around $c_i$, and $\Delta$ is to be interpreted as the cost of a particular clustering. The usual k-means algorithms optimize $\Delta$ over the space of partitions.
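For concreteness, here is a minimal numpy sketch of this cost; the names and the assignment-array representation of the partition are illustrative, not anything from the polis codebase:

```python
import numpy as np

def kmeans_cost(X, centers, assignment):
    """Delta above: the sum over clusters of squared distances from each
    point to the center it is assigned to.

    X          -- (n, d) array of points
    centers    -- (k, d) array of cluster centers c_1, ..., c_k
    assignment -- (n,) array of cluster indices encoding the partition
    """
    diffs = X - centers[assignment]        # each point minus its own center
    return float(np.sum(diffs ** 2))
```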

Concentration of Cost

It's important to note that optimizing $\Delta$ is equivalent to optimizing the mean square-of-the-distance between points and their assigned centers. Herein lies the weakness:

• a priori we have no control over the concentration of the cost across $X$; and

• it can safely be said (i.e. this is a well-known experimental phenomenon) that most of the time, if $X$ is made of people, this abstract cost will be concentrated on women and people of color (the toy example below shows how the mean can hide this).
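A toy example of that concentration, just to fix ideas (the numbers are made up): suppose $X$ has 10 people, 8 in a majority group each at squared distance 1 from their assigned center, and 2 in a minority group each at squared distance 9. The overall mean cost is

$$\frac{8 \cdot 1 + 2 \cdot 9}{10} = 2.6,$$

which looks perfectly acceptable, while the minority group's mean cost is 9. Optimizing the overall mean alone never sees that gap.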

Fair k-means

Fair k-means operates not on data points alone, but on data points together with a (usually demographic) partition, and optimizes the maximum of the mean square-of-the-distances across those demographic categories, using the same centers, in the following sense.

As above let:

• $X$ be a subset of points in an ambient space $M$ with a distance function $d$;

• let $C = \{c_1, \dots, c_k\}$ be a subset of centers;

• let $X = X_1 \sqcup \cdots \sqcup X_k$ be a partition of $X$;

and:

• let $X = A_1 \sqcup \cdots \sqcup A_m$ be another partition of $X$.

Fair k-means optimizes

$$\Phi(C, X_1, \dots, X_k) = \max_{1 \le j \le m} \; \frac{1}{|A_j|} \sum_{i=1}^{k} \sum_{x \in X_i \cap A_j} d(x, c_i)^2$$

over the space of partitions, where by $X_i \cap A_j$ we mean the partition of $A_j$ got by index-wise intersection. Note that in each term of the maximum the sum is effectively over $A_j$ and not over all of $X$.
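A numpy sketch of evaluating this objective (evaluation only; the Fair-Lloyd center-update step from the paper is more involved and is not reproduced here, and the array names are illustrative):

```python
import numpy as np

def fair_kmeans_cost(X, centers, assignment, groups):
    """Phi above: the maximum over demographic groups A_j of that group's
    mean squared distance to the assigned centers.

    X          -- (n, d) array of points
    centers    -- (k, d) array of centers
    assignment -- (n,) cluster index per point (the partition X_1, ..., X_k)
    groups     -- (n,) demographic label per point (the partition A_1, ..., A_m)
    """
    sq_dists = np.sum((X - centers[assignment]) ** 2, axis=1)
    per_group = {g: float(sq_dists[groups == g].mean()) for g in np.unique(groups)}
    return max(per_group.values()), per_group
```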

Fair k-Means and Equity

The utility of fair k-means towards the ends of race and gender equity is of course got by careful selection of the partition $X = A_1 \sqcup \cdots \sqcup A_m$. I think that this is great! The algorithm is entirely agnostic about the cause of inequity; we simply put in, explicitly, as data, the demographic categories across which the clustering must be as equitable as possible.

For Pol.is: Picking Demographic Categories and Intersectionality etc.

For Pol.is this does require that one has this demographic information for all participants and, moreover, that the categories are carefully chosen. Consider, for example, the phenomenon called intersectionality: the reality that, say, Black women do not merely experience the structural injustices and prejudices directed at Black persons plus those directed at women, but that the structural injustices and prejudices experienced by Black women are in fact specific to being both. As such, it probably isn't enough in many cases merely to put in place an exclusively racial demographic categorization. One needs to know quite a bit about the context! I therefore think the instantiation of fair k-means as part of Pol.is requires developing a process for arriving at an appropriate demographic categorization for each deployment.
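As a purely illustrative sketch of what an intersectional categorization might look like as data (the column names and categories are hypothetical, and any real deployment would need far more care about which categories to collect and how):

```python
import pandas as pd

# Hypothetical participant metadata for a single deployment.
meta = pd.DataFrame({
    "participant": [0, 1, 2, 3, 4],
    "race":        ["Black", "Black", "white", "Asian", "white"],
    "gender":      ["woman", "man", "woman", "woman", "man"],
})

# An intersectional partition: one category per observed (race, gender) pair,
# rather than a race-only or gender-only partition.
meta["group"] = meta["race"] + " / " + meta["gender"]
partition = meta.groupby("group")["participant"].apply(list).to_dict()
# e.g. {"Black / woman": [0], "Black / man": [1], "white / woman": [2], ...}
```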

metasoarous commented 3 years ago

Thanks for your detailed summary and thoughts on this @PaulLessard!

Given that this algorithm requires an explicit demographic partition, we need better UI for obtaining this information before we can adopt this methodology. Having people answer a dozen questions like "I identify as Black", "I identify as Asian", etc., as we currently do for participant metadata, is not tenable if the information is going to be this central to the algorithm. In addition, there will probably also need to be more nuanced control on the part of admins/moderators to decide whether this information is required, whether these questions should be asked upfront or randomly, etc. It also (I think) nods towards a rethink of how we approach metadata more generally, which is probably a good thing for the software overall. However, all of this amounts to a fair bit of UI work.

As pointed out above, demographic breakdown is going to be highly context dependent, and so we're going to also need quite a bit of thought put into how we guide admins/moderators towards making these decisions ethically.

Final thoughts on the alg itself: In their experimental evaluations, the authors only investigated 3 different (fairly large) datasets, with a max of 5 demographic categories. I think before we adopted something like this we'd want to do much more investigation into how the algorithm handles a) more fine-grained (intersectional) demographic splits, and b) smaller datasets, where demographic representation is more sparse or skewed. Can we find any weird edge cases, and is there anything we can do about them?

These methods also have me thinking about whether the same strategies could be applied to some of the other methodologies we've considered for more in-depth analysis, such as UMAP and Leiden. Fair-PCA is also mentioned in this article, and while it seems that the effect of applying Fair-PCA prior to Fair-Lloyd may be pretty minimal relative to Fair-Lloyd by itself, applying vanilla PCA as a pre-processing step may increase the cost. And in these cases they're applying PCA to reduce the dataset to K dimensions; we always project to 2D before clustering (in order to avoid clusters overlapping each other in the viz, which maybe we shouldn't do if we care more about the groupings than the viz), so there's reason to think that the increased cost associated with PCA as a pre-processing step could be even higher in our setting.
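For reference, a much-simplified sketch of the "project to 2D, then cluster" shape of that pipeline, using sklearn and mean-imputation for missing votes; this is an illustration of the idea only, not the actual polis implementation:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def project_and_cluster(votes, k):
    """votes: (participants x comments) matrix of +1/-1 with NaN for missing.
    Returns 2-D coordinates per participant and their cluster labels."""
    filled = SimpleImputer(strategy="mean").fit_transform(votes)
    coords = PCA(n_components=2).fit_transform(filled)   # project to 2D
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    return coords, labels
```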

Thanks again @PaulLessard for digging into this, and again to @jucor for pointing out this method!

PaulLessard commented 3 years ago

Thanks @metasoarous !

I think it would be very interesting and worthwhile to look at the dimension-reduction methods in UMAP and see if they can be replaced with a demographically equitable variant; I suspect the analogous issue, demographic concentration of the dimension-reduction cost, pops up at that stage of UMAP. I haven't found anything in the literature on this yet, though my search has been cursory. I suspect it would be a real contribution to introduce some important ethical and political principles into one of the more mathematically principled methods.

JacobCWBillings commented 3 years ago

Just jumping into the conversation:

To clarify: You want to cluster over participants wrt their votes for/against a set of comments, correct? But you are finding that outlier opinions are becoming obscured?

Definitely, the preference for maximizing observed variance (i.e. vote counts) under PCA---coupled with looking for exactly "k" clusters---can minimize the recognition of outlier opinions.

Question: In your UMAP approach, are you bootstrapping the low-d representation? Since the input graph is sparse, the missing edges may assume highly variable values once nodes are (stochastically) embedded onto a Euclidean space. Participants (nodes) whose embedded neighborhood changes a lot with each iteration could be asked follow-up questions to help fix their position in the embedding.
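A rough sketch of one way to quantify that stability: re-embedding with different random seeds (rather than a true bootstrap over participants) and aligning runs with Procrustes, since UMAP embeddings are only defined up to rotation/reflection/scale. It assumes umap-learn and scipy are available and is only meant to illustrate the idea:

```python
import numpy as np
import umap                          # umap-learn
from scipy.spatial import procrustes

def embedding_stability(X, n_runs=20):
    """Embed X several times, align the runs, and return each point's
    positional spread; high values flag participants whose neighborhood
    changes a lot from run to run."""
    embeddings = [
        umap.UMAP(n_components=2, random_state=seed).fit_transform(X)
        for seed in range(n_runs)
    ]
    ref = embeddings[0]
    aligned = [procrustes(ref, emb)[1] for emb in embeddings]
    runs = np.stack(aligned)                     # (n_runs, n_points, 2)
    return runs.std(axis=0).sum(axis=1)          # per-participant spread
```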

Re clustering: my own work studying dynamical brain networks tends to segment 2d point clouds (e.g. embedded connectivity maps) using a watershed transformation. This method can be configured to provide outlier nodes with their own clusters.
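For comparison, a sketch of the watershed idea on a generic 2-D point cloud (scipy/scikit-image assumed; the bin count and smoothing are placeholder parameters, and this is not the exact procedure from the brain-network work):

```python
import numpy as np
from scipy import ndimage
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def watershed_clusters(points, bins=64, sigma=2.0):
    """Segment a 2-D point cloud by flooding basins of its smoothed density;
    isolated bumps (outliers) can end up with their own labels."""
    H, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    density = ndimage.gaussian_filter(H, sigma=sigma)

    # Local density maxima seed the basins.
    peaks = peak_local_max(density, min_distance=3)
    markers = np.zeros_like(density, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    basins = watershed(-density, markers)

    # Map each point back to the label of the histogram bin it falls in.
    xi = np.clip(np.digitize(points[:, 0], xedges) - 1, 0, bins - 1)
    yi = np.clip(np.digitize(points[:, 1], yedges) - 1, 0, bins - 1)
    return basins[xi, yi]
```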

metasoarous commented 3 years ago

Hi @JacobCWBillings. Thanks for sharing your thoughts on this!

This issue isn't so much about preserving outlier opinions as it is about finding ways of ensuring that minority demographics are well represented by the groups they are assigned to. Thus, the modified cost/fitness function takes into account not just the typical k-means criterion of average distance within each cluster to its center of mass, but how these values break down by demographic group.

As @jucor pointed out, this is potentially a valuable approach to dealing with situations where a particular demographic group is underrepresented (either relative to their actual proportion of the population, or within the population as a whole). There are some nice illustrations in the paper that show how this can lead to clusterings different from what k-means might produce, which I think give a good intuition of what the method is after. I hope this clarifies things a bit.

All that having been said, I'm keen to hear more about your thoughts on UMAP. We aren't currently using UMAP as part of the core algorithm, but we've been experimenting with it a bit in our post-conversation analyses. There are some challenges here with respect to how skewed the number of votes per participant can be, and lots of decisions about how best to create a graph out of the vote data that is amenable to a UMAP analysis. One approach is to use KNN, with some custom distance metric that tries to account for how many votes participants have in common, but there are other ways to go here as well. There are some (possibly difficult to read) notes from a discussion of this in #1163, if you want to take a look (and we've had some further conversations about this since then, from which I might be able to dredge up some additional thoughts). All of this is a bit tangential to the "fair clustering" issue, but please feel free to elaborate on your thoughts regarding UMAP in a new discussion.

Thanks again!

jucor commented 3 years ago

:+1: to Chris's answer highlighting the difference between outlier and socio-demographic minority. Adding to this, to show the intricacy of the matter:

On UMAP/tSNE, impressive as they can be, there is currently a lot of debate in the scientific literature about the validity of the insights they bring. See https://www.nature.com/articles/s41467-019-13056-x (Kobak and Berens, 2019) and the follow-up https://www.nature.com/articles/s41587-020-00809-z (Kobak and Linderman, 2021). Basically: they can be great, but it's also very easy to shoot oneself in the foot.

Hence why so far pol.is has stuck to "tried and true" methods like PCA: not the most cutting-edge, and possibly not as insightful, but at least their failure cases are somewhat well understood. Given the responsibilities of the tool, we take a "minimax" approach of sorts :)


JacobCWBillings commented 3 years ago

Sorry. I'm not getting the context here:

"This issue isn't so much about preserving outlier opinions as it is about finding ways of ensuring that minority demographics are being well represented by the groups they are assigned to."

What are the "groups"?

colinmegill commented 3 years ago

Knowledge base refs:

JacobCWBillings commented 3 years ago

Right.

You know, it's interesting that much of this discussion centers around the notion of fair representation. To me, this means that the analysis should discover patterns that best reflect the inherent structure of the data.

Regarding clustering, if the data do not exist as clusters (especially as exactly k clusters), then pushing the data to assume such a format imposes implicit biases that could be problematic. (E.g., when and how does it make sense to cluster points on a noiseless sphere? Is the inability to form a k+1 minority group a kind of gerrymandering?)

As an alternative to thinking about a static clustering: in my day job I've started working with the explicit structures that emerge in graphs. This approach works rather well for the current dataset, the Polis Opinion Matrix (POM), as elements of the matrix are topological primitives---i.e., signed edges connecting participants and comments.

Since we are only interested in how participants connect to one another, we can look at the topology of the nodal set of participants, and throw in an edge between participants if they agree about a particular comment.

Regarding clustering: clusters are then seen as the explicit occurrence of n-dimensional cliques---i.e., as sets of fully connected nodes. Minority groups are then cliques that are relatively disconnected from the majority clique complex. (A small sketch of the construction follows below.)
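For illustration, a minimal networkx sketch of that agreement-graph construction; the +1/-1/missing encoding of the vote matrix is an assumption on my part:

```python
import numpy as np
import networkx as nx
from itertools import combinations

def agreement_graph(votes):
    """votes: (participants x comments) matrix with +1 agree, -1 disagree,
    0 or NaN missing.  Adds an edge between every pair of participants who
    agreed on the same comment."""
    n_participants, n_comments = votes.shape
    G = nx.Graph()
    G.add_nodes_from(range(n_participants))
    for c in range(n_comments):
        agreed = np.flatnonzero(votes[:, c] == 1)
        G.add_edges_from(combinations(map(int, agreed), 2))
    return G

# Maximal cliques as candidate "clusters"; small cliques with few edges to the
# rest of the graph are candidate minority groups.
# cliques = list(nx.find_cliques(agreement_graph(votes)))
```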

Regarding embeddings: UMAP appears to be a well-motivated choice for POM sparse graphs. This is because UMAP seeks to compute a topology over the dataset, and then to preserve that topology in the low-dimensional space. You can actually go in and directly input POM's topological representation into UMAP. No guarantees about preserving completely disconnected groups, though, as the process of embedding imposes the bias that some distance exists between nodes. (And some completely disconnected group 'A' will still embed next to other groups 'B' and 'C' with equal probability.)

Back to the idea of fairness. It's difficult to assume distances between participants that are not present in the POM. So, perhaps it is more useful to stick to the representation of clique complexes. However, the POM is sparse because of undersampling, not because people don't have even marginal opinions about some comments. That being said, could populating the POM include some kind of vote delegation process? This might make more sense if the data generation process included something like participant vote weights. (For example, Alice might delegate 0.04% of her total votes to match Bob's votes.) The people at democracy.earth make use of weighting and delegation, though they haven't started into visualizations (https://github.com/DemocracyEarth/paper#33_Universal_Basic_Income).

metasoarous commented 3 years ago

@jucor You're absolutely right about the anonymity concerns here. It's also entirely possible that some groups being more concerned about this information than others could skew participation. This might be mitigated by explaining how/why this data is used (to ensure marginalized populations are well represented in the resulting opinion groups), but this comes with some cost as well. Much to consider here...

As for adversarial biasing, it does seem that certain ML methods can become more susceptible to such attacks with "fair" methods. A quick literature scan pulled this up: https://arxiv.org/pdf/2006.08669.pdf (I think it's looking at NN models, but I can imagine how the results might generalize).

And yes, I have been following the debate around UMAP & tSNE; @colinmegill pointed me to this thread/paper which I found interesting: https://twitter.com/lpachter/status/1431326007303102466?s=19 (to be fair, the rebuttal: https://twitter.com/tim_sainburg/status/1431497387504144394?s=20). In short, a hard agree here that for the moment, it's a good idea to continue using simpler methods (PCA + K-means) in the core algorithm. It's much easier to justify using fancier methods like UMAP or Leiden in "post analysis", where a data-scientist can go over the results thoughtfully (as Sainburg points out in defense of the methods).

@JacobCWBillings I'm happy to continue a conversation about how best to apply UMAP or other approaches to clustering, but I really think these are topics for another discussion (which I've kickstarted here).

What I will respond to here, since it directly relates to the issue at hand and touches on a concern you bring up, is the selection of K for K-means: we don't choose K willy-nilly, but rather use the Silhouette Coefficient (a common technique for this task). I actually meant to bring this up in my earlier post, because I think if we did apply something like Fair K-means, it may be appropriate to apply a modified version of the silhouette coefficient as well.
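Roughly, selecting K by silhouette looks like the following (a generic sklearn sketch of the technique, not the actual polis implementation, and the candidate K range is just an example):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(points, k_range=range(2, 6)):
    """Fit K-means for each candidate K and keep the K whose clustering
    has the highest silhouette coefficient."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
        scores[k] = silhouette_score(points, labels)
    return max(scores, key=scores.get)
```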

Now that our inaugural paper is out, I'd recommend taking a look to get a better sense of what we're currently doing.

Thanks for your thoughts folks!

JacobCWBillings commented 2 years ago

Reading the paper: it and the online documentation seem consistent. And so, my position remains much the same.

I applaud what you all have so far accomplished, but I am concerned about what is lost when trying to, metaphorically, "paint reality in black and white."

But this is a problem of pattern recognition broadly. Patterns, by definition, blur the details of a system to resolve some subset of relatively consistent observations. And because 'pattern recognition functions' blur details in algorithmically predictable ways (see arxiv:1806.01261), I would encourage future work to provide users with many 'pattern recognition functions' through which they can interact with the data, alongside some explication of what may be lost or gained using each technique.

jucor commented 1 year ago

An update: a similar subtopic I raised in https://github.com/compdemocracy/polis/issues/883#issuecomment-932996072, namely that large studies can fail to be representative of a minority in part due to a lack of information about participants, is mentioned in this recent letter to Nature co-authored by my friend @datapoet .