initial exploration with kmeans

@JochemTolsma Many thanks for taking the time to review my notebooks. I appreciate it!

I am also glad that we have started using GitHub collaboratively. Just a heads-up: the main branch is now protected, meaning that any modification to main needs approval. So whenever we want to add to or modify main, the work should happen in a separate branch, where we can experiment and discuss (this one, for example, is the branch kmeans), and anything that "survives" our discussions and revisions will go to main. You can create as many branches as you want and link them to issues (this issue, for example, is called initial exploration with kmeans). Please let me know if this is unclear; I'd be happy to go through how to use GitHub collaboratively with you.

Please see below an overview of the approaches we have used in the notebook, regarding kmeans clustering. Note that this is my opinion, and of course ultimately it is your decision what's best to use.

cheers Eva

I saw your main comment and the related change in the R Quarto notebook notebook_1 (I also see that you prefer R Markdown, so I'm happy to switch to this type of notebook):

> please do not do this. this way you are removing interesting information from our dataset. FYI there is quite some debate on whether you want to construct segregation measures based on proportions or on numbers

Of course, determining "correctness" in the context of measuring ethnic segregation using k-means clustering would largely depend on your specific research question and the underlying assumptions you're comfortable making about the data. There is no right/wrong approach generally. That being said, we can discuss the potential implications of both approaches, given the nature of your data.

This is the raw distribution of our ned and n_ned variables:

[figure: distributions]

And this is the representation of the correlation between the two.

[figure: correlation]

By inspecting the distributions and their correlation, we can see that the two variables 1) have different scales and 2) are very strongly correlated with each other. I understand that we agree on this. Great :+1:

Investigate ethnic segregation using `delta`: `(ned - n_ned) / (ned + n_ned)`

When two variables are strongly correlated, the interesting signal lies in their difference (i.e., the part that deviates from the shared trend); this is the reasoning behind computing delta.
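As a minimal sketch of the formula in the heading (Python is used here purely for illustration; the counts are made up and the function name `delta` is only a label for this example, not code from the notebook):

```python
def delta(ned, n_ned):
    """Normalised difference between Dutch and non-Dutch counts for one grid cell."""
    total = ned + n_ned
    if total == 0:
        return None  # empty cell: delta is undefined
    return (ned - n_ned) / total


# Made-up (ned, n_ned) counts for three hypothetical grid cells.
cells = [(900, 100), (50, 50), (10, 90)]
deltas = [delta(ned, n_ned) for ned, n_ned in cells]
print(deltas)  # [0.8, 0.0, -0.8]
```

Delta is bounded between -1 (only non-Dutch inhabitants) and +1 (only Dutch inhabitants), with 0 meaning equal numbers, so cells of very different size end up on the same scale.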

The plot immediately below shows the clustering output over this variable:

[figure: kmeans_delta]

The red line divides the space in two. Points lying on the diagonal are cells with equal numbers of Dutch and non-Dutch inhabitants; points above the diagonal are cells with more non-Dutch inhabitants, and points below the diagonal are cells with more Dutch inhabitants.

By examining the figure we can see that the clustering has grouped together (i) cells that are predominantly Dutch, (ii) cells with similar proportions of Dutch and non-Dutch inhabitants and (iii) cells that are predominantly non-Dutch. This could be useful if your primary concern is the relative concentration of Dutch over non-Dutch inhabitants in various geographic regions. However, a limitation of this approach is that it does not account for social or economic aspects of segregation (since we don't have those variables in the data), so the results should be interpreted in light of this limitation.

Clustering over `ned` and `n_ned` directly

This approach uses `ned` and `n_ned` as they are (the raw numbers of Dutch and non-Dutch inhabitants in a specific grid cell). This means that the clustering works on non-scaled variables whose distributions are highly correlated. As a result, the kmeans algorithm ends up clustering on differences in population density, as depicted in the figure below:

[figure: kmeans_raw_numbers]

We can see immediately that the clustering segments the data with boundaries that cut across the diagonal, meaning that the information it picks up is high, medium, or low population density. This approach could be more appropriate if your concern is how densely populated each cell is, regardless of whether its inhabitants are Dutch or non-Dutch.
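To make the density effect concrete, here is a small sketch with made-up counts (the three cells and their names are hypothetical, not taken from our data). kmeans assigns points by Euclidean distance, so comparing distances on the raw counts shows which cells it would consider similar:

```python
import math

# Made-up (ned, n_ned) counts for three hypothetical grid cells.
big_dutch = (9000, 1000)       # 90% Dutch, densely populated
small_dutch = (900, 100)       # 90% Dutch, sparsely populated
small_non_dutch = (100, 900)   # 10% Dutch, sparsely populated

# Same composition, different population size.
same_mix = math.dist(big_dutch, small_dutch)
# Same population size, opposite composition.
same_size = math.dist(small_dutch, small_non_dutch)

print(same_size < same_mix)  # True
```

On raw counts the two small cells are closer to each other than the two cells that share the same 90/10 composition, which is why the clusters follow population size rather than ethnic composition.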

Conclusion

In conclusion, the "most correct" approach depends on your specific research goals and assumptions about the nature of segregation, but also on the type of data that you have. In your case, the Dutch variable is much larger than the non-Dutch one, and the two variables are extremely correlated with each other. This justifies using the difference (delta) and normalising it by the total. The next section clarifies the normalisation part.

Extra: Why should you not cluster over non-normalised variables?

Why is normalising the variable before running the clustering necessary in our case? Normalisation helps the clustering focus on the difference between Dutch and non-Dutch, instead of being driven by the difference in scale between the two [1]. Since the ranges of `ned` and `n_ned` are very different, the clustering result might be influenced more by the variable with the larger range. So it is important to normalise your data before applying the k-means algorithm.

As a proof of concept, I report below the output of kmeans on a non-normalised delta (i.e., I do not divide by the total). This is the result:

[figure: kmeans_delta_non-normalised]

As you can see, the imbalance between the two variables drives the clustering: the very large data points (i.e., those in the rightmost part of the plot) dominate the result and force the rest to be clustered together.
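The same effect can be reproduced on a toy example. The sketch below is not the notebook code: it uses a hand-written one-dimensional version of Lloyd's algorithm (the helper `kmeans_1d` and the three cells are hypothetical) to compare clustering on the non-normalised difference with clustering on the normalised delta:

```python
def kmeans_1d(values, k=2, n_iter=20):
    """Plain Lloyd's algorithm in one dimension (k >= 2); returns one label per value."""
    lo, hi = min(values), max(values)
    # Deterministic start: centroids spread evenly between the smallest and largest value.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


# Made-up (ned, n_ned) counts: large 90% Dutch, small 90% Dutch, small 10% Dutch.
cells = [(9000, 1000), (90, 10), (10, 90)]
raw_delta = [ned - n_ned for ned, n_ned in cells]
norm_delta = [(ned - n_ned) / (ned + n_ned) for ned, n_ned in cells]

print(kmeans_1d(raw_delta))   # [1, 0, 0]
print(kmeans_1d(norm_delta))  # [1, 1, 0]
```

Without normalisation the single large cell gets a cluster of its own and the two small cells are lumped together despite their opposite composition; after dividing by the total, the two 90% Dutch cells share a cluster.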

References

[1] : Is it important to scale data before clustering?
