cimentadaj / ml_socsci

A work-in-progress of the notes/book 'Machine Learning for Social Science'
https://cimentadaj.github.io/ml_socsci/

Finalize the k-means clustering example from scratch #15

Open cimentadaj opened 4 years ago

cimentadaj commented 4 years ago

Due to lack of time, you decided to use the actual plots from Gareth et al. rather than implementing k-means from scratch. Here's the code you wrote so far:

library(tidymodels)
library(tidyflow)
library(ggfortify)

data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)

pisa <-
  pisa %>%
  rename_with(tolower)

p1 <-
  pisa %>% 
  ggplot(aes(escs, bsmj)) +
  geom_point(alpha = 1/6) +
  scale_x_continuous("Index of economic, social and cultural status of family") +
  scale_y_continuous("Students expected occupational status") +
  theme_minimal()

set.seed(5231)
pisa$random_clust <- as.character(sample(1:3, nrow(pisa), replace = TRUE))

p2 <-
  p1 +
  geom_point(aes(color = random_clust), size = 2, alpha = 1/6) +
  theme(legend.position = "none")

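centroid_dt <-
  pisa %>%
  group_by(random_clust) %>%
  # Mean of each variable per cluster; na.rm = TRUE guards against missing values
  summarize(across(c(escs, bsmj), ~ mean(.x, na.rm = TRUE)))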

p3 <-
  p2 +
  geom_point(data = centroid_dt, aes(color = random_clust),
             size = 8,
             alpha = 0.9)

The next step is to calculate the Euclidean distance from every point to each centroid, reassign each point to its nearest centroid, recompute the centroids, and iterate until the assignments no longer change.
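
A minimal sketch of that remaining step, in base R only. This is one possible way to finish the example, not the final implementation: `kmeans_scratch()`, its arguments, and the simulated `toy` data are all hypothetical names introduced here for illustration. It assumes a numeric matrix with no missing values and re-seeds an empty cluster with a random observation, which is a simplification.

```r
# Hypothetical helper (not part of the code above): a bare-bones k-means loop
# that starts from a random cluster assignment, like `random_clust`.
kmeans_scratch <- function(x, k = 3, max_iter = 100) {
  x <- as.matrix(x)
  stopifnot(is.numeric(x), !anyNA(x), nrow(x) > 1, nrow(x) >= k)

  # Centroid of each cluster = column means of its members.
  # Assumption: an empty cluster is re-seeded with a random observation.
  centroids_of <- function(clust) {
    do.call(rbind, lapply(seq_len(k), function(j) {
      members <- x[clust == j, , drop = FALSE]
      if (nrow(members) == 0) x[sample(nrow(x), 1), ] else colMeans(members)
    }))
  }

  # Random initial assignment
  clust <- sample(seq_len(k), nrow(x), replace = TRUE)

  for (iter in seq_len(max_iter)) {
    centroids <- centroids_of(clust)

    # Squared Euclidean distance from every point to every centroid
    # (n x k matrix; squaring does not change which centroid is nearest)
    dists <- vapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    }, numeric(nrow(x)))

    # Reassign each point to its nearest centroid
    new_clust <- max.col(-dists, ties.method = "first")

    # Stop once the assignments no longer change
    if (all(new_clust == clust)) break
    clust <- new_clust
  }

  list(cluster = clust, centroids = centroids_of(clust), iterations = iter)
}

# Simulated stand-in for the two PISA variables (three separated blobs)
set.seed(5231)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)
colnames(toy) <- c("escs", "bsmj")

fit <- kmeans_scratch(toy, k = 3)
table(fit$cluster)

# On the PISA data this might look like (untested here):
# pisa_xy <- na.omit(pisa[, c("escs", "bsmj")])
# fit <- kmeans_scratch(scale(pisa_xy), k = 3)
```

Because escs and bsmj are measured on different scales, standardizing them first (for example with `scale()`) keeps one variable from dominating the Euclidean distance. The result can be sanity-checked against `stats::kmeans()`, keeping in mind that different starting assignments can lead to different local solutions.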