Open cimentadaj opened 4 years ago
Due to lack of time, you decided to use the actual plots from Gareth et al. rather than implementing k-means from scratch. Here's the code you wrote so far:
library(tidymodels)
library(tidyflow)
library(ggfortify)

data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)
pisa <- pisa %>% rename_with(tolower)

# Base scatterplot of the two variables to cluster on
p1 <-
  pisa %>%
  ggplot(aes(escs, bsmj)) +
  geom_point(alpha = 1/6) +
  scale_x_continuous("Index of economic, social and cultural status of family") +
  scale_y_continuous("Students expected occupational status") +
  theme_minimal()

# Step 1: assign every point to one of three clusters at random
set.seed(5231)
pisa$random_clust <- as.character(sample(1:3, nrow(pisa), replace = TRUE))

p2 <-
  p1 +
  geom_point(aes(color = random_clust), size = 2, alpha = 1/6) +
  theme(legend.position = "none")

# Step 2: compute the centroid (mean of escs and bsmj) of each random cluster
centroid_dt <-
  pisa %>%
  group_by(random_clust) %>%
  summarize_at(vars(escs, bsmj), mean)

# Overlay the centroids as large points
p3 <-
  p2 +
  geom_point(data = centroid_dt, aes(color = random_clust), size = 8, alpha = 0.9)
The next step is to calculate the Euclidean distance from every point to each centroid, reassign each point to its nearest centroid, recompute the centroids, and iterate until the assignments no longer change.
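A minimal sketch of that iteration in base R, in case it helps pick this up later. The helper name `kmeans_from_assignment` and its `max_iter` argument are hypothetical (they are not part of the code above), and the sketch assumes the input has no missing values. It compares squared Euclidean distances, which yields the same nearest centroid as the unsquared distance.

```r
# Hypothetical helper (not from the original code): iterate k-means
# starting from an initial cluster assignment.
#   x:        numeric matrix or data frame, one row per point, no NAs (assumed)
#   clusters: vector of initial cluster labels, one per row of x
kmeans_from_assignment <- function(x, clusters, max_iter = 100) {
  x <- as.matrix(x)
  labels <- sort(unique(clusters))

  for (iter in seq_len(max_iter)) {
    # This sketch does not handle empty clusters; fail loudly instead
    if (!all(labels %in% clusters)) stop("a cluster became empty")

    # Centroid of each cluster: column means of its member points (k x p)
    centroids <- do.call(rbind, lapply(labels, function(cl) {
      colMeans(x[clusters == cl, , drop = FALSE])
    }))

    # Squared Euclidean distance from every point to every centroid (n x k)
    dists <- sapply(seq_len(nrow(centroids)), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    })

    # Reassign each point to its nearest centroid
    new_clusters <- labels[max.col(-dists, ties.method = "first")]

    # Stop once no point changes cluster
    if (all(new_clusters == clusters)) break
    clusters <- new_clusters
  }

  list(clusters = clusters, centroids = centroids, iterations = iter)
}

# Toy usage on simulated data: two blobs, random starting assignment
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)
init <- sample(1:2, nrow(toy), replace = TRUE)
fit <- kmeans_from_assignment(toy, init)
table(fit$clusters)
```

Applied to the data above, this would presumably be called as `kmeans_from_assignment(pisa[, c("escs", "bsmj")], pisa$random_clust)`, with the returned `clusters` and `centroids` plugged into the same plotting code as `random_clust` and `centroid_dt`.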