UBC-DSCI / introduction-to-datascience

Open Source Textbook for DSCI100: Introduction to Data Science in R
https://datasciencebook.ca/

Tidymodels clustering! #454

Closed trevorcampbell closed 1 year ago

trevorcampbell commented 1 year ago

@chendaniely found this and mentioned it to me -- copying the thread here. I'm very much in favour of testing this out thoroughly and possibly replacing our clustering material with this to make the book consistent / cleaner.

We can look into doing everything within tidymodels now: https://www.tidyverse.org/blog/2022/12/tidyclust-0-1-0/

This would cover the clustering slides, worksheet, and tutorial.
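As a rough picture of what the simplest version of the course material might look like after such a switch, here is a minimal sketch. It assumes the default "stats" engine (base R's `kmeans`) rather than ClusterR, and uses the built-in `iris` data purely as a stand-in; the object names are hypothetical and not taken from the book.

```r
library(tidymodels)
library(tidyclust)

set.seed(1)  # K-means starts from random centres, so fix the seed

# Standardize the four measurements so no single variable dominates
# the distance calculation. (iris is a stand-in dataset for illustration.)
iris_recipe <- recipe(
  ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris
) %>%
  step_normalize(all_numeric_predictors())

# K-means specification; "stats" is assumed here instead of "ClusterR"
iris_spec <- k_means(num_clusters = 3) %>%
  set_engine("stats")

# Bundle preprocessing and model, then fit
iris_fit <- workflow(iris_recipe, iris_spec) %>%
  fit(data = iris)

# One cluster label per training row
iris_clusters <- extract_cluster_assignment(iris_fit)
```

Whether this reads more cleanly than the existing material is exactly what would need testing.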

Example code from the post:

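library(tidymodels)
library(tidyclust)

# K-means specification with 4 clusters; this engine needs the ClusterR package installed
kmeans_spec <- k_means(num_clusters = 4) %>%
  set_engine("ClusterR")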
kmeans_spec
#> K Means Cluster Specification (partition)
#> 
#> Main Arguments:
#>   num_clusters = 4
#> 
#> Computational engine: ClusterR

data("ames", package = "modeldata")

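rec_spec <- recipe(~ ., data = ames) %>%
  # convert categorical columns to indicator columns
  step_dummy(all_nominal_predictors()) %>%
  # drop columns that contain a single value
  step_zv(all_predictors()) %>%
  # center and scale the numeric columns
  step_normalize(all_numeric_predictors()) %>%
  # keep enough principal components to cover 80% of the variance
  step_pca(all_numeric_predictors(), threshold = 0.8)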

kmeans_wf <- workflow(rec_spec, kmeans_spec)

kmeans_fit <- fit(kmeans_wf, data = ames)
kmeans_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: k_means()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#> 
#> • step_dummy()
#> • step_zv()
#> • step_normalize()
#> • step_pca()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> KMeans Cluster
#>  Call: ClusterR::KMeans_rcpp(data = data, clusters = clusters) 
#>  Data cols: 121 
#>  Centroids: 4 
#>  BSS/SS: 0.1003306 
#>  SS: 646321.6 = 581475.8 (WSS) + 64845.81 (BSS)

extract_cluster_assignment(kmeans_fit)
#> # A tibble: 2,930 × 1
#>    .cluster 
#>    <fct>    
#>  1 Cluster_1
#>  2 Cluster_1
#>  3 Cluster_1
#>  4 Cluster_1
#>  5 Cluster_2
#>  6 Cluster_2
#>  7 Cluster_2
#>  8 Cluster_2
#>  9 Cluster_2
#> 10 Cluster_2
#> # … with 2,920 more rows

predict(kmeans_fit, new_data = slice_sample(ames, n = 10))
#> # A tibble: 10 × 1
#>    .pred_cluster
#>    <fct>        
#>  1 Cluster_4    
#>  2 Cluster_2    
#>  3 Cluster_4    
#>  4 Cluster_3    
#>  5 Cluster_1    
#>  6 Cluster_4    
#>  7 Cluster_2    
#>  8 Cluster_2    
#>  9 Cluster_1    
#> 10 Cluster_4
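One thing worth checking while testing this out is how choosing the number of clusters would look. The sketch below is an assumption about how that could be done with tidyclust's `tune_cluster()`; it is not part of the quoted example above. It uses `iris` as a stand-in dataset, the "stats" engine, and hypothetical object names.

```r
library(tidymodels)
library(tidyclust)

set.seed(1234)

# Standardize all numeric columns; Species is dropped so that only
# the four measurements are used. (iris is a stand-in dataset.)
iris_numeric <- iris[, c("Sepal.Length", "Sepal.Width",
                         "Petal.Length", "Petal.Width")]

tune_recipe <- recipe(~ ., data = iris_numeric) %>%
  step_normalize(all_numeric_predictors())

# Mark the number of clusters as a parameter to be tuned
tune_spec <- k_means(num_clusters = tune()) %>%
  set_engine("stats")

tune_wf <- workflow(tune_recipe, tune_spec)

# Candidate values of K and resamples to evaluate them on
k_grid <- tibble(num_clusters = 1:6)
folds <- vfold_cv(iris_numeric, v = 5)

tune_res <- tune_cluster(tune_wf, resamples = folds, grid = k_grid)

# Total within-cluster sum of squares for each K, suitable for an elbow plot
elbow_data <- tune_res %>%
  collect_metrics() %>%
  filter(.metric == "sse_within_total")
```

Plotting `mean` against `num_clusters` from `elbow_data` would give the usual elbow plot.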