bcongelio / nfl-analytics-with-r-book

The repo for Introduction to NFL Analytics with R (published with CRC Press)
https://bradcongelio.com/nfl-analytics-with-r-book/
Creative Commons Zero v1.0 Universal

First k-means graph data is opposite the example #6

Closed. shen3340 closed this issue 1 year ago

shen3340 commented 1 year ago
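library(vroom)
library(dplyr)
library(factoextra)
library(ggplot2)

# Load the rushing data used in the chapter's k-means example
rushing_kmeans_data <- vroom("http://nfl-book.bradcongelio.com/kmeans-data")

rusher_names <- rushing_kmeans_data$player
rusher_ids <- rushing_kmeans_data$player_id

# Drop the identifier columns before running PCA
rushers_pca <- rushing_kmeans_data |>
  select(-player, -player_id)

# Use player names as row labels so they appear on the biplot
rownames(rushers_pca) <- rusher_names

rushers_pca <- prcomp(rushers_pca, center = TRUE, scale. = TRUE)

# nfl_analytics_theme() is not part of factoextra or ggplot2;
# it needs to be defined before this call will run
fviz_pca_biplot(rushers_pca, geom = c("point", "text"), ggtheme = nfl_analytics_theme()) +
  xlim(-6, 3) + labs(title = "PCA Biplot: PC1 and PC2") +
  xlab("PC1 - 35.8%") + ylab("PC2 - 24.6%")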

In Chapter 5's first k-means plot, the PC2 values appear flipped relative to the example plot in the book. My plot shows Dalvin Cook at -3 on PC2, but your graph shows Cook at +3. I was wondering why the sign seems to be reversed for every player.
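One possible explanation, which nobody in this thread confirms, is that the sign of a principal component is arbitrary. The Note in R's `?prcomp` help page says the signs of the rotation columns may differ between programs and even between builds of R. The sketch below uses the built-in `USArrests` data as a stand-in (an assumption, not the book's rushing data) to show that flipping PC2 leaves the variance and the reconstructed data unchanged, so a mirrored axis describes the same solution.

```r
# Illustration on a built-in dataset (USArrests), not the book's rushing data.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Flip the sign of PC2 in both the scores and the loadings.
flipped <- pca
flipped$x[, "PC2"] <- -pca$x[, "PC2"]
flipped$rotation[, "PC2"] <- -pca$rotation[, "PC2"]

# Both versions have the same variance on every component...
same_variance <- isTRUE(all.equal(apply(pca$x, 2, var),
                                  apply(flipped$x, 2, var)))

# ...and both reconstruct the same centered, scaled data.
original_fit <- pca$x %*% t(pca$rotation)
flipped_fit <- flipped$x %*% t(flipped$rotation)
same_fit <- isTRUE(all.equal(original_fit, flipped_fit))

print(c(same_variance = same_variance, same_fit = same_fit))
```

If this is the cause, flipping one component in both `x` and `rotation` the same way before plotting should make a local plot match the orientation of the book's figure.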


bcongelio commented 1 year ago

That is really odd.

I just ran the code you provided and got the graph as displayed in the book (i.e., with Dalvin Cook at +3).

What version of factoextra do you currently have installed?

shen3340 commented 1 year ago

My factoextra package is version 1.0.7

bcongelio commented 1 year ago

I am quite puzzled by this. I ran your code again and my output matched the book; it was not reversed like yours.

Can you run facto_summarize() on your data and paste the results here?

https://rpkgs.datanovia.com/factoextra/reference/facto_summarize.html
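For reference, a minimal sketch of what such a call might look like. It runs on the built-in `USArrests` data as a stand-in (an assumption); in this thread the fitted object would be `rushers_pca` instead of `pca`. The argument choices (`element = "ind"`, `result = "coord"`, `axes = 1:2`) are one reasonable way to get per-player coordinates on PC1 and PC2, not necessarily the exact call the maintainer had in mind.

```r
library(factoextra)

# Illustration on a built-in dataset (USArrests), not the book's rushing data.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Coordinates of each observation ("ind") on the first two components.
ind_summary <- facto_summarize(pca, element = "ind",
                               result = "coord", axes = 1:2)

head(ind_summary)
```

Comparing the sign of a single player's second-component coordinate across two machines would show directly whether the axis is mirrored.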

bcongelio commented 1 year ago

A typo was discovered in this section that could explain your issue.

The text of the book is as follows:

As a result of the PCA process, we know we will be grouping the running back into four distinctive clusters, so we will create a value in our environment called k and set it to 4.

However, when setting the number of clusters, the following code is used:

k <- 3

If you set k <- 4 instead of k <- 3, that could explain your issue.
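A minimal sketch of where `k` enters a k-means workflow, to help check this suggestion. It uses the built-in `USArrests` data as a stand-in, and the seed and `nstart` values are illustrative assumptions; the book's actual `kmeans()` call is not shown in this thread. In this sketch `k` is only passed to `kmeans()`, and the PCA coordinates are computed before it is chosen.

```r
# Illustration on a built-in dataset (USArrests), not the book's rushing data.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]  # PCA coordinates, computed before k is chosen

set.seed(123)           # k-means starts from random centers
fit_k3 <- kmeans(scores, centers = 3, nstart = 25)
fit_k4 <- kmeans(scores, centers = 4, nstart = 25)

# Number of observations assigned to each cluster under each k.
print(table(fit_k3$cluster))
print(table(fit_k4$cluster))
```

Running both values of `k` on your own data and comparing the plots is a quick way to confirm or rule out the typo as the cause.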

Let me know.