UofT-DSI / applying_statistical_concepts

[Bug]: K-Means Cluster Number Interference in Clustering.ipynb #92

Closed danielrazavi closed 3 weeks ago

danielrazavi commented 3 weeks ago

Describe your issue

In the clustering.ipynb notebook, there's an issue in Step 1: Creating the K-Means Model. The Python cell where the k-means clustering is performed contains the declaration of the KMeans object with the n_clusters parameter set to a specific value (e.g., 3, 5, etc.).

The problem arises in the "Choosing the Optimal Number of Clusters" section at the bottom of the notebook, where the elbow plot is generated. The graph of the number of clusters, k, versus the Within-Cluster Sum of Squared Distances (WSSD) should not depend on the earlier k-means model definition. However, changing the n_clusters value in the earlier section changes the elbow plot.
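For reference, WSSD is usually defined as below. The symbols are notation introduced here, not taken from the notebook: C_j is the set of points assigned to cluster j and \mu_j is that cluster's centroid.

$$
\mathrm{WSSD}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

For a fixed dataset and a fixed fitting procedure, this value depends only on k, which is why the elbow plot should not react to the n_clusters chosen in Step 1. In scikit-learn, a fitted KMeans object exposes this quantity as its inertia_ attribute.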

Expected behavior: The elbow plot generation and k-means model creation should be logically separated, with the elbow plot remaining unaffected by the n_clusters parameter from the earlier model declaration.

Steps to reproduce

  1. Open the clustering.ipynb notebook.

  2. Navigate to Step 1 where the KMeans model is created. In this step, you will see the declaration of the KMeans object with a parameter like n_clusters=5.

    # Perform K-means clustering
    kmeans = KMeans(n_clusters=5, random_state=0)
    clusters = kmeans.fit(standardized_penguins)

  3. Run the cell to create and fit the k-means model.

  4. Scroll down to the section "Choosing the Optimal Number of Clusters". This section generates the elbow plot, which should display the number of clusters, k, versus the Within-Cluster Sum of Squared Distances (WSSD) to help identify the optimal number of clusters.

  5. Run the elbow plot cell and note the shape of the plot, so you can compare it after changing the n_clusters value in Step 1.

  6. Now modify the n_clusters value in Step 1 to a different number (e.g., change it from 5 to 3) and re-run that cell.

  7. Run the cell in the elbow plot section again. You will notice that changing the n_clusters value in Step 1 alters the elbow plot, even though the two should be independent.
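The steps above can be approximated outside the notebook with a minimal sketch. It assumes that the cluster labels get stored in the same DataFrame that is later used for the elbow plot, which is a guess at the notebook's flow rather than a copy of it. The synthetic data, the column names, and the helper functions `elbow_wssd` and `wssd_after_step1` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the notebook's standardized_penguins (assumption).
rng = np.random.default_rng(0)
features = pd.DataFrame(
    rng.normal(size=(150, 2)), columns=["feature_a", "feature_b"]
)


def elbow_wssd(df, k_values):
    """Fit one KMeans per k on every column of df and return the WSSD values."""
    return [
        KMeans(n_clusters=k, random_state=0, n_init=10).fit(df).inertia_
        for k in k_values
    ]


def wssd_after_step1(n_clusters):
    """Mimic the suspected flow: fit, store labels in the SAME frame, build elbow data."""
    df = features.copy()  # fresh frame for each simulated notebook run
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit(df)
    df["cluster"] = kmeans.labels_  # label column now sits next to the features
    return elbow_wssd(df, range(1, 7))  # elbow fits see the label column too


wssd_with_3 = wssd_after_step1(3)
wssd_with_5 = wssd_after_step1(5)

# If the two curves differ, the Step 1 setting has leaked into the elbow data.
print(np.allclose(wssd_with_3, wssd_with_5))
```

If the assumption holds, the two WSSD curves differ because the label column, whose values depend on n_clusters, is treated as an extra feature.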

What was the expected result?

Put here any screenshots or videos (optional)

image image

Put here the code owner you'd like to review this issue.

@danielrazavi

danielrazavi commented 3 weeks ago

We will address this once we have the bandwidth. Right now we're prioritizing content review before the beginning of cohort 4.

danielrazavi commented 3 weeks ago

The issue arose because clusters = kmeans.fit(standardized_penguins) fits the K-means model on every column of the DataFrame it is given. Once we added a column of cluster labels to that same DataFrame, any later fit on it, such as those behind the elbow graph, would also use the label column, and those labels depend on n_clusters. By working with a separate copy of the DataFrame, we keep the label column out of the data used for fitting, so the model is never refitted with the new column and the elbow graph is unaffected.

We tested it with different numbers of clusters, and the elbow graph remained consistent, as expected. Issue resolved.
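A minimal sketch of the copy-based approach is below. It shows one possible reading of the fix, in which the labels are attached to a copy while the frame used for fitting keeps only the features. The synthetic data, the column names, and the helpers `elbow_wssd` and `wssd_with_copy` are hypothetical and are not the notebook's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the notebook's standardized_penguins (assumption).
rng = np.random.default_rng(0)
standardized = pd.DataFrame(
    rng.normal(size=(150, 2)), columns=["feature_a", "feature_b"]
)


def elbow_wssd(df, k_values):
    """Fit one KMeans per k on every column of df and return the WSSD values."""
    return [
        KMeans(n_clusters=k, random_state=0, n_init=10).fit(df).inertia_
        for k in k_values
    ]


def wssd_with_copy(n_clusters):
    """Fit on the feature frame, attach labels to a COPY, then build elbow data."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    kmeans.fit(standardized)
    labelled = standardized.copy()        # labels go on the copy only
    labelled["cluster"] = kmeans.labels_
    return elbow_wssd(standardized, range(1, 7))  # feature frame is unchanged


wssd_fixed_3 = wssd_with_copy(3)
wssd_fixed_5 = wssd_with_copy(5)

# Expected to print True: the data used for the elbow fits never changes.
print(np.allclose(wssd_fixed_3, wssd_fixed_5))
```

Keeping the labelled frame separate means plots that need the cluster column can use `labelled`, while every fit reads from the untouched feature frame.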