Closed danielrazavi closed 3 weeks ago
Will address this once we have the bandwidth; right now we're prioritizing content review before the beginning of cohort 4.
The issue arose because clusters = kmeans.fit(standardized_penguins) fits the K-means model on the entire DataFrame, including any columns added to it. This caused problems once we added a column for cluster labels to that same DataFrame. By creating a separate copy of the DataFrame for the labels, we avoided this issue, ensuring the model wasn't refitted with the new column, which could have affected the elbow graph.
We tested it with different numbers of clusters, and the elbow graph remained consistent, as expected. Issue resolved.
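For anyone hitting the same thing, here is a minimal sketch of the leak and the fix. It is not the notebook's actual code: the synthetic data, the column names feature_a and feature_b, and the helpers elbow_wssd, step1_in_place, and step1_with_copy are all made up for illustration. Only KMeans, fit, labels_, and inertia_ come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the notebook's standardized_penguins DataFrame.
rng = np.random.default_rng(0)
standardized_penguins = pd.DataFrame(
    rng.normal(size=(150, 2)), columns=["feature_a", "feature_b"]
)


def elbow_wssd(data, ks=range(1, 7)):
    """Return the WSSD (KMeans inertia_) for each k, fitting on `data` as given."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
        for k in ks
    ]


def step1_in_place(n_clusters):
    """Buggy flow: cluster labels are added to the frame the elbow loop fits on."""
    data = standardized_penguins.copy()  # fresh frame, mimics restarting the notebook
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(data)
    data["cluster"] = model.labels_  # label column now sits next to the features
    return data


def step1_with_copy(n_clusters):
    """Fixed flow: labels go on a copy, so the feature frame stays untouched."""
    features = standardized_penguins.copy()
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    labeled = features.copy()
    labeled["cluster"] = model.labels_
    return features, labeled


# Elbow values after running Step 1 with two different n_clusters settings.
leaky_3 = elbow_wssd(step1_in_place(3))
leaky_5 = elbow_wssd(step1_in_place(5))
clean_3 = elbow_wssd(step1_with_copy(3)[0])
clean_5 = elbow_wssd(step1_with_copy(5)[0])

print("in-place labels, elbow unchanged:", np.allclose(leaky_3, leaky_5))
print("labels on a copy, elbow unchanged:", np.allclose(clean_3, clean_5))
```

In the in-place flow, the elbow loop also clusters on the label column, so its values depend on the n_clusters chosen in Step 1. Fitting the loop on the untouched feature frame removes that dependence.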
Describe your issue
In the clustering.ipynb notebook, there's an issue in Step 1: Creating the K-Means Model. The Python cell where the k-means clustering is performed declares the KMeans object with the n_clusters parameter set to a specific value (e.g., 3, 5, etc.).
The problem arises in the "Choosing the Optimal Number of Clusters" section at the bottom of the notebook, where the elbow plot is generated. The graph of the number of clusters, k, versus the Within-Cluster Sum of Squared Distances (WSSD) should be independent of the earlier k-means model definition. However, changing the n_clusters value in the earlier section unexpectedly changes the elbow plot.
Expected behavior: The elbow plot generation and k-means model creation should be logically separated, with the elbow plot remaining unaffected by the n_clusters parameter from the earlier model declaration.
Steps to reproduce
Open the clustering.ipynb notebook.
Navigate to Step 1, where the KMeans model is created. In this step, you will see the declaration of the KMeans object with a parameter like n_clusters=5.
Run the cell to create and fit the k-means model.
Scroll down to the section "Choosing the Optimal Number of Clusters". This section generates the elbow plot, which should display the number of clusters, k, versus the Within-Cluster Sum of Squared Distances (WSSD) to help identify the optimal number of clusters.
Observe the elbow plot so you can compare how it behaves after changing the n_clusters value in Step 1.
Now modify the n_clusters value in Step 1 to a different number (e.g., change it from 5 to 3) and re-run that cell.
Run the cell in the elbow plot section again. You will notice that changing the n_clusters value in Step 1 alters the elbow plot, even though the two should be independent.
What was the expected result?
The n_clusters value in the earlier KMeans model declaration should not affect the elbow plot generated in the "Choosing the Optimal Number of Clusters" section. These two sections should be logically separated.
Put here any screenshots or videos (optional)
Put here the code owner you'd like to review this issue.
@danielrazavi