ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.

Chapter 9 - Clustering for Semi-Supervised Learning #162

Open · mshearer0 opened this issue 4 years ago

mshearer0 commented 4 years ago

I get a different set of representative digits from the ones in the notebook. Labelling them with

y_representative_digits = np.array([ 0, 1, 3, 2, 7, 6, 4, 6, 9, 5, 1, 2, 9, 5, 2, 7, 8, 1, 8, 6, 3, 1, 5, 4, 5, 4, 0, 3, 2, 6, 1, 7, 7, 9, 1, 8, 6, 5, 4, 8, 5, 3, 3, 6, 7, 9, 7, 8, 4, 9])

produces a log_reg score of 92.4%. Alternatively, the labels can be taken directly from the training set:

y_representative_digits = y_train[representative_digit_idx]
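
For context, a condensed sketch of the pipeline (assuming the chapter's load_digits setup; exact hyperparameters may differ from the notebook):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, random_state=42)

k = 50
kmeans = KMeans(n_clusters=k, random_state=42)
X_digits_dist = kmeans.fit_transform(X_train)        # (n_train, k) distances to centroids
representative_digit_idx = np.argmin(X_digits_dist, axis=0)  # nearest instance per cluster
X_representative_digits = X_train[representative_digit_idx]

# Label the 50 representative images by hand, or take the labels from y_train:
y_representative_digits = y_train[representative_digit_idx]

log_reg = LogisticRegression(max_iter=10_000, random_state=42)
log_reg.fit(X_representative_digits, y_representative_digits)
print(log_reg.score(X_test, y_test))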

anson-07 commented 4 years ago

The same thing happened to me. Not sure why; hopefully I can find out the reason behind it.

erpda commented 4 years ago

I came to GitHub looking for the answer to a related question: why are these digits representative? In the line

representative_digit_idx = np.argmin(X_digits_dist, axis=0)

what does argmin do to make them representative?

anson-07 commented 4 years ago


argmin finds the index of the minimum value along axis=0. Each column of X_digits_dist holds the distances from every training instance to one centroid, so the minimum of a column picks out the instance closest to that cluster's centroid, and that is what makes it representative.
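
A tiny example with made-up distances (hypothetical values, not the notebook's data) shows the axis=0 behaviour:

import numpy as np

# Rows are training instances, columns are clusters; each value is the
# distance to a centroid, like the matrix returned by kmeans.fit_transform().
X_digits_dist = np.array([
    [0.9, 2.1],
    [0.2, 1.8],   # row 1 is closest to cluster 0
    [1.5, 0.3],   # row 2 is closest to cluster 1
    [1.1, 1.0],
])

# For each cluster (column), the row index of the nearest instance:
representative_digit_idx = np.argmin(X_digits_dist, axis=0)
print(representative_digit_idx)   # [1 2]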

ashishthanki commented 4 years ago


Hi,

Did you set random_state=42 both when splitting the dataset and when training? Hopefully, that should solve your problem.

Thanks

Ash

mshearer0 commented 4 years ago

Thanks Ash.

Yes, random_state=42 is set in both the train/test split and the KMeans definition.
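
For what it's worth, the hand labels can be sanity-checked against the dataset's own labels; a minimal sketch, assuming the notebook's variables (y_representative_digits, representative_digit_idx, y_train) are in scope:

# Fraction of hand-entered labels that agree with the dataset's labels
# for the same representative digits; 1.0 means no labelling mistakes.
match_rate = (y_representative_digits == y_train[representative_digit_idx]).mean()
print(match_rate)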