cdp13 / DAT_SF_10

Repository for data science 10 course

HW 4 review by Otto S #7

Open ostegm opened 10 years ago

ostegm commented 10 years ago

Hey Caroline,

Nice work! Hope I can provide some feedback that's helpful!

First off, I think you might have made a mistake when you set your Y values. I believe you set Y equal to the images rather than the "target". The target array holds the integer label that identifies each image as person 1, 2, 3, etc. See this screenshot: image

I might be wrong here - but that's what I did! ...
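To make the images-versus-target distinction concrete, here is a minimal sketch. It assumes a scikit-learn dataset object with `images`, `data`, and `target` attributes. The homework dataset isn't named in this thread, so the bundled digits dataset (`load_digits`) is used purely as a stand-in with the same layout.

```python
from sklearn.datasets import load_digits

# Stand-in dataset (assumption): same images/data/target layout as a faces dataset
dataset = load_digits()

X = dataset.data      # one flattened image per row, shape (n_samples, n_pixels)
y = dataset.target    # one integer class label per image, shape (n_samples,)

# dataset.images is a 3-D array of pixels, so it can't serve as the labels
print(dataset.images.shape, X.shape, y.shape)
```

If Y ends up with more than one dimension, it is probably holding pixels rather than labels.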

Good job on transforming the dataset into the 2 principal components. This seemed to work out as predicted.
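For reference, a minimal sketch of the 2-component transform, again using the bundled digits dataset as a stand-in (an assumption, not the homework data). The variable names `pca` and `X_2d` are just illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): any (n_samples, n_features) array works here
X = load_digits().data

# Project every sample onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # two columns, one row per sample
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The resulting `X_2d` array is the kind of 2-dimensional input that can be passed to a scatter plot directly.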

On K-means clustering, I think you should try running it with 40 clusters. There are 40 people in the dataset, so with 64 clusters the algorithm can't produce one cluster per person, and the resulting groups will be harder to interpret.

I made a short function that runs K-means and plots the results, which I've shared below in case it's useful. One thing to note: plotting the K-means clusters only really works on the 2-dimensional dataset you get after the PCA transformation. I think you tried to plot the dataset pre-transformation. See the code below for plotting, but think about using the transformed dataset instead of the original:

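import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

km = KMeans(n_clusters=40, init='k-means++', n_init=10, max_iter=300, random_state=1)

def do_kmeans_40(km, data):
    # data should be the 2-D array produced by the PCA transformation
    km.fit(data)
    centroids = km.cluster_centers_
    # print("centroids:", centroids)
    y = km.predict(data)

    # Scatter each cluster's points separately so each gets its own color
    fig, ax = plt.subplots(1, 1, figsize=(8, 8))
    for t in range(km.n_clusters):
        ax.scatter(data[y == t, 0],
                   data[y == t, 1])

    # Mark the cluster centroids with red squares
    ax.scatter(centroids[:, 0], centroids[:, 1], marker='s', c='r')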

For the final section on the silhouette score, try something like this:

import matplotlib.pyplot as plt
from sklearn import metrics

ks = list(range(2, 40))
scores = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(X)
    labels = km.labels_
    scores.append(metrics.silhouette_score(X, labels, metric='euclidean'))

# Plot against k so the x-axis shows the number of clusters
plt.plot(ks, scores)

This should give you the silhouette score as you change the value of K.
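One way to read the curve is to pick the K with the highest score. Here is a small self-contained sketch of that idea on synthetic data. The four blob centers and the names `ks` and `best_k` are made up for illustration and are not taken from the homework.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): four tight, well-separated blobs
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=1)

# Score each candidate number of clusters
ks = list(range(2, 8))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores.append(silhouette_score(X, labels, metric='euclidean'))

# The K with the highest silhouette score is the best-separated clustering
best_k = ks[int(np.argmax(scores))]
print("best k:", best_k)
```

On real data the peak is usually much less clear-cut than on blobs like these, so treat the highest score as a hint rather than a final answer.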

Hope this helps!

best, Otto

ostegm commented 10 years ago

@cdp13 @kebaler @ghego @craigsakuma