RyanGreenup / SWA-Project

Github for Social Web Analytics

Projecting Cosine Distance into Euclidean Space #10

Open RyanGreenup opened 4 years ago

RyanGreenup commented 4 years ago

In the clustering tutorial, section 2.4, MDS was used to project cosine distance into a Euclidean space so that k-means could be applied. I'm OK with doing that in theory; I just don't know how many dimensions to pick.
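For context, the reason a Euclidean `dist` call can stand in for cosine distance is that, once every row is scaled to unit length, half the squared Euclidean distance equals one minus the cosine similarity. A minimal sketch to check this numerically, assuming a small synthetic matrix `X` (hypothetical data, not the tutorial's `tweet.matrix`):

```r
## Synthetic stand-in for a document-term matrix (hypothetical data)
set.seed(1)
X <- matrix(rpois(6 * 4, lambda = 2) + 1, nrow = 6)

## Scale every row to unit length
X.norm <- X / sqrt(rowSums(X^2))

## Cosine distance computed directly: 1 - cosine similarity
D.cos <- 1 - X.norm %*% t(X.norm)

## Half the squared Euclidean distance between the normalised rows
D.euc <- as.matrix(dist(X.norm, method = "euclidean"))^2 / 2

## The two should agree up to floating-point error
all.equal(D.cos, D.euc, check.attributes = FALSE)
```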

For instance in the tutorial the following was provided:

##-----------2.4 K-means with cosine distance----------

norm.tweet.matrix = diag(1/sqrt(rowSums(tweet.matrix^2))) %*% tweet.matrix
## then create the distance matrix
D =dist(norm.tweet.matrix, method = "euclidean")^2/2
#To visualise the clustering, we will use multidimensional 
#scaling to project the data into a 2d space
## perform MDS using 100 dimensions
mds.tweet.matrix <- cmdscale(D, k=100)
n = 5 #we assume elbow bends at 5 clusters  
SSW = rep(0, n)
for (a in 1:n) {
  ## use nstart to reduce the effect of the random initialisation
  set.seed(40)#seed for random number generator to ensure consistency in our results
  K = kmeans(mds.tweet.matrix, a, nstart = 20)
  SSW[a] = K$tot.withinss
}

The argument k=100 was specified. Is there a reason for choosing 100?

Wouldn't we lose information by not choosing more dimensions? I would have expected something like k = ncol(matrix_dtm)-1, and I'm not sure why 100 was chosen.
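One way to gauge how much would be lost at a given k is to ask `cmdscale` for its eigenvalues (`eig = TRUE`) and look at the cumulative share carried by the leading dimensions. This is only a sketch on synthetic data; `X`, `pos`, and `retained` are names made up for the illustration, and ignoring the negative eigenvalues is my assumption rather than anything from the tutorial:

```r
## Hypothetical stand-in for the tweet data: 30 documents, 50 terms
set.seed(1)
X <- matrix(rpois(30 * 50, lambda = 1), nrow = 30) + 1
X.norm <- X / sqrt(rowSums(X^2))
D <- dist(X.norm, method = "euclidean")^2 / 2

## eig = TRUE returns the eigenvalues alongside the coordinates
mds <- cmdscale(D, k = 2, eig = TRUE)

## Keep only the positive eigenvalues (returned in decreasing order)
ev <- mds$eig
pos <- ev[ev > 0]

## Cumulative share of the positive eigenvalue mass in the first k dimensions
retained <- cumsum(pos) / sum(pos)
head(retained, 10)
```

If `retained` is already close to 1 at some k, adding dimensions beyond that k should change the k-means result very little.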

I've emailed Dr. D'Souza but does anybody have any input?

RyanGreenup commented 4 years ago

I've decided to use the number of eigenvalues as a good value; see commit fe67140. I'm leaving this open until I hear back from Chris, though.
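Roughly, the idea can be sketched as below. This takes "number of eigenvalues" to mean the count of positive eigenvalues, which is an assumption for the example and may not match the commit exactly. The data, the tolerance `tol`, and the name `k.pos` are also made up for the illustration:

```r
## Hypothetical stand-in for the tweet data: 30 documents, 50 terms
set.seed(1)
X <- matrix(rpois(30 * 50, lambda = 1), nrow = 30) + 1
X.norm <- X / sqrt(rowSums(X^2))
D <- dist(X.norm, method = "euclidean")^2 / 2

## First pass only to obtain the eigenvalues
ev <- cmdscale(D, k = 1, eig = TRUE)$eig

## Count eigenvalues that are positive beyond a small numerical tolerance
tol <- 1e-8 * max(ev)              # arbitrary cut-off (assumption)
k.pos <- sum(ev > tol)
k.pos <- min(k.pos, nrow(X) - 1)   # cmdscale requires k <= n - 1

## Second pass: project into k.pos dimensions, ready for kmeans()
mds.points <- cmdscale(D, k = k.pos)
dim(mds.points)
```

Note that the upper bound `cmdscale` enforces is the number of rows minus one, not the number of columns.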