In section 2.4 of the clustering tutorial, MDS was used to project the distance matrix into a Euclidean space so that k-means could be applied. I'm OK with doing that in theory; I just don't know how many dimensions to pick.
For instance, the tutorial provided the following:
##-----------2.4 K-means with cosine distance----------
norm.tweet.matrix = diag(1/sqrt(rowSums(tweet.matrix^2))) %*% tweet.matrix
## then create the distance matrix
D =dist(norm.tweet.matrix, method = "euclidean")^2/2
#To visualise the clustering, we will use multidimensional
#scaling to project the data into a 2d space
## perform MDS using 100 dimensions
mds.tweet.matrix <- cmdscale(D, k=100)
n = 5 #we assume elbow bends at 5 clusters
SSW = rep(0, n)
for (a in 1:n) {
  ## use nstart to reduce the effect of the random initialisation
  set.seed(40) # seed for random number generator to ensure consistency in our results
  K = kmeans(mds.tweet.matrix, a, nstart = 20)
  SSW[a] = K$tot.withinss
}
The argument k=100 was specified. Is there a reason for choosing 100?
Wouldn't we lose information by not choosing more dimensions? I would
expect something like k = ncol(matrix_dtm)-1, and I'm not totally certain
why 100 was chosen.
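One way I could check this myself is to look at the eigenvalues that cmdscale returns when eig = TRUE, and see how many dimensions are needed to cover most of the positive eigenvalue mass. This is only a minimal sketch: toy.matrix is a made-up stand-in for tweet.matrix (random counts, not the tutorial data), and the 90% threshold is an arbitrary choice of mine, not something from the tutorial.

```r
# Hypothetical stand-in for tweet.matrix: 30 documents, 12 terms.
# Adding 1 keeps every row non-zero so the normalisation is defined.
set.seed(40)
toy.matrix <- matrix(rpois(30 * 12, lambda = 2) + 1, nrow = 30)

# Same normalisation and distance construction as the tutorial code.
norm.toy.matrix <- diag(1 / sqrt(rowSums(toy.matrix^2))) %*% toy.matrix
D <- dist(norm.toy.matrix, method = "euclidean")^2 / 2

# eig = TRUE returns a list that includes all eigenvalues in $eig,
# regardless of how many dimensions k are kept in $points.
mds <- cmdscale(D, k = 2, eig = TRUE)
eig <- mds$eig

# Keep only the clearly positive eigenvalues and accumulate their share.
pos.eig <- eig[eig > 1e-8]
cum.prop <- cumsum(pos.eig) / sum(pos.eig)

# Smallest number of dimensions covering at least 90% (arbitrary threshold).
k.needed <- which(cum.prop >= 0.9)[1]
cat("dimensions needed for 90% of positive eigenvalue mass:", k.needed, "\n")
```

Note that cmdscale only accepts k up to n - 1, where n is the number of rows (documents), so the upper limit depends on the number of documents as well as on the number of columns.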
I've emailed Dr. D'Souza but does anybody have any input?