Weiqi97 commented 6 years ago

I got a bit confused working on K-Means. In the current master/live server, we define PCA and Voronoi as two different visualizations of the K-Means clustering result. However, I think K-Means, PCA, and Voronoi are three different things.

PCA performs linear dimensionality reduction, using singular value decomposition of the data to project it onto a lower-dimensional space. In our case, we reduce the N (number of documents) by M (number of distinct words) DTM to an N by 2 matrix. We perform this reduction before clustering because it reduces noise. I think it makes some sense to call our method PCA since the result lies in a plane, but I'm not sure whether users need a better explanation.
- Principal Component Analysis on Wikipedia
- Relationship between K-Means and PCA
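To make the N by M to N by 2 reduction concrete, here is a minimal sketch of PCA via SVD using only NumPy. The function name `pca_project` and the random toy DTM are assumptions for illustration; this is not the code Lexos actually runs.

```python
import numpy as np


def pca_project(dtm: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project the rows of `dtm` onto its top principal components via SVD.

    `pca_project` is a hypothetical helper, not a Lexos or library function.
    """
    # PCA works on mean-centered columns (one column per distinct word).
    centered = dtm - dtm.mean(axis=0)
    # Rows of `vt` are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Keep the first `n_components` directions and project onto them.
    return centered @ vt[:n_components].T


# Toy DTM: 6 documents (N) by 50 distinct words (M), random counts.
rng = np.random.default_rng(0)
dtm = rng.integers(0, 20, size=(6, 50)).astype(float)

coords = pca_project(dtm, n_components=2)
print(coords.shape)  # (6, 2)
```

Each document ends up as one point in the plane, which is what both visualizations then draw.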
A Voronoi diagram does serve as a visualization of K-Means clustering. We currently do this by first generating the K-Means result and then putting it in Voronoi diagram format. (We actually apply PCA here too.) I found an interesting post here about the difference between K-Means and Voronoi. One of the biggest differences the post points out is that K-Means requires the number of clusters to be defined before the analysis, while Voronoi does not. Accordingly, the Voronoi method in the SciPy library takes no parameter for the number of clusters.
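A rough sketch of the pipeline described above (PCA, then K-Means, then a Voronoi diagram built from the centroids), assuming scikit-learn and SciPy. The toy DTM and the choice of 4 clusters are made up, and the real Lexos code may differ. Note that `KMeans` needs `n_clusters`, while `scipy.spatial.Voronoi` only takes a set of points.

```python
import numpy as np
from scipy.spatial import Voronoi
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Toy DTM: 12 documents by 40 distinct words (assumed data, not from Lexos).
rng = np.random.default_rng(0)
dtm = rng.integers(0, 20, size=(12, 40)).astype(float)

# Step 1: reduce the DTM to 2 dimensions.
coords = PCA(n_components=2).fit_transform(dtm)

# Step 2: K-Means requires the number of clusters up front.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(coords)

# Step 3: Voronoi takes only points (here the centroids), no cluster count.
vor = Voronoi(kmeans.cluster_centers_)

# Each Voronoi cell is the region closest to one centroid, so assigning a
# document to its nearest centroid is the same as asking which cell it is in.
dists = np.linalg.norm(
    coords[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)
print(vor.points.shape, nearest.shape)
```

So in this setup the Voronoi diagram is a way of drawing the K-Means result, not a separate clustering method.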

scottkleinman commented 6 years ago

Is there an implementation issue that we need to change here or do we need to make clearer how the visualisations are implemented (either in the UI or in In the Margins)?

mleblanc321 commented 6 years ago

well, i don't think our methods are wrong; ... yes, but clarity is (really) needed

i was (also) surprised that k-means (on server, v3.1.1) performs PCA before both visualizations (PCA vs. Voronoi); my stats fail me here (as to why folks often use PCA prior to clustering)

@Weiqi97 noted that SciPy has a specific Voronoi method, independent of (but similar to) k-means;

scottkleinman commented 6 years ago

I haven't looked closely at the documentation, so maybe I'm missing something, but isn't it because clustering takes place based on distances in Cartesian space, which can be calculated based on the points after PCA's dimensionality reduction? But clustering can also be done without this pre-processing step.

Weiqi97 commented 6 years ago

I did not use the Voronoi method from the SciPy library. Instead, I changed the visualization method name from PCA to 2D-Scatter. I also added a 3D-Scatter visualization method. In the documentation, we should mention that we perform PCA with 2 components before Voronoi and 2D-Scatter, and with 3 components before 3D-Scatter.
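For the documentation, the component counts could be illustrated roughly like this. It is only a sketch assuming scikit-learn; the `COMPONENTS` table and the `cluster_for` helper are hypothetical names, not taken from the Lexos source.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical lookup table: PCA components used before each visualization.
COMPONENTS = {"Voronoi": 2, "2D-Scatter": 2, "3D-Scatter": 3}


def cluster_for(visualization: str, dtm: np.ndarray, k: int):
    """Reduce the DTM for the given visualization, then run K-Means on it."""
    n_components = COMPONENTS[visualization]
    coords = PCA(n_components=n_components).fit_transform(dtm)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    return coords, labels


# Toy DTM: 10 documents by 30 distinct words (assumed data).
rng = np.random.default_rng(1)
dtm = rng.integers(0, 20, size=(10, 30)).astype(float)

coords_2d, labels_2d = cluster_for("2D-Scatter", dtm, k=3)
coords_3d, labels_3d = cluster_for("3D-Scatter", dtm, k=3)
print(coords_2d.shape, coords_3d.shape)  # (10, 2) (10, 3)
```

Because clustering runs on the reduced coordinates, the 2D and 3D views are not guaranteed to produce identical cluster assignments.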

Weiqi97 commented 6 years ago

#758 resolved the problem discussed here.

WheatonCS / Lexos

K-Means definitions #711
