WheatonCS / Lexos

Python/Flask-based website for text analysis workflow. Previous (stable) release is live at:
http://lexos.wheatoncollege.edu
MIT License

K-Means definitions #711

Closed by Weiqi97 6 years ago

Weiqi97 commented 6 years ago

I got a bit confused while working on K-Means. In the current master/live server, we define PCA and Voronoi as two different visualizations of the K-Means clustering result. However, I think K-Means, PCA, and Voronoi are three different things.

scottkleinman commented 6 years ago

Is there an implementation issue that we need to change here or do we need to make clearer how the visualisations are implemented (either in the UI or in In the Margins)?

mleblanc321 commented 6 years ago

well, i don't think our methods are wrong; ... yes, but clarity is (really) needed

i was (also) surprised that k-means (on server, v3.1.1) performs PCA before both visualizations (PCA vs. Voronoi); my stats fail me here (as to why folks often use PCA prior to clustering)

@Weiqi97 noted that SciPy has a specific Voronoi method, independent of (but similar to) k-means.
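
For what it's worth, here is a minimal sketch of why the two look so similar. It does not use SciPy or any Lexos code; the centroids and points are made-up values, and `nearest_centroid` is a hypothetical helper. The idea is only that a point falls in the Voronoi cell of a centroid exactly when that centroid is its nearest one, which is also the rule k-means uses to assign points to clusters.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Made-up 2D centroids, standing in for k-means centroids of PCA-reduced documents.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]

# Made-up 2D points. The index returned for each point is both its Voronoi cell
# (with the centroids as sites) and its k-means cluster assignment.
points = [(0.5, 0.2), (3.0, 1.0), (1.0, 3.5)]
cells = [nearest_centroid(p, centroids) for p in points]
print(cells)
```

So a Voronoi diagram drawn around the final centroids is one way of showing the cluster regions, while k-means itself is the procedure that finds those centroids.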

scottkleinman commented 6 years ago

I haven't looked closely at the documentation, so maybe I'm missing something, but isn't it because clustering takes place based on distances in Cartesian space, which can be calculated based on the points after PCA's dimensionality reduction? But clustering can also be done without this pre-processing step.
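
To make the two options concrete, here is a rough sketch that clusters the same data once in the full feature space and once after PCA. It is not the Lexos implementation: it uses plain NumPy, a hand-written Lloyd's k-means with a deterministic farthest-point start, and synthetic data in place of a real document-term matrix. The names `pca_project` and `kmeans` are hypothetical.

```python
import numpy as np

def pca_project(X, n_components):
    """Center X and project it onto its top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centroids = [X[0]]
    while len(centroids) < k:
        # Distance from every point to its closest already-chosen centroid.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if it has none.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Synthetic stand-in for a document-term matrix: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 10)), rng.normal(10, 1, (20, 10))])

reduced = pca_project(X, 2)
labels_raw, _ = kmeans(X, k=2)        # cluster in the full 10-dimensional space
labels_pca, _ = kmeans(reduced, k=2)  # cluster the 2D PCA coordinates instead
print((labels_raw == labels_pca).all())
```

On toy data this cleanly separated, both routes give the same grouping. On real texts PCA throws away some variance, so the two results can differ, which is probably worth stating wherever we describe the method.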

Weiqi97 commented 6 years ago

I did not use the Voronoi method from the SciPy library. Instead, I renamed the PCA visualization method to 2D-Scatter and added a 3D-Scatter visualization method. In the documentation, we should mention that we perform PCA with 2 components before Voronoi and 2D-Scatter, and PCA with 3 components before 3D-Scatter.
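
As an illustration for the documentation, the rule could be sketched like this. This is not the actual Lexos code: the `N_COMPONENTS` dict and the `reduce_for_visualization` function are hypothetical names, PCA is done with a plain NumPy SVD, and the input is random data. Only the visualization names and the component counts come from the description above.

```python
import numpy as np

# Component counts per visualization, as described above.
# The dict and function names are hypothetical, for illustration only.
N_COMPONENTS = {"Voronoi": 2, "2D-Scatter": 2, "3D-Scatter": 3}

def reduce_for_visualization(X, method):
    """Project X onto as many principal components as `method` needs."""
    n = N_COMPONENTS[method]
    Xc = X - X.mean(axis=0)                       # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n].T                          # coordinates on the top n components

# Random stand-in for a document-term matrix (12 documents, 6 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 6))

coords_2d = reduce_for_visualization(X, "2D-Scatter")
coords_3d = reduce_for_visualization(X, "3D-Scatter")
print(coords_2d.shape, coords_3d.shape)
```

The first two columns of the 3D coordinates match the 2D coordinates, since both come from the same ordered components.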

Weiqi97 commented 6 years ago

#758 resolved the problem discussed here.