TheRensselaerIDEA / twitter-nlp

Data Analytics on Twitter with Natural Language Processing
MIT License

[Research / Analysis] Automatic clustering and subclustering of tweets in the embedding space #3

Open AbrahamSanders opened 4 years ago

AbrahamSanders commented 4 years ago

Right now we use the "elbow method" to manually choose the number of top-level k-means clusters, and then we run k-means again on each top-level cluster with a fixed number of sub-clusters.
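For reference, the current two-stage procedure could be sketched roughly as below. This is a minimal illustration, not the repository's actual code: it assumes Python with scikit-learn, uses synthetic blobs in place of tweet embeddings, and the helper names `inertia_curve` and `two_level_kmeans` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_curve(embeddings, k_values, random_state=0):
    """Within-cluster sum of squares for each k; plotted and read by eye in the elbow method."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(embeddings).inertia_
        for k in k_values
    }


def two_level_kmeans(embeddings, n_top, n_sub, random_state=0):
    """Top-level k-means, then k-means with a fixed n_sub inside each top-level cluster."""
    top = KMeans(n_clusters=n_top, n_init=10, random_state=random_state).fit_predict(embeddings)
    sub = np.zeros(len(embeddings), dtype=int)
    for c in range(n_top):
        mask = top == c
        k = min(n_sub, int(mask.sum()))  # guard against clusters smaller than n_sub
        sub[mask] = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(
            embeddings[mask]
        )
    return top, sub


# Synthetic stand-in for tweet embeddings (assumption: 3 blobs in 16 dimensions).
X, _ = make_blobs(n_samples=300, centers=3, n_features=16, cluster_std=1.0, random_state=0)
curve = inertia_curve(X, range(2, 9))          # inspect this curve manually to pick n_top
top, sub = two_level_kmeans(X, n_top=3, n_sub=4)
```

Both `n_top` and `n_sub` are chosen by a person here, which is the manual step this issue is about.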

Open question: We want to evaluate alternatives to the "elbow" method that can automate selecting the optimal number of clusters at both the top level and the sub-level. Some good starting points for investigation are https://towardsdatascience.com/clustering-metrics-better-than-the-elbow-method-6926e1f723a6 and https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/
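One candidate that needs no manual reading of a plot is the mean silhouette score: fit k-means for a range of k and keep the k with the highest score. The sketch below is only an illustration under assumptions: Python with scikit-learn, synthetic blobs standing in for tweet embeddings, and a hypothetical helper `choose_k_by_silhouette`. Whether the silhouette score is the right metric for our embeddings is part of what needs evaluating.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def choose_k_by_silhouette(embeddings, k_min=2, k_max=10, random_state=0):
    """Return (best_k, scores), where scores maps each candidate k to its mean silhouette."""
    scores = {}
    # The silhouette score is only defined for 2 <= k <= n_samples - 1.
    upper = min(k_max, len(embeddings) - 1)
    for k in range(max(k_min, 2), upper + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for tweet embeddings (assumption: 4 well-separated blobs, 16 dimensions).
X, _ = make_blobs(n_samples=400, centers=4, n_features=16, cluster_std=0.5, random_state=42)
best_k, scores = choose_k_by_silhouette(X, k_min=2, k_max=8)
print(best_k, scores)
```

The same function could be called on the rows of each top-level cluster to pick a separate sub-cluster count per cluster instead of a fixed one. Note that computing silhouettes is quadratic in the number of points, so on a large tweet sample it may need subsampling (for example via the `sample_size` argument of `silhouette_score`).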

Alternatively: Perhaps we can use hierarchical clustering and then cut the dendrogram at two depths to get our top-level and sub-level clusters. We should also evaluate whether the number of clusters can be selected automatically when cutting the dendrogram. A good starting point (in R) is here: https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/
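A rough Python counterpart to the R tutorial, using SciPy's agglomerative clustering, might look like the sketch below. It is a sketch under assumptions, not a proposal for the final method: the data is synthetic (2 top-level groups with 3 subgroups each), Ward linkage is an arbitrary choice, and `k_from_largest_gap` is a hypothetical heuristic that picks the cut with the largest jump between consecutive merge heights. The sub-level count is still passed in by hand here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def k_from_largest_gap(Z, k_max=10):
    """Heuristic: choose k in [2, k_max] where the gap between consecutive merge heights is largest."""
    m = min(k_max, Z.shape[0])
    last = Z[-m:, 2]                 # heights of the last m merges, in ascending order
    gaps = np.diff(last)
    # Cutting inside gaps[i] leaves m - i clusters (the final gap gives k = 2).
    return int(m - np.argmax(gaps))


# Synthetic stand-in for tweet embeddings: 2 distant groups, each with 3 tight subgroups.
rng = np.random.default_rng(0)
top_centers = np.array([np.zeros(8), np.full(8, 20.0)])
sub_offsets = 4.0 * np.eye(8)[:3]
X = np.vstack([
    c + o + rng.normal(scale=0.3, size=(50, 8))
    for c in top_centers
    for o in sub_offsets
])

Z = linkage(X, method="ward")                      # one dendrogram for both levels
n_top = k_from_largest_gap(Z, k_max=10)            # automated top-level choice
top = fcluster(Z, t=n_top, criterion="maxclust")   # shallow cut: top-level clusters
sub = fcluster(Z, t=6, criterion="maxclust")       # deeper cut: 6 is hand-picked for this toy data

# Because both cuts come from the same tree, each sub-cluster sits inside one top-level cluster.
nested = all(len(set(top[sub == s])) == 1 for s in np.unique(sub))
print(n_top, len(np.unique(sub)), nested)
```

One appeal of this route is that the sub-level clusters are nested in the top-level ones by construction. The main cost is that `linkage` needs memory quadratic in the number of tweets, so feasibility on the full dataset would have to be checked.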