Closed AbrahamSanders closed 4 years ago
Addressed in c98d87d303852f766a6b629061d103fbea77b0fa.
Implemented option (c) from above: we assemble "documents" by concatenating the top-k nearest neighbors to the cluster and sub-cluster centers. These documents are fed to a Hugging Face implementation of DistilBART (sshleifer/distilbart-xsum-12-6) for summarization.
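A minimal sketch of the document-assembly step, assuming tweets are already embedded as vectors and a cluster center is available (the function name `assemble_document` and the distance metric are illustrative, not the repo's actual code):

```python
import numpy as np

def assemble_document(texts, embeddings, center, k=5):
    """Concatenate the k tweets whose embeddings are nearest
    (by Euclidean distance) to the given cluster center."""
    dists = np.linalg.norm(embeddings - center, axis=1)
    nearest = np.argsort(dists)[:k]
    return " ".join(texts[i] for i in nearest)

# The assembled document can then be summarized to produce the label,
# e.g. with the huggingface pipeline (parameters illustrative):
#   from transformers import pipeline
#   summarizer = pipeline("summarization",
#                         model="sshleifer/distilbart-xsum-12-6")
#   label = summarizer(doc, max_length=30)[0]["summary_text"]
```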
Until now, we have been labeling clusters manually: inspecting the tweet text in each cluster and subjectively judging its most significant themes.
Open question: ideally this should be done automatically after clustering. Solving this along with #3 would allow for a fully automated pipeline from search & sampling to visualization.
Some possible techniques: a) Term frequency analysis of the top-k nearest neighbors to the cluster and sub-cluster centers. The top n non-stopword terms could be put together to form a label.
b) Topic modeling via Latent Dirichlet Allocation (LDA) or Latent Semantic Analysis (LSA) (gensim has APIs for this in python)
c) Assemble a "document" by concatenating the top-k nearest neighbors to the cluster and sub-cluster center and then use deep learning methods (transformers) to create a summary of the "document" to act as the label. Here is a recent neural text summarization model: Text Summarization with Pretrained Encoders
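For reference, option (a) can be sketched in a few lines of stdlib Python (the stopword list and the `frequency_label` helper are illustrative assumptions):

```python
from collections import Counter
import re

# Illustrative stopword set; a real pipeline would use a fuller list
# (e.g. from NLTK or spaCy).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "it", "for", "on", "this", "that", "with", "are", "be"}

def frequency_label(texts, n=3):
    """Build a cluster label from the n most frequent non-stopword
    terms across the given texts (e.g. the top-k nearest neighbors
    to a cluster or sub-cluster center)."""
    tokens = re.findall(r"[a-z']+", " ".join(texts).lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return " / ".join(term for term, _ in counts.most_common(n))
```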