Closed AbrahamSanders closed 4 years ago
Addressed in c98d87d303852f766a6b629061d103fbea77b0fa.
Implemented option (c) from above: we assemble "documents" by concatenating the top-k nearest neighbors to the cluster and sub-cluster centers. These documents are fed to a Hugging Face implementation of DistilBART (sshleifer/distilbart-xsum-12-6) for summarization.
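A minimal sketch of the document-assembly step, assuming tweets are already embedded as vectors and a cluster center is available (the function name `assemble_document` and the distance metric are illustrative, not the repo's actual code):

```python
import numpy as np

def assemble_document(texts, embeddings, center, k=5):
    """Concatenate the k tweets whose embeddings are nearest
    (by Euclidean distance) to the given cluster center."""
    dists = np.linalg.norm(embeddings - center, axis=1)
    nearest = np.argsort(dists)[:k]
    return " ".join(texts[i] for i in nearest)

# The assembled document can then be summarized to produce the label,
# e.g. with the huggingface pipeline (parameters illustrative):
#   from transformers import pipeline
#   summarizer = pipeline("summarization",
#                         model="sshleifer/distilbart-xsum-12-6")
#   label = summarizer(doc, max_length=30)[0]["summary_text"]
```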
Until now, we have been labeling clusters manually: inspecting the tweet text in each cluster and subjectively judging its most significant themes.
Open question: ideally this should be done automatically after clustering. Solving this along with #3 would allow for a fully automated pipeline from search & sampling to visualization.
Some possible techniques: a) Term frequency analysis of the top-k nearest neighbors to the cluster and sub-cluster centers. The top n non-stopword terms could be put together to form a label.
b) Topic modeling via Latent Dirichlet Allocation (LDA) or Latent Semantic Analysis (LSA) (gensim has APIs for this in python)
c) Assemble a "document" by concatenating the top-k nearest neighbors to the cluster and sub-cluster center and then use deep learning methods (transformers) to create a summary of the "document" to act as the label. Here is a recent neural text summarization model: Text Summarization with Pretrained Encoders
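For reference, option (a) can be sketched in a few lines of stdlib Python (the stopword list and the `frequency_label` helper are illustrative assumptions):

```python
from collections import Counter
import re

# Illustrative stopword set; a real pipeline would use a fuller list
# (e.g. from NLTK or spaCy).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "it", "for", "on", "this", "that", "with", "are", "be"}

def frequency_label(texts, n=3):
    """Build a cluster label from the n most frequent non-stopword
    terms across the given texts (e.g. the top-k nearest neighbors
    to a cluster or sub-cluster center)."""
    tokens = re.findall(r"[a-z']+", " ".join(texts).lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return " / ".join(term for term, _ in counts.most_common(n))
```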