TheRensselaerIDEA / twitter-nlp

Data Analytics on Twitter with Natural Language Processing
MIT License

[Research / Analysis] Automatic cluster label assignment #4

Closed AbrahamSanders closed 4 years ago

AbrahamSanders commented 4 years ago

So far we have been labeling clusters manually, by reading the tweet text and subjectively judging the most significant themes within each cluster.

Open question: ideally this would be done automatically after clustering. Solving this along with #3 would allow for a fully automated pipeline from search & sampling to visualization.

Some possible techniques:

a) Term frequency analysis of the top-k nearest neighbors to the cluster and sub-cluster centers. The top-n non-stopword terms could be concatenated to form a label (see the first sketch after this list).

b) Topic modeling via Latent Dirichlet Allocation (LDA) or Latent Semantic Analysis (LSA); gensim provides Python APIs for both (see the second sketch after this list).

c) Assemble a "document" by concatenating the top-k nearest neighbors to the cluster and sub-cluster centers, then use deep learning methods (transformers) to summarize the "document" and use the summary as the label. A recent neural text summarization model: Text Summarization with Pretrained Encoders.
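
A minimal sketch of option (a), assuming tweet embeddings as a NumPy array and raw tweet texts; the names (`embeddings`, `texts`, `center`, `STOPWORDS`) are illustrative assumptions, not the repository's actual data structures:

```python
# Option (a) sketch: label a cluster with the most frequent non-stopword
# terms among the tweets nearest to its center.
import re
from collections import Counter

import numpy as np

# Tiny illustrative stopword list; a real list (e.g. NLTK's) would be larger.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

def term_frequency_label(embeddings, texts, center, top_k=50, top_n=3):
    """Build a label from the top_n most frequent non-stopword terms
    among the top_k tweets nearest to the cluster center."""
    # Distance of every tweet embedding to the cluster center.
    distances = np.linalg.norm(embeddings - center, axis=1)
    nearest = np.argsort(distances)[:top_k]

    counts = Counter()
    for idx in nearest:
        tokens = re.findall(r"[a-z']+", texts[idx].lower())
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)

    return " ".join(term for term, _ in counts.most_common(top_n))
```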
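
And a minimal sketch of option (b) using gensim's LDA, assuming `cluster_texts` is the list of tweets assigned to one cluster; the helper name and parameters are illustrative:

```python
# Option (b) sketch: fit a small LDA model on one cluster's tweets and
# use the top words of its dominant topic as the cluster label.
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string

def lda_label(cluster_texts, num_topics=1, top_n=3):
    """Return the top_n words of the first LDA topic as a label."""
    tokenized = [preprocess_string(text) for text in cluster_texts]
    dictionary = corpora.Dictionary(tokenized)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5, random_state=0)
    top_terms = lda.show_topic(0, topn=top_n)
    return " ".join(word for word, _ in top_terms)
```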

AbrahamSanders commented 4 years ago

Addressed in c98d87d303852f766a6b629061d103fbea77b0fa.

Implemented option (c) from above: we assemble "documents" by concatenating the top-k nearest neighbors to the cluster and sub-cluster centers. These documents are fed to a Hugging Face implementation of DistilBART (sshleifer/distilbart-xsum-12-6) for summarization.
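
A minimal sketch of this approach using the transformers summarization pipeline; `nearest_tweets` is an assumed list of the top-k tweet texts, and the actual pipeline in the repository may pre- and post-process differently:

```python
# Option (c) sketch: concatenate the nearest tweets into one "document"
# and summarize it with DistilBART; the summary acts as the cluster label.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-xsum-12-6")

def summarize_cluster(nearest_tweets, max_length=30, min_length=5):
    """Label a cluster by summarizing its nearest tweets as one document."""
    document = " ".join(nearest_tweets)
    result = summarizer(document, max_length=max_length,
                        min_length=min_length, truncation=True)
    return result[0]["summary_text"]
```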