Restrict the data analyzed during NMF/LDA

dlab-berkeley / Python-Text-Analysis-Legacy-2023

D-Lab's 12-hour introduction to text analysis with Python. Learn how to perform bag-of-words modeling, sentiment analysis, topic modeling, word embeddings, and more, using scikit-learn, NLTK, Gensim, and spaCy in Python.

Creative Commons Attribution 4.0 International


Restrict the data analyzed during NMF/LDA #20

Closed: pssachdeva closed this issue 2 years ago

pssachdeva commented 2 years ago

The 20 newsgroups dataset should be cut down to a subset before the NMF/LDA analysis (as done in the sklearn tutorial). Right now, the analysis runs on the entire dataset, which includes some nonsense entries that skew the topic models.
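One possible way to do the restriction, as a rough sketch rather than the fix that was actually merged: shuffle the documents, drop near-empty ones, and keep a fixed number. The helper name `restrict_corpus` and the `min_tokens` filter are assumptions made for illustration (the issue does not say what the "nonsense entries" look like); the default of 2000 documents follows the `n_samples` value used in the sklearn topic extraction example.

```python
import random


def restrict_corpus(docs, n_samples=2000, min_tokens=5, seed=1):
    """Return at most n_samples documents, skipping near-empty ones.

    Hypothetical helper: min_tokens is an assumed heuristic for
    filtering out "nonsense" entries, not something from the issue.
    """
    # Keep only documents with at least min_tokens whitespace-separated tokens
    kept = [d for d in docs if len(d.split()) >= min_tokens]
    # Shuffle with a fixed seed so the subset is reproducible
    rng = random.Random(seed)
    rng.shuffle(kept)
    # Truncate to the requested number of documents
    return kept[:n_samples]


# Possible usage with scikit-learn (downloads the data, so left commented out):
# from sklearn.datasets import fetch_20newsgroups
# data, _ = fetch_20newsgroups(
#     shuffle=True,
#     random_state=1,
#     remove=("headers", "footers", "quotes"),
#     return_X_y=True,
# )
# data_samples = restrict_corpus(data)
```

Stripping headers, footers, and quotes via `remove` is what the sklearn example does as well, and it likely accounts for many of the empty or near-empty posts that the token filter then drops.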