derekgreene / dynamic-nmf

Dynamic Topic Modeling via Non-negative Matrix Factorization
Apache License 2.0

n-gram #5

Open zdarktknight opened 7 years ago

zdarktknight commented 7 years ago

I want to use n-grams to build my window topic model:

    python prep-text.py data/sample/month1 data/sample/month2 data/sample/month3 -o data --tfidf --norm --ngram 3
    python find-window-topics.py data/*.pkl -k 5 -o out
    python display-topics.py out/month1_windowtopics_k05.pkl out/month2_windowtopics_k05.pkl out/month3_windowtopics_k05.pkl

When I display the window topics, why do the descriptors still contain only unigrams?

derekgreene commented 7 years ago

Specifying the parameter "--ngram 3" sets the maximum n-gram size to 3, so each document will be represented by unigrams, bigrams and trigrams together. Since unigrams are far more common than longer n-grams, they will still tend to dominate the topic descriptors produced on many document collections. In the sample data, notice that some bigrams do appear in the descriptors (e.g. "prime minister").

Regards, Derek.
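
The frequency argument in the reply above can be illustrated with a small standalone sketch. It is not the code used by prep-text.py. The helper `extract_ngrams` and the three toy documents are invented for illustration, and it counts raw term occurrences instead of the TF-IDF weights that the real pipeline factorizes. It only shows that, when a document is expanded into all n-grams up to size 3, individual unigrams tend to occur more often than any individual bigram or trigram.

```python
from collections import Counter


def extract_ngrams(tokens, max_n):
    """Return every n-gram of size 1..max_n as a space-joined string.

    Hypothetical helper for illustration; not part of dynamic-nmf.
    """
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Toy documents, made up for this example.
docs = [
    "the prime minister met the finance minister",
    "the prime minister gave a speech",
    "the finance minister gave a budget speech",
]

# Count how often each term (unigram, bigram or trigram) occurs overall.
counts = Counter()
for doc in docs:
    counts.update(extract_ngrams(doc.split(), max_n=3))

# For each n-gram length, record the count of its most frequent term.
best_by_length = {}
for term, count in counts.items():
    length = len(term.split())
    best_by_length[length] = max(best_by_length.get(length, 0), count)

print(counts["minister"], counts["prime minister"], counts["the prime minister"])
print(best_by_length)
```

Longer n-grams are present in the vocabulary, but each one is rarer than the words it is built from, so a ranking of the top terms per topic will usually be headed by unigrams, with the occasional strong bigram such as "prime minister" appearing further down.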