Open zdarktknight opened 7 years ago
Specifying the parameter "--ngram 3" sets the maximum n-gram size to 3, so each document will be represented by unigrams, bigrams and trigrams. Because unigrams occur far more often than longer n-grams, they will still tend to dominate the topic descriptors produced for many document collections. In the sample data, notice that some bigrams do appear in the descriptors (e.g. "prime minister").
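To make the frequency point concrete, here is a rough sketch in plain Python. It is not the code used by prep-text.py; the helper name extract_ngrams and the toy documents are made up for illustration, and no TF-IDF weighting or normalization is applied. It only shows that when you extract all n-grams up to size 3, a unigram such as "minister" is counted at least as often as any longer n-gram that contains it.

```python
from collections import Counter


def extract_ngrams(tokens, max_n=3):
    """Return every n-gram of length 1..max_n as a space-joined string.

    Hypothetical helper for illustration only; the real preprocessing
    script may tokenize and filter terms differently.
    """
    ngrams = []
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


# Toy documents (invented here, not taken from the sample data).
docs = [
    "the prime minister met the finance minister",
    "the prime minister gave a speech",
    "a minister resigned after the speech",
]

# Raw term counts over the whole toy collection.
counts = Counter()
for doc in docs:
    counts.update(extract_ngrams(doc.split(), max_n=3))

# Compare a unigram with the bigram and trigram that contain it.
for term in ("minister", "prime minister", "the prime minister"):
    print(term, counts[term])
```

Every occurrence of "prime minister" is also an occurrence of "minister", but not the other way round, so the unigram count can never be lower. The same effect is likely to carry over to term weights in many collections, which is why single words tend to rank highest in the descriptors.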
Regards, Derek.
I want to use n-grams to build my window topic model:
python prep-text.py data/sample/month1 data/sample/month2 data/sample/month3 -o data --tfidf --norm --ngram 3
python find-window-topics.py data/*.pkl -k 5 -o out
python display-topics.py out/month1_windowtopics_k05.pkl out/month2_windowtopics_k05.pkl out/month3_windowtopics_k05.pkl
When I display the window topics, why are they still unigrams?