9.26 meeting

chocoluffy commented 7 years ago

☑️ try different numbers of topics, N = 500
☑️ use tf-idf vectors instead of BOW vectors
☑️ print the top 10 words for each topic
☑️ examine the bi-graph of users and communities and, by propagating the labels, eventually find the percentage of specialists and generalists for each subreddit community (only examining the most active users); further, check whether the comments are written by old or new users
[ ] take log into account in 1. plotting.

chocoluffy commented 7 years ago

Regarding tf-idf vector weights: it seems that gensim's LDA training only accepts a BOW matrix (which we know is bad because it treats every word equally, unlike tf-idf, which gives different weights to different words). So I use a workaround: I implemented a filtering mechanism that removes words occurring in fewer than 20 documents or in more than 50% of the documents (similar in effect to tf-idf). Beyond that, I treat all remaining words equally and feed them to the LDA.

So far, the data preprocessing implemented in the script:

Remove stopwords and tokens with length < 1.
Remove punctuation.
Tokenization. Use the WordNet lemmatizer from NLTK instead of a normal stemmer.
Add a bigram to the documents if it appears more than 20 times.
Filter extremes (similar to tf-idf): filter out words that occur in fewer than 20 documents or in more than 50% of the documents.
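For reference, the steps above can be sketched roughly as follows. This is not the actual script: it is a minimal, standard-library-only illustration. The stopword list is a tiny placeholder, the function names (`tokenize`, `add_bigrams`, `filter_extremes`) are made up for this sketch, lemmatization is left out because it needs NLTK, and bigrams are selected by raw count only (whether the boundary is "more than" or "at least" 20 is an assumption here). In gensim, the last step should correspond to `Dictionary.filter_extremes(no_below=20, no_above=0.5)`.

```python
import string
from collections import Counter

# Placeholder stopword list for illustration only (assumption, not the real list).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOPWORDS]


def add_bigrams(docs, min_count=20):
    """Append 'w1_w2' tokens for adjacent pairs seen at least min_count times overall."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, c in counts.items() if c >= min_count}
    return [
        doc + [f"{a}_{b}" for a, b in zip(doc, doc[1:]) if (a, b) in frequent]
        for doc in docs
    ]


def filter_extremes(docs, no_below=20, no_above=0.5):
    """Keep words that occur in at least no_below documents
    and in at most a fraction no_above of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    n_docs = len(docs)
    keep = {
        w for w, c in doc_freq.items()
        if c >= no_below and c / n_docs <= no_above
    }
    return [[w for w in doc if w in keep] for doc in docs]
```

The filtered token lists can then be turned into BOW vectors and passed to LDA, so every surviving word is weighted by its raw count only.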

Great tutorial from gensim officials.

chocoluffy commented 7 years ago

Calculating pairwise topic similarity: Hellinger Distance. Reference here.
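For context, the Hellinger distance between two discrete distributions P and Q is H(P, Q) = (1/√2) · sqrt(Σ (√p_i − √q_i)²), which lies between 0 (identical) and 1 (no overlap). Below is a minimal sketch in plain Python, assuming each topic is given as a probability vector over the same vocabulary; the function names are illustrative and not taken from the script. gensim also ships `gensim.matutils.hellinger`, which could be used instead.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def pairwise_hellinger(topics):
    """Symmetric matrix of Hellinger distances between all pairs of topics."""
    n = len(topics)
    return [[hellinger(topics[i], topics[j]) for j in range(n)] for i in range(n)]
```

For example, `pairwise_hellinger(topic_word_rows)` (with a hypothetical `topic_word_rows` holding one word distribution per topic) returns a matrix with zeros on the diagonal and larger values for more dissimilar topics.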

chocoluffy / redditQA

9.26 meeting #1