chocoluffy / redditQA

Explore some interesting NLP experiments with reddit comments data.

9.26 meeting #1

Open chocoluffy opened 7 years ago

chocoluffy commented 7 years ago

☑️ try different numbers of topics, N = 500
☑️ use tf-idf vectors instead of BOW vectors
☑️ print the top 10 words for each topic
☑️ examine the bipartite graph of users and communities and, by propagating the labels, find the percentage of specialists and generalists in each subreddit community (examining only the most active users); further, check whether the comments are written by old or new users
[ ] take log into account in 1. plotting.

chocoluffy commented 7 years ago

Regarding tf-idf vector weights: it seems that gensim's LDA training only accepts a BOW matrix (which we know is a drawback, because it treats every word equally instead of giving different words different weights as tf-idf does). So I use a workaround. I implemented a filtering step that drops words occurring in fewer than 20 documents or in more than 50% of the documents (similar in effect to tf-idf). Other than that, I treat all remaining words equally and feed them to the LDA.
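A minimal sketch of that filtering step, in plain Python so it runs without any dependencies. The function name `filter_vocabulary` and its parameter names are made up for illustration and are not taken from the repo's script; the thresholds are the ones mentioned above (20 documents, 50%). gensim's `Dictionary.filter_extremes(no_below=20, no_above=0.5)` does the same kind of document-frequency filtering.

```python
from collections import Counter


def filter_vocabulary(docs, no_below=20, no_above=0.5):
    """Drop tokens that are too rare or too common across documents.

    docs      -- list of tokenized documents (each a list of strings)
    no_below  -- keep a token only if it appears in at least this many documents
    no_above  -- keep a token only if it appears in at most this fraction of documents
    (hypothetical helper, named here for illustration only)
    """
    # Document frequency: count each token once per document.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    max_docs = no_above * len(docs)
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

    # Surviving tokens keep their raw counts, i.e. they are all weighted equally.
    filtered = [[tok for tok in tokens if tok in keep] for tokens in docs]
    return filtered, keep
```

The filtered token lists can then be turned into BOW vectors and passed to the LDA as before.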

So far, the data preprocessing I have implemented is in this script:

A great tutorial from the official gensim documentation.

chocoluffy commented 7 years ago

Calculating pairwise topic similarity: Hellinger Distance. Reference here.
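For reference, a minimal sketch of the pairwise computation, assuming each topic is given as a probability distribution over the same vocabulary (a list of non-negative floats summing to 1). The names `hellinger` and `pairwise_hellinger` are illustrative, not from the repo. gensim also provides a Hellinger helper in `gensim.matutils`.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    H(p, q) = sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) / sqrt(2), which lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def pairwise_hellinger(topics):
    """Symmetric matrix of Hellinger distances between all pairs of topics."""
    n = len(topics)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = hellinger(topics[i], topics[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

A distance of 0 means two topics have identical word distributions, and 1 means they share no words with non-zero probability, so small off-diagonal entries point to near-duplicate topics.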