derekgreene / dynamic-nmf

Dynamic Topic Modeling via Non-negative Matrix Factorization
Apache License 2.0
282 stars 87 forks

Formula that permits to compute "model coherence" #3

Open TheEdoardo93 opened 7 years ago

TheEdoardo93 commented 7 years ago

Hello everyone! Thanks for the attention!

I'm using this library for a university project whose goal is to analyze the topics in Twitter data. I have also used the LDA algorithm to discover topics in tweets, and now I'm interested in using this Dynamic Topic Modeling approach.

My question: when I run steps 2 and 3 of "Advanced Usage", the library reports a model coherence value (e.g. model coherence = 0.5923), but I don't understand which formula is used to compute it. For example, alongside output such as "Top recommendations for number of topics for 'month1': 6,5,9", the library also prints a model coherence value. What formula is used for this purpose?

P.S. If my question is not clear enough, I can rewrite it more precisely.

Again, thanks a lot for the attention!

Edoardo, an Italian Computer Science student!

derekgreene commented 7 years ago

Hi Edoardo,

The coherence measure implemented is TC-W2V, which was originally proposed in this paper: http://www.sciencedirect.com/science/article/pii/S0957417415001633

The measure involves building a Word2vec model from the full corpus (or a relevant background corpus). For a single topic produced by NMF, the coherence score is the mean pairwise Cosine similarity between the vectors corresponding to the top terms describing the topic. For a full topic model, we compute the mean coherence across all K topics.
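As a rough illustration of the description above (a sketch only, not the library's actual code), the computation can be written in a few lines of Python. The function names and the hand-made 2-d vectors are assumptions for the example; in practice the vectors would come from the trained Word2vec model.

```python
from itertools import combinations

import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def topic_coherence(terms, vectors):
    """Mean pairwise cosine similarity over the top terms of one topic."""
    pairs = list(combinations(terms, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def model_coherence(topics, vectors):
    """Mean topic coherence across all K topics of a model."""
    return sum(topic_coherence(t, vectors) for t in topics) / len(topics)


# Toy stand-in for Word2vec embeddings (hypothetical values).
vectors = {
    "goal": np.array([1.0, 0.0]),
    "match": np.array([1.0, 0.0]),
    "vote": np.array([0.0, 1.0]),
    "election": np.array([0.0, 1.0]),
}

coherent = [["goal", "match"], ["vote", "election"]]
mixed = [["goal", "vote"], ["match", "election"]]

print(model_coherence(coherent, vectors))  # topics whose terms point the same way
print(model_coherence(mixed, vectors))     # topics whose terms are orthogonal
```

Topics whose top terms have similar vectors score close to 1, while topics that mix unrelated terms score close to 0.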

Regards, Derek.

TheEdoardo93 commented 7 years ago

Suppose this scenario:

After building the word2vec model (step 1), you have to run step 2 and step 3.

STEP 2: Suppose you consider "month1".

STEP 3: Suppose the library returns this:

Thanks for the attention! Edoardo

derekgreene commented 7 years ago

Hi Edoardo,

There are 2 stages to model selection:

Regards, Derek.

TheEdoardo93 commented 7 years ago

Two questions:

1) In the first stage, when selecting the number of window topics for each individual time window, which formula is used? For example, is the value for K=3 in month1 computed as the mean pairwise cosine similarity between the vectors corresponding to the top terms describing the topic? I understand that the K with the highest value is then selected.

2) In the second stage, which formula is used to compute the optimal number of dynamic topics? Again, I understand that the K with the highest value is selected.

derekgreene commented 7 years ago

Regarding 1, yes, this is correct. Regarding 2, the same formula is used again to rank the possible values of the number of dynamic topics K. In this case, we measure the mean TC-W2V topic coherence for the top terms describing the final set of dynamic topics for each value of K.
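To make the ranking step concrete, here is a minimal sketch, assuming each candidate K has already produced a list of topics (each topic being a list of top terms). The names `rank_k` and `models`, and the toy vectors, are hypothetical and not part of the library.

```python
from itertools import combinations

import numpy as np


def mean_pairwise_cosine(terms, vectors):
    """TC-W2V style coherence for one topic: mean pairwise cosine similarity."""
    sims = []
    for a, b in combinations(terms, 2):
        u, v = vectors[a], vectors[b]
        sims.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return float(np.mean(sims))


def rank_k(models, vectors, top=3):
    """Score each candidate K by mean topic coherence and return the best ones."""
    scores = {
        k: float(np.mean([mean_pairwise_cosine(t, vectors) for t in topics]))
        for k, topics in models.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)[:top]
    return ranking, scores


# Toy stand-in for Word2vec embeddings (hypothetical values).
vectors = {
    "goal": np.array([1.0, 0.0]),
    "match": np.array([1.0, 0.0]),
    "vote": np.array([0.0, 1.0]),
    "election": np.array([0.0, 1.0]),
}

# Hypothetical topic descriptors produced for each candidate K.
models = {
    2: [["goal", "match"], ["vote", "election"]],
    3: [["goal", "match"], ["vote", "election"], ["goal", "vote"]],
    4: [["goal", "vote"], ["match", "election"],
        ["goal", "election"], ["match", "vote"]],
}

ranking, scores = rank_k(models, vectors)
print("Top recommendations for number of topics:", ranking)
print(scores)
```

The same function applies to both stages: in the first stage `models` would hold the window topics for one time window, and in the second stage the dynamic topics for each candidate K.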

sagarpitale95 commented 4 years ago

@derekgreene Can the NMF model calculate the coherence score based on TF-IDF instead of Word2Vec?