dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Question: Asymptotic behavior of loglik from WarpLDA with increasing n_topics #288

Closed manuelbickel closed 5 years ago

manuelbickel commented 5 years ago

Hi,

this might be off-topic since it is not directly a programming question - so please close it if I should rather ask on Stack Exchange or the like. Since text2vec is one of the key packages that uses WarpLDA (the only one?), I still thought the expertise on this issue might be found here.

For identifying the "best" number of topics, several studies select the model with the n_topics that has the highest data likelihood, see, e.g., Griffiths and Steyvers, 2004, p.5323, Fig.3.

Probably I am missing something or my understanding is insufficient, but it seems that this approach to finding the "best" n_topics is not possible with WarpLDA. In several experiments with text2vec/WarpLDA I have run models with n_topics between 1 and 1000 and always observe an asymptotic behaviour of the loglik for larger numbers of topics, without a peak; see, for example, the upper red line in this figure (the data in this case comprised n_docs = 30,000 and vocab_size = 24,000; standard text2vec settings for the priors were used).

I know that likelihood-based selection of the "best" n_topics is not necessarily the best option (therefore we have LDAvis or, just recently, coherence in text2vec), but I would still be interested to better understand the behaviour of WarpLDA.

I would appreciate any explanations or hints that help me improve my understanding. Thank you in advance!

dselivanov commented 5 years ago

What is your experience with other LDA implementations (for example https://cran.r-project.org/web/packages/lda/)? From my experience, in practice it is almost always the case that the likelihood improves with more topics.


manuelbickel commented 5 years ago

Thank you for your quick response. I checked with the lda package, and it seems that its collapsed Gibbs sampler shows a peak whereas WarpLDA still does not. The example below (runs 1-2 minutes) with the movie review data illustrates this (assuming there are no mistakes in my code). I see the same behaviour with my own dataset.

library("lda")
library("text2vec")
data("movie_review")
N = 500
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it = itoken(tokens, ids = movie_review$id[1:N])
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2)
dtm = create_dtm(it, vocab_vectorizer(v))
# topic numbers to test
number = c(5, 50, 100, 200, 300, 400, 500, 600, 700)
logliks_lda = rep(0, length(number))
logliks_text2vec = rep(0, length(number))

for (i in 1:length(number)) {
  # note: n_topics must be number[i], not the loop index i
  lda_model = LDA$new(n_topics = number[i], doc_topic_prior = 0.1, topic_word_prior = 0.1)
  doc_topic_distr = lda_model$fit_transform(dtm, n_iter = 25, n_check_convergence = 10, progressbar = FALSE)
  logliks_text2vec[i] = max(attr(doc_topic_distr, "likelihood")[, 2])
}

dtm_ldac = as.lda_c(dtm)
for (i in 1:length(number)) {
  res = lda.collapsed.gibbs.sampler(dtm_ldac,number[i], colnames(dtm), 25, 0.1, 0.1, compute.log.likelihood=TRUE)
  logliks_lda[i] = max(res$log.likelihoods)
}

plot(number, logliks_text2vec, type = "l")
lines(number, logliks_lda, col = 2)
legend("bottomright", "text2vec in black, lda in red") 
dselivanov commented 5 years ago

To be honest, I'm not quite sure why that is - I only wrote the wrapper for this code. I have looked into the pseudo_loglikelihood calculation before, but I don't remember the details...

dselivanov commented 5 years ago

Also, I would measure any topic modeling evaluation metric on a hold-out dataset, not on the training dataset. I think it is important to make sure the model can generalize to unseen data.
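A minimal sketch of what hold-out evaluation could look like with text2vec's exported perplexity() helper (the train/test split, n_topics, and priors below are illustrative assumptions, not recommendations):

```r
library("text2vec")
data("movie_review")

# hypothetical split of the movie review data into train and hold-out sets
train_ids = 1:400
test_ids  = 401:500

it_train = itoken(word_tokenizer(tolower(movie_review$review[train_ids])))
v = prune_vocabulary(create_vocabulary(it_train), term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm_train = create_dtm(it_train, vectorizer)

# hold-out documents mapped onto the same vocabulary
it_test = itoken(word_tokenizer(tolower(movie_review$review[test_ids])))
dtm_test = create_dtm(it_test, vectorizer)

lda_model = LDA$new(n_topics = 50, doc_topic_prior = 0.1, topic_word_prior = 0.1)
invisible(lda_model$fit_transform(dtm_train, n_iter = 25, progressbar = FALSE))

# infer doc-topic distributions for the unseen documents, then score them;
# lower perplexity on the hold-out set indicates better generalization
new_doc_topic = lda_model$transform(dtm_test)
perplexity(dtm_test, lda_model$topic_word_distribution, new_doc_topic)
```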

manuelbickel commented 5 years ago

Thank you for investigating this issue and for the useful hints!

I have just remembered that I once checked the difference between LDA models from the topicmodels package vs. text2vec on SO here. While the meaning of the resulting topics seems to be similar, it appears WarpLDA removes several words with low topic probability from the topics - this might be one reason for its speed and for the loglik behaviour (not sure).

I think we can close this for now, since likelihood-based topic selection is not the best option anyway.
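For reference, the coherence route mentioned earlier can be sketched with text2vec's coherence() helper roughly as follows (n_topics, window size, and metrics below are illustrative assumptions):

```r
library("text2vec")
data("movie_review")

tokens = word_tokenizer(tolower(movie_review$review[1:500]))
it = itoken(tokens)
v = prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)

lda_model = LDA$new(n_topics = 50, doc_topic_prior = 0.1, topic_word_prior = 0.1)
invisible(lda_model$fit_transform(dtm, n_iter = 25, progressbar = FALSE))

# term co-occurrence matrix over sliding windows, used as reference statistics
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)

# top 10 terms per topic (character matrix, terms x topics)
top_terms = lda_model$get_top_words(n = 10, lambda = 1)

# intrinsic coherence per topic; higher is better for the PMI-based metrics
coherence(top_terms, tcm, metrics = c("mean_pmi", "mean_npmi"))
```

Comparing mean coherence across models with different n_topics gives a selection criterion that does not rely on the (pseudo) log-likelihood.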