dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

fit_transform and transform give different results #254

Closed jiunsiew closed 6 years ago

jiunsiew commented 6 years ago

Hi there,

Firstly, thanks for your efforts in creating and maintaining such a great package. I've been using it to do some topic modelling.

The 'problem' I'm finding a bit hard to understand is that when using the fit_transform and transform methods in LDA, I get different topic distributions even though the data is the same.

I'm not overly surprised that the results differ slightly, given the generative nature of the algorithm, but I was quite surprised at how different they were. In particular, when using transform, topics that previously had a probability of 0 are now non-zero in a non-trivial sense. As a result, topics that were previously non-zero have lower probabilities, and the issue is exacerbated when topic probabilities are close together.

I've tried forcing the number of iterations higher in the transform and setting the seed before calling transform but it doesn't seem to make much difference.

Is this behaviour to be expected, or am I missing something? I would have thought that fitting the same data to the same model would give very similar results.

I've attached some of the code below that recreates what I'm seeing. Thanks.

## code from http://text2vec.org/topic_modeling.html#latent_dirichlet_allocation
library(text2vec)
library(magrittr)  # for the %>% pipe
tokens = movie_review$review[1:4000] %>% 
  tolower %>% 
  word_tokenizer
it = itoken(tokens, 
            ids = movie_review$id[1:4000], 
            progressbar = FALSE)

v = create_vocabulary(it) %>% 
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)

vectorizer = vocab_vectorizer(v)

dtm = create_dtm(it, vectorizer, type = "dgTMatrix")

# create model
lda_model = LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr = 
  lda_model$fit_transform(x = dtm, n_iter = 1000, 
                          convergence_tol = 0.001, n_check_convergence = 25, 
                          progressbar = FALSE)

## dominant topic
domTopic <- max.col(doc_topic_distr, ties.method = "first")

# use transform from the old model, with same data used to fit the model previously
new_dtm = itoken(movie_review$review[1:4000], 
                 tolower, 
                 word_tokenizer, 
                 ids = movie_review$id[1:4000]) %>% 
  create_dtm(vectorizer, type = "dgTMatrix")
new_doc_topic_distr = lda_model$transform(new_dtm)

newDomTopic <- max.col(new_doc_topic_distr, ties.method = "first")

diffTopic <- which(domTopic != newDomTopic)  

## compare the two distributions
head(doc_topic_distr)
head(new_doc_topic_distr)

## closer look at the distributions where the dominant topics differ
doc_topic_distr[diffTopic[1:5],]
new_doc_topic_distr[diffTopic[1:5],]

Here are the results of the last two lines:

> doc_topic_distr[diffTopic[1:5],]
              [,1]       [,2]      [,3]      [,4]      [,5]       [,6]       [,7]       [,8]      [,9]      [,10]
2381_9  0.27272727 0.01515152 0.3939394 0.0000000 0.1060606 0.07575758 0.13636364 0.00000000 0.0000000 0.00000000
7759_3  0.00000000 0.13496933 0.1226994 0.2576687 0.0000000 0.01840491 0.11656442 0.04294479 0.0000000 0.30674847
9495_8  0.07262570 0.07821229 0.1229050 0.2513966 0.0000000 0.00000000 0.09497207 0.14525140 0.1675978 0.06703911
10633_1 0.07272727 0.25454545 0.3090909 0.1818182 0.0000000 0.00000000 0.05454545 0.12727273 0.0000000 0.00000000
8713_10 0.00000000 0.06666667 0.0000000 0.0000000 0.0000000 0.06666667 0.33333333 0.53333333 0.0000000 0.00000000
> new_doc_topic_distr[diffTopic[1:5],]
              [,1]        [,2]       [,3]        [,4]         [,5]        [,6]       [,7]         [,8]        [,9]       [,10]
2381_9  0.32985075 0.001492537 0.31492537 0.001492537 0.1507462687 0.061194030 0.13582090 0.0014925373 0.001492537 0.001492537
7759_3  0.03719512 0.189634146 0.06768293 0.287195122 0.0006097561 0.031097561 0.06768293 0.0006097561 0.037195122 0.281097561
9495_8  0.10055556 0.245000000 0.09500000 0.150555556 0.0005555556 0.067222222 0.05055556 0.0727777778 0.150555556 0.067222222
10633_1 0.23392857 0.269642857 0.21607143 0.001785714 0.0196428571 0.001785714 0.01964286 0.1982142857 0.037500000 0.001785714
8713_10 0.00625000 0.318750000 0.00625000 0.006250000 0.0062500000 0.193750000 0.44375000 0.0062500000 0.006250000 0.006250000
dselivanov commented 6 years ago

Thanks for reporting. There are 2 issues:

  1. In transform we add the doc-topic prior before normalizing, and in fit_transform we don't. This is a bug and I will fix it.
  2. The initial state during fitting and inference depends on the random state, and at the moment they differ. I will try to fix this as well.
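The first issue explains the symptom in the report above: adding the prior before normalizing turns exact zeros into small non-zero probabilities and shrinks the others. A minimal sketch of that arithmetic in base R, using hypothetical per-document topic counts and the doc_topic_prior = 0.1 from the example (this is not text2vec's actual internal code):

```r
# Hypothetical topic-assignment counts for one document after sampling;
# three topics were never assigned, so their counts are zero.
counts <- c(18, 0, 26, 0, 7, 5, 9, 0)

# fit_transform-style normalization: no prior added, zeros stay zero.
p_fit <- counts / sum(counts)

# transform-style normalization: the doc-topic prior is added first,
# so every topic gets non-zero mass and the rest are slightly deflated.
doc_topic_prior <- 0.1
p_trans <- (counts + doc_topic_prior) / sum(counts + doc_topic_prior)

p_fit
p_trans
```

Note how the zero-count topics end up at roughly prior / (N + K * prior), which matches the small constant values (e.g. ~0.0015, ~0.0006) visible in the transform output above.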
dselivanov commented 6 years ago

@jiunsiew please try development version

jiunsiew commented 6 years ago

@dselivanov looks good now. Really appreciate the quick fix. Great stuff!