dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

A change since the CRAN publication seems to have had a big impact on LDA #286

Closed pommedeterresautee closed 5 years ago

pommedeterresautee commented 5 years ago

I switched from the last CRAN-published version of text2vec to a source installation (changing nothing else). The LDA results are very different: the source version's results are of much lower quality (by far).

Do you have any idea what changed in LDA?

dselivanov commented 5 years ago

Not sure why you use TF-IDF; LDA is designed to work on word counts. The change related to TF-IDF is here: https://github.com/dselivanov/text2vec/commit/54bb54204e11cef7c95be38c93818a532e1d4e77
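
For readers skimming: LDA consumes raw term counts, so the dtm should go into the model untransformed. The anti-pattern being ruled out would look something like this (a sketch using text2vec's TfIdf model):

# do NOT weight the dtm before LDA - it expects raw word counts
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)  # fine for similarity tasks, wrong input for LDA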

There were changes related to inference in LDA; see #254 and https://github.com/dselivanov/text2vec/commit/2da1f3d1da4b484d064521345983f9991ec7842b. Now we draw several samples from the word-topic distribution (controlled by the n_iter_inference parameter); the CRAN version draws only 1 sample.
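
As a toy illustration of the difference (my own sketch, not the package internals): a single draw is a high-variance estimate of a token's topic, while averaging several draws concentrates around the underlying distribution.

set.seed(42)
p = c(0.6, 0.3, 0.1)  # word-topic probabilities for one token
one_draw = sample(3, size = 1, prob = p)  # old behaviour: one sample
# new behaviour, conceptually: average over several samples (n_iter_inference of them)
many_draws = tabulate(sample(3, size = 10, prob = p, replace = TRUE), nbins = 3) / 10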

How do you measure quality? I would love to see a reproducible example; maybe there is some bug.

pommedeterresautee commented 5 years ago

You are right, it's obviously not related to TF-IDF. I realized I was totally wrong just after opening the issue and modified it, but obviously it was too late... If I set n_iter_inference to 1, should I get the old behaviour? I tried, but it didn't work. I can see the difference when looking at the top document in each topic (the one with the biggest value for that topic). With the very same parameters / data / code, the selected paragraph used to be quite long whatever the topic; now they are very short, and the interpretability of the topics has degraded.

Kind regards, Michaël

Edit: making more tests, I just noticed something interesting. I call fit_transform and then transform on the same data. fit_transform alone gives approximately the same results with both the old and the new version of text2vec; only the call to transform differs between the two versions. Somehow I need to reproduce the old behaviour, where calling transform on the same data as fit_transform does not give the same result.

pommedeterresautee commented 5 years ago

With the old text2vec the results are very different, whatever the axis (except the second one in my case, where they are merely similar). I think it's related to the prior selection, but I don't yet understand how it impacts the average length of the paragraph most representative of a topic. I want to keep this behaviour, as it seems to me to have better quality.

library(text2vec)
library(magrittr)

tokens = movie_review$review[1:4000] %>% 
  tolower %>% 
  word_tokenizer
it = itoken(tokens, 
            ids = movie_review$id[1:4000], 
            progressbar = FALSE)

v = create_vocabulary(it) %>% 
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)

vectorizer = vocab_vectorizer(v)

dtm = create_dtm(it, vectorizer, type = "dgTMatrix")

# create model
lda_model = LDA$new(n_topics = 50)
doc_topic_distr = lda_model$fit_transform(x = dtm, n_iter = 1000, 
                                          convergence_tol = 0.001, n_check_convergence = 25, 
                                          progressbar = FALSE)

# call transform on the fitted model, with the same data used to fit it
new_doc_topic_distr = lda_model$transform(dtm)

# select the 10 most representative documents for topic 1
selected_old = order(doc_topic_distr[, 1], decreasing = TRUE)[1:10]
selected_new = order(new_doc_topic_distr[, 1], decreasing = TRUE)[1:10]

# compare mean document length (in words)
print(mean(stringi::stri_count_words(movie_review$review[selected_old])))
print(mean(stringi::stri_count_words(movie_review$review[selected_new])))

With the CRAN version:

> print(mean(stringi::stri_count_words(movie_review$review[selected_old])))
[1] 357.6
> print(mean(stringi::stri_count_words(movie_review$review[selected_new])))
[1] 589.1

With the latest (dev) version:

> print(mean(stringi::stri_count_words(movie_review$review[selected_old])))
[1] 264.6
> print(mean(stringi::stri_count_words(movie_review$review[selected_new])))
[1] 262.5
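
For what it's worth, document length is an indirect proxy. Here is a sketch that measures the gap directly, reusing doc_topic_distr and new_doc_topic_distr from the example above:

# total variation distance between the two doc-topic rows, per document
per_doc_gap = rowSums(abs(doc_topic_distr - new_doc_topic_distr)) / 2
summary(per_doc_gap)  # values near 0 would mean transform reproduces fit_transform
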
dselivanov commented 5 years ago

It seems there is some issue during the inference phase, but I haven't figured out where. Perplexity is higher with the new code.
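
For context, perplexity is exp of the negative mean log-likelihood per token, so higher means a worse fit. A minimal hand-rolled version (my own sketch, not text2vec's perplexity(); it assumes the dtm columns are in the same order as the topic-word matrix columns):

library(Matrix)

manual_perplexity = function(dtm, topic_word, doc_topic) {
  m = as(dtm, "TsparseMatrix")                        # triplet form: iterate observed entries only
  p = doc_topic %*% topic_word                        # n_docs x n_words word probabilities
  ll = sum(m@x * log(p[cbind(m@i + 1L, m@j + 1L)]))   # log-likelihood of the observed counts
  exp(-ll / sum(m@x))                                 # lower is better
}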

pommedeterresautee commented 5 years ago

Do you have an idea of the source of the issue? I can run some tests if required.

dselivanov commented 5 years ago

All the changes are in this commit: https://github.com/dselivanov/text2vec/commit/2da1f3d1da4b484d064521345983f9991ec7842b. It needs debugging.

dselivanov commented 5 years ago

@pommedeterresautee I've done some tests and it seems the current version produces pretty much the same results as the CRAN version; the perplexity is very similar:

# CRAN VERSION
# install.packages("text2vec", lib = "~/temp")
# library(text2vec, lib.loc = "~/temp/")

# DEV VERSION - comment this out when trying the CRAN version
library(text2vec)
data("movie_review")

it = itoken(movie_review$review[1:4000], preprocessor = tolower, tokenizer = word_tokenizer)
it2 = itoken(movie_review$review[4001:5000], preprocessor = tolower, tokenizer = word_tokenizer)

v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 20)
vv = vocab_vectorizer(v)

dtm = create_dtm(it, vv)
dtm2 = create_dtm(it2, vv)

lda = LDA$new(n_topics = 10, doc_topic_prior = 5, topic_word_prior = 0.1)
res = lda$fit_transform(dtm, n_iter = 200, convergence_tol = -1, n_check_convergence = 200)

dt_distr = lda$transform(dtm2, n_iter = 50, n_check_convergence = 50)
perplexity(dtm2, topic_word_distribution = lda$topic_word_distribution, doc_topic_distribution = dt_distr)
# ~440
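
A follow-up check that targets the reported discrepancy directly (a sketch reusing lda, dtm, and res from the script above): compare perplexity on the training matrix between the fit_transform output and a fresh transform. If transform behaved like fit_transform, the two numbers should be close.

res2 = lda$transform(dtm, n_iter = 50, n_check_convergence = 50)
perplexity(dtm, topic_word_distribution = lda$topic_word_distribution, doc_topic_distribution = res)
perplexity(dtm, topic_word_distribution = lda$topic_word_distribution, doc_topic_distribution = res2)
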
pommedeterresautee commented 5 years ago

Are you using branch 0.6? There seems to be no change in my test (code above). With CRAN, the mean length is 248 for fit_transform and 446 for transform. With branch 0.6, it's 263 and then 245. These are not random results; I get almost always the same behaviour whatever the dataset and the LDA dimension I choose (with a few exceptions depending on the choice of dimension).

pommedeterresautee commented 5 years ago

I have made many tests since my last message (including with other LDA implementations like Mallet). My feeling is that text2vec 0.5 has a bug that improves the perceived quality of the LDA. Other implementations tend to work like text2vec 0.6.

The new version tends to give much higher scores to short, pure documents (where all words relate to the same topic), whereas the old version gives high scores to long documents even if a few words are not related to the topic.

Removing short documents helps with the new version, even if it's not quite the same thing (it still picks the shortest documents it can find for very high scores on a specific topic).
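
A sketch of that filtering step, reusing the dtm and lda_model names from my earlier example (the 50-token threshold is arbitrary):

# drop documents with fewer than 50 tokens before fitting
doc_len = Matrix::rowSums(dtm)
dtm_long = dtm[doc_len >= 50, ]
doc_topic_distr = lda_model$fit_transform(x = dtm_long, n_iter = 1000,
                                          convergence_tol = 0.001, n_check_convergence = 25)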

I am closing the issue. Feel free to reopen it if you think it makes sense to work on this.

dselivanov commented 5 years ago

Thanks a lot for the report and the investigation!
