dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

RelaxedWordMoversDistance on version 0.6 #326

Closed meltoner closed 4 years ago

meltoner commented 4 years ago

Hello, just to report that I got very different results using version 0.6 with the RelaxedWordMoversDistance metric than with version 0.5.1. I had to revert to the previous version in order to get results that made sense. Based on the NEWS, I understand that the RWMD implementation was changed but not tested.

dselivanov commented 4 years ago

The discrepancy is due to the fact that RWMD is not symmetric. This means rwmd(a, b) != rwmd(b, a). In 0.5.1 we took min(rwmd(a, b), rwmd(b, a)), which I believe is not entirely correct.

In 0.6 the calculation is explicit: we calculate the distance from a query document to a predefined collection of documents.
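A small sketch (not from the thread) of how the asymmetry can be observed with the 0.6 API; it assumes a document-term matrix `dtm` and a word-embedding matrix `wv` built as in the maintainer's full example further down:

```r
library(text2vec)

# Direction 1: distances from 5 query documents to the whole collection.
rwmd_ab <- RelaxedWordMoversDistance$new(x = dtm, embeddings = wv)
d_ab <- rwmd_ab$dist2(dtm[1:5, ])

# Direction 2: swap the roles of query and collection.
rwmd_ba <- RelaxedWordMoversDistance$new(x = dtm[1:5, ], embeddings = wv)
d_ba <- rwmd_ba$dist2(dtm)

# In general d_ab != t(d_ba); this asymmetry is why 0.5.1
# (which combined both directions) and 0.6 can disagree.
max(abs(d_ab - t(d_ba)))
```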

meltoner commented 4 years ago

Thank you dselivanov for the info! I will give it another go knowing this.

meltoner commented 4 years ago

I tried to use the opposite direction as follows:

rwmd_model <- RelaxedWordMoversDistance$new(dtm, wv)
rwmd_dist <- rwmd_model$sim2(dtm)

and instead of getting the distances by row:

getThresholdOrderRwmd(rwmd_dist[i, ], documentsIndex$id, floatingError, topItems)

I got them by column:

getThresholdOrderRwmd(rwmd_dist[, i], documentsIndex$id, floatingError, topItems)

but the results still don't make sense compared to the previous version 0.5.1.

I have no clue what else to try, so I will remain on version 0.5.1. If you have more clues to share I would be grateful, or in case I missed some feedback on RWMD in the latest version. In any case, thank you for this great package; I am very aware of how difficult it is to maintain. Thanks!

dselivanov commented 4 years ago

Here is how you can reproduce the 0.5.1 behaviour using 0.6:

library(text2vec)

data("movie_review")

it = itoken(movie_review$review, preprocessor = tolower, tokenizer = word_tokenizer)
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_min = 0.01, doc_proportion_max = 0.9)

vv = vocab_vectorizer(v)
dtm = create_dtm(it, vv)
tcm = create_tcm(it, vv)

set.seed(1)
lsa = LSA$new(n_topics = 16)
de = lsa$fit_transform(dtm)
wv = t(lsa$components)

rwmd = RWMD$new(x = dtm, embeddings = wv)
d12 = rwmd$dist2(dtm[1:10, ])

rwmd = RWMD$new(x = dtm[1:10, ], embeddings = wv)
d21 = rwmd$dist2(dtm)

# d will be the same as the following values
# calculated with text2vec 0.5.1 and cosine distance
# rwmd$dist2(dtm[1:10, ], dtm)
# rwmd$dist2(dtm, dtm[1:10, ])

d = pmax(d12, t(d21))

meltoner commented 4 years ago

Thank you so much!