dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

RelaxedWordMoversDistance results are not symmetrical #343

Open oguzozbay opened 1 year ago

oguzozbay commented 1 year ago

I need to calculate the similarity of article titles, and I intend to use Relaxed Word Mover's Distance (RWMD) through the RelaxedWordMoversDistance class of the text2vec R package. After some trials, I noticed that the RWMD values in my output matrix, which holds the pairwise similarities of the titles, are not symmetrical.

Because I was skeptical of the result I got with my own data, I also tested the example in the vignette, which I found at the address below: https://search.r-project.org/CRAN/refmans/text2vec/html/00Index.html

I took the RelaxedWordMoversDistance example from the text2vec vignette, then modified the example code to create a larger rwms matrix as follows: rwms = rwmd_model$sim2(dtm)

The diagonal elements of the matrix are 1, but elements that mirror each other across the diagonal are not equal.
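A quick way to confirm this kind of asymmetry is to compare a matrix with its transpose. The sketch below uses a small made-up matrix `m` (hypothetical values, not my real output) standing in for `rwms`, and only base R functions:

```r
# Hypothetical 3 x 3 similarity matrix standing in for `rwms`
# (values are invented for illustration only).
m <- matrix(c(1.00, 0.42, 0.31,
              0.45, 1.00, 0.27,
              0.30, 0.29, 1.00),
            nrow = 3, byrow = TRUE)

# TRUE only if m[i, j] equals m[j, i] (within numerical tolerance) for all i, j.
is_sym <- isSymmetric(m)

# Largest absolute difference between mirrored entries.
max_gap <- max(abs(m - t(m)))

# Positions (row, column) where mirrored entries differ noticeably.
bad_pairs <- which(abs(m - t(m)) > 1e-8, arr.ind = TRUE)

print(is_sym)
print(max_gap)
print(bad_pairs)
```

If `rwms` is a dense numeric matrix, the same three calls should work on it directly; for a sparse matrix class, converting with `as.matrix()` first is probably the simplest route.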

Say that i and j are titles.
RelaxedWordMoversDistance[i,j] is not equal to RelaxedWordMoversDistance[j,i]. Is this difference normal, or am I doing something wrong? I would be grateful for any help.
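For context, the one-sided relaxation of Word Mover's Distance is directional by construction: every word in the source document sends all of its weight to the nearest word of the target document, so swapping source and target can change the value. Below is a minimal base R sketch of that idea with toy data. The embeddings, the documents, and the helper `one_sided_rwmd()` are all hypothetical, and whether text2vec computes exactly this quantity is an assumption that I have not verified.

```r
# Toy 2-dimensional word embeddings (invented values).
emb <- rbind(
  cat = c(1.0, 0.0),
  dog = c(0.8, 0.6),
  car = c(0.0, 1.0)
)

# Two toy documents as L1-normalized bag-of-words weights.
doc_a <- c(cat = 0.5, dog = 0.5, car = 0.0)
doc_b <- c(cat = 0.0, dog = 0.5, car = 0.5)

# Pairwise Euclidean distances between all words.
word_dist <- as.matrix(dist(emb))

# Hypothetical helper: one-sided relaxed WMD from `from` to `to`.
# Each word in `from` moves all of its weight to its nearest word in `to`.
one_sided_rwmd <- function(from, to, word_dist) {
  to_words <- names(to)[to > 0]
  nearest <- apply(word_dist[, to_words, drop = FALSE], 1, min)
  sum(from * nearest)
}

d_ab <- one_sided_rwmd(doc_a, doc_b, word_dist)
d_ba <- one_sided_rwmd(doc_b, doc_a, word_dist)

print(c(a_to_b = d_ab, b_to_a = d_ba))

# A symmetric variant takes the larger of the two directions.
d_sym <- max(d_ab, d_ba)
```

In this toy case the two directions differ because "cat" in document A has a close neighbour ("dog") in document B, while "car" in document B has no close neighbour in document A. Taking the maximum of the two directions gives a value that no longer depends on the order of the documents.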

The example below is copied from the vignette, "Package ‘text2vec’ November 30, 2022":

Not run:

library(text2vec)
library(rsparse)
data("movie_review")
tokens = word_tokenizer(tolower(movie_review$review))
v = create_vocabulary(itoken(tokens))
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.5)
it = itoken(tokens)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
glove_model = GloVe$new(rank = 50, x_max = 10)
wv = glove_model$fit_transform(tcm, n_iter = 5)
wv = wv + t(glove_model$components)

rwmd_model = RelaxedWordMoversDistance$new(dtm, wv)
rwms = rwmd_model$sim2(dtm[1:10, ])
head(sort(rwms[1, ], decreasing = T))

End(Not run)