dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

dist2() with RWMD method bug/error? #319

Closed: linneaturco closed this issue 4 years ago

linneaturco commented 4 years ago

I'm trying to use dist2() with an RWMD model as the method to calculate distances, but I keep getting an error saying that the RWMD method is an "unused argument". The package documentation still describes passing an RWMD model as the dist2() method, so I'm wondering whether the problem is on my end.

I've included code below which I hope might be helpful, and many thanks for any help you can provide.

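# Packages assumed from the functions used below
library(quanteda)   # corpus(), tokens(), dfm(), fcm()
library(text2vec)   # GlobalVectors, RelaxedWordMoversDistance, dist2()
library(magrittr)   # %>%

text_corpus <- corpus(txtstatesyears$text,
                      docvars = docvars(txtstatesyears))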
text_tokens <- tokens(text_corpus)
feats <- dfm(text_tokens, verbose = TRUE) %>% 
  dfm_trim(min_termfreq = 4, min_docfreq = .25, docfreq_type = "prop") %>% 
  featnames()

text_tokens <- tokens_select(text_tokens, feats, padding = TRUE)

text_fcm <- fcm(text_tokens, context = "window", count = "weighted", weights = 1/(1:5), tri = TRUE)

glove <- GlobalVectors$new(rank = 50, x_max = 10)
word_vector <- glove$fit_transform(text_fcm, n_iter = 10, convergence_tol = 0.01, n_threads = 8)

wv_context <- glove$components
averaged_word_vectors <- word_vector + t(wv_context)

rwmd <- RelaxedWordMoversDistance$new(text_fcm, averaged_word_vectors)

un.rwmd.l <- list()
for(i in 1993:2018){
  print(i)
  sub.i <- corpus_subset(text_corpus, year == i)
  tokens <- word_tokenizer(tolower(sub.i))
  it <- itoken(tokens)
  v <- create_vocabulary(it)
  vectorizer <- vocab_vectorizer(v)
  ir_dtm <- create_dtm(it, vectorizer)

  # This is the call that fails with the "unused argument" error
  rwmd_dist <- dist2(ir_dtm, method = rwmd, norm = "none")
  rwmd_norm <- (rwmd_dist-min(rwmd_dist))/(max(rwmd_dist)-min(rwmd_dist))
  rwmd_norm_sims <- 1 - rwmd_norm

  diag(rwmd_norm_sims) <- 0
  colnames(rwmd_norm_sims) <- docvars(sub.i, "country")
  rownames(rwmd_norm_sims) <- docvars(sub.i, "country")

  # Index by year name; un.rwmd.l[[1993]] would pad the list with 1992 NULL entries
  un.rwmd.l[[as.character(i)]] <- rwmd_norm_sims
}

dselivanov commented 4 years ago

Hi @linneaturco. Thanks for the report; it seems I need to update the documentation. Can you try the rwmd$sim2(x) / rwmd$dist2(x) methods of the RelaxedWordMoversDistance class? See the examples in ?RelaxedWordMoversDistance.
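
For readers landing on this thread, here is a minimal sketch of the method-based usage suggested above. It is modeled on the example in ?RelaxedWordMoversDistance rather than on the code in the original post. The assumptions are: the movie_review dataset bundled with text2vec stands in for the poster's corpus, GloVe is used as the class name (the post uses GlobalVectors), and the sample size and pruning threshold are arbitrary choices.

```r
library(text2vec)

# Assumption: a small sample of the bundled movie_review data stands in
# for the corpus used in the original post
data("movie_review")
docs <- movie_review$review[1:200]

tokens <- word_tokenizer(tolower(docs))
it <- itoken(tokens, progressbar = FALSE)
v <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer <- vocab_vectorizer(v)

# Build the DTM and the TCM from the same vectorizer so that the DTM columns
# and the embedding rows refer to the same terms
dtm <- create_dtm(it, vectorizer)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)

# Fit GloVe and sum main and context vectors, as in the original post
glove <- GloVe$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 10)
wv <- wv_main + t(glove$components)

# Call dist2()/sim2() as methods of the model object instead of passing
# the model to the top-level dist2() function
rwmd <- RelaxedWordMoversDistance$new(dtm, wv)
d <- rwmd$dist2(dtm[1:10, ])  # distances for the first 10 documents
s <- rwmd$sim2(dtm[1:10, ])   # corresponding similarities
dim(d)
```

In this sketch the model is constructed from a document-term matrix and queried with rows of that same matrix, which follows the package example; whether a per-year DTM with a different vocabulary can be passed directly is not covered here.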

linneaturco commented 4 years ago

Thanks @dselivanov. I am able to run rwmd$dist2(x). I should have realized sooner that the dist2() documentation was obsolete for RWMD.

dselivanov commented 4 years ago

See the example in #326.