dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

How to cope with multiple documents with GloVe? #312

Closed hope-data-science closed 4 years ago

hope-data-science commented 4 years ago

The vignette (http://text2vec.org/glove.html) gives an excellent example of handling a single large document. But if we have multiple documents, how can we combine them and train GloVe on all of them?

If we simply concatenate the documents first, we might introduce noise because the skip-gram window would run across document boundaries, which is undesirable. Is there a way to handle this nicely? Thanks.
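For concreteness, a minimal sketch of the two options using text2vec's bundled movie_review data (tokenize_words() is from the tokenizers package):

library(text2vec)
library(tokenizers)

data("movie_review")

# Option A: naive merge - one giant string, so skip-gram windows
# would run across review boundaries
merged = paste(movie_review$review, collapse = " ")

# Option B: keep one tokenized vector per review; itoken() accepts
# the list, and downstream windows stay within each document
tokens = tokenize_words(movie_review$review)
it = itoken(tokens, progressbar = FALSE)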

hope-data-science commented 4 years ago

I've worked it out, but I don't know if it's correct. The code is listed below:

library(pacman)
p_load(tidyverse, data.table, text2vec, tokenizers)

data("movie_review")

# tokenize_ngrams() returns a list with one character vector per review,
# so each document remains a separate element
movie_review %>% 
  as_tibble() %>% 
  select(id, review) %>% 
  mutate(ngram = tokenize_ngrams(review, n = 2)) %>% 
  pull(ngram) -> tokens

it = itoken(tokens, progressbar = FALSE)
vocab = create_vocabulary(it)

# drop rare terms
vocab = prune_vocabulary(vocab, term_count_min = 5L)

# use our filtered vocabulary
vectorizer = vocab_vectorizer(vocab)
# use a window of 5 for context words; the window never crosses
# document boundaries because each review is its own list element
tcm = create_tcm(it, vectorizer, skip_grams_window = 5L)

# NOTE: newer text2vec versions (>= 0.6) use the signature
# GlobalVectors$new(rank = 50, x_max = 10)
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
# the `glove` object will be modified by the `fit_transform()` call!
wv_main = fit_transform(tcm, glove, n_iter = 10)

wv_context = glove$components
dim(wv_context)
word_vectors = wv_main + t(wv_context)
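A quick sanity check on the combined vectors: nearest neighbours by cosine similarity with text2vec's sim2(). A minimal sketch, assuming the bigram "good movie" survived vocabulary pruning:

# look up the row for one bigram and compare it against all vectors
good = word_vectors["good movie", , drop = FALSE]
cos_sim = sim2(x = word_vectors, y = good, method = "cosine", norm = "l2")
# top 5 most similar bigrams (the query itself will rank first)
head(sort(cos_sim[, 1], decreasing = TRUE), 5)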

dselivanov commented 4 years ago

This is correct. But it is not easy to read non-standard R code like pull(ngram) -> tokens.
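For reference, the same pipeline with conventional left assignment reads:

# identical to the original, but assigns left-to-right as usual
tokens = movie_review %>% 
  as_tibble() %>% 
  select(id, review) %>% 
  mutate(ngram = tokenize_ngrams(review, n = 2)) %>% 
  pull(ngram)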