dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Multiple Errors Adapting GloVe Example to Project - Quanteda Related #334

Open sellociompi opened 2 years ago

sellociompi commented 2 years ago

Hello there,

I am having what I believe are multiple issues adapting the GloVe word embeddings tutorial to my project. I am starting from a tokens object created in Quanteda (TOK.Debates.2020.Full.Clean) to build the iterator. However, when I run the first tokenizer line, I get this warning:

Tokenizer_Debates_2020 = space_tokenizer(TOK.Debates.2020.Full.Clean)

Warning message:
In stringi::stri_split_fixed(strings, pattern = sep, ...) :
  argument is not an atomic vector; coercing

The tokenizer is created and looks like this:

[screenshot of the Tokenizer_Debates_2020 output]
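For reference, the structure can be inspected with something like this (just base R str() and lengths(), not part of the tutorial itself):

# Each element should be a character vector of tokens for one document,
# not a single word, for the downstream text2vec steps to work.
str(head(Tokenizer_Debates_2020, 3))
lengths(head(Tokenizer_Debates_2020, 3))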

I continue the example with no errors:

Iterator_Debates_2020 = itoken(Tokenizer_Debates_2020)
Vocab_Debates_2020 = create_vocabulary(Iterator_Debates_2020)
Vocab_Debates_2020 = prune_vocabulary(Vocab_Debates_2020, term_count_min = 10L)
Vectorizer_Debates_2020 = vocab_vectorizer(Vocab_Debates_2020)
TCM_Debates_2020 = create_tcm(Iterator_Debates_2020, Vectorizer_Debates_2020, skip_grams_window = 5L)

I check the dimensions of the TCM and see that I have rows and columns:

dim(TCM_Debates_2020)

[1] 9277 9277

I start to fit the model, creating the GloVe model object with no issue, but when I try to do the actual fitting I get the following error:

glove = GlobalVectors$new(rank = 50, x_max = 10)
WV_Debates_2020 = glove$fit_transform(TCM_Debates_2020, n_iter = 10, convergence_tol = 0.01, n_threads = 8)

Error in if (cost/n_nnz > 1) stop("Cost is too big, probably something goes wrong... try smaller learning rate") :
  missing value where TRUE/FALSE needed

In order to troubleshoot this error, I tried fitting on a different co-occurrence matrix (Debates2020.FCM) instead:

WV_Debates_2020 = glove$fit_transform(Debates2020.FCM, n_iter = 10, convergence_tol = 0.01, n_threads = 8)

Error in glove$fit_transform(Debates2020.FCM, n_iter = 10, convergence_tol = 0.01,  :
  all(x@x > 0) is not TRUE
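If it helps, the stored values of both matrices (the @x slot the second error refers to) can be checked with something like this; I am assuming here that both are standard Matrix sparse classes:

library(Matrix)
# GloVe expects strictly positive, finite co-occurrence counts,
# so check for NA/NaN or non-positive stored values.
summary(TCM_Debates_2020@x)
any(!is.finite(TCM_Debates_2020@x))
min(Debates2020.FCM@x)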

I have been unable to proceed further. One or more of these errors must be the culprit, but I have not been able to find documentation on them elsewhere, including in past issues catalogued here.

Thank you in advance for any help in taking out this gremlin. -Sello

jwijffels commented 2 years ago

Your Tokenizer_Debates_2020 looks like a list of individual words instead of a list of sequences of words (one token sequence per document).
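A toy example of the difference (made-up data, just to illustrate the shape itoken expects):

# What itoken()/create_tcm() expect: one character vector of tokens per document.
list_of_sequences = list(
  c("the", "first", "debate"),
  c("the", "second", "debate")
)

# What your tokenizer appears to contain: each element is a single word,
# so every "document" is one token long and there is nothing to co-occur with.
list_of_words = list("the", "first", "debate", "the", "second", "debate")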

sellociompi commented 2 years ago

@jwijffels Thank you for pointing that out. I've been trying to understand what the difference is, but I'm coming up short, unfortunately.

Would I avoid this problem if I tokenized the original corpus instead of the cleaned tokens object?

jwijffels commented 2 years ago

Did you try the following?

Iterator_Debates_2020 = itoken(TOK.Debates.2020.Full.Clean, tokenizer = space_tokenizer)
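Something along these lines should also work if the quanteda tokens object converts cleanly with as.list() (an untested sketch following the GloVe vignette, not verified on your data):

library(text2vec)

# A quanteda tokens object is already a list of token sequences,
# so it can go straight into itoken() without re-tokenizing.
it = itoken(as.list(TOK.Debates.2020.Full.Clean))
vocab = prune_vocabulary(create_vocabulary(it), term_count_min = 10L)
vectorizer = vocab_vectorizer(vocab)
tcm = create_tcm(it, vectorizer, skip_grams_window = 5L)

glove = GlobalVectors$new(rank = 50, x_max = 10)
wv_main = glove$fit_transform(tcm, n_iter = 10, convergence_tol = 0.01, n_threads = 8)
wv = wv_main + t(glove$components)  # combine main and context vectors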