dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

How to reproduce GloVe results in R #251

Closed hutaohutc closed 6 years ago

hutaohutc commented 6 years ago

I tried calling set.seed() in R, but it did not help: I cannot reproduce the result. Could you please tell me how to make the results reproducible?

dselivanov commented 6 years ago

Hi. Results are reproducible only if you use 1 thread. With more threads they are not reproducible, because fitting is done via asynchronous SGD without locks (so there are race conditions).
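To make the point about update order concrete, here is a minimal single-threaded sketch in base R. It is not text2vec's fitting code: `sgd_pass` and the toy objective are hypothetical, chosen only to show that SGD updates applied in a different order give a different result. With several threads and no locks, the order in which updates land is not fixed from run to run.

```r
# Hypothetical toy example, not text2vec's implementation.
# One plain SGD pass over `xs`, minimising 0.5 * (w - x)^2 for each sample x.
sgd_pass <- function(xs, w = 0, lr = 0.5) {
  for (x in xs) {
    # the gradient of 0.5 * (w - x)^2 with respect to w is (w - x)
    w <- w - lr * (w - x)
  }
  w
}

samples <- c(1, 2, 3)

# Same data, same starting point, same learning rate: only the order differs.
w_forward <- sgd_pass(samples)
w_reverse <- sgd_pass(rev(samples))

identical(w_forward, w_reverse)
# FALSE
```

Each update depends on the current value of `w`, so changing the order changes every later step. Forcing a single thread removes this source of variation.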

hutaohutc commented 6 years ago

Thank you for your answer. However, when I use n_threads = 1 in the fit_transform function, I still cannot reproduce the result. Here is my code:

set.seed(42)
wv_main = glove$fit_transform(tcm, n_iter = 50, convergence_tol = 0.01, n_threads = 1)

dselivanov commented 6 years ago

Actually, it seems n_threads has no effect (I forgot to set the number of threads equal to n_threads). You can call RcppParallel::setThreadOptions(1) before initializing the model:

library(text2vec)
# movie_review ships with text2vec, so load the package first
data("movie_review")
it = itoken(movie_review$review, tolower, word_tokenizer)
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 20)
tcm = create_tcm(it, vocab_vectorizer(v))

# force single-threaded fitting (n_threads alone has no effect here)
RcppParallel::setThreadOptions(1)

# first run: set the seed right before creating the model
set.seed(42)
gl = GloVe$new(word_vectors_size = 50, x_max = 10, vocabulary = v, shuffle = FALSE)
temp1 = gl$fit_transform(tcm, n_iter = 2, n_threads = 1)

# second run: reset the seed before creating the model again
set.seed(42)
gl = GloVe$new(word_vectors_size = 50, x_max = 10, vocabulary = v, shuffle = FALSE)
temp2 = gl$fit_transform(tcm, n_iter = 2, n_threads = 1)

identical(temp1, temp2)
# TRUE

tobiasblasberg commented 6 years ago

Hi, I still could not reproduce the vectors, even though I set the number of threads with RcppParallel::setThreadOptions(1) and the seed as recommended. identical(temp1, temp2) still returns FALSE.

dselivanov commented 6 years ago

I updated the example above: call set.seed(42) before each model initialization.
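A small base R sketch of why the seed has to be reset each time. It assumes the model draws its initial word vectors from R's random number generator when it is created. `init_vectors` is a hypothetical stand-in for that step, not a text2vec function.

```r
# Hypothetical stand-in for a model's random initialisation (not text2vec code).
init_vectors <- function(n_words, dim) {
  matrix(runif(n_words * dim, min = -0.5, max = 0.5),
         nrow = n_words, ncol = dim)
}

set.seed(42)
a <- init_vectors(5, 3)

# Reseeding restarts the random stream, so the same numbers are drawn again.
set.seed(42)
b <- init_vectors(5, 3)

# No reseeding here: the stream continues from where `b` left off.
c_no_reseed <- init_vectors(5, 3)

identical(a, b)
# TRUE
identical(a, c_no_reseed)
# FALSE
```

Seeding once at the top of a script is therefore not enough when two models are created one after the other: the second model starts from a different point in the random stream.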