dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

parallel version of create_tcm doesn't work #296

Closed DavidArenburg closed 5 years ago

DavidArenburg commented 5 years ago

Reproducible example from the docs

library(text2vec)
data("movie_review")

# set to number of cores on your machine
N_WORKERS = 4
if(require(doParallel)) registerDoParallel(N_WORKERS)
splits = split_into(movie_review$review, N_WORKERS)
jobs = lapply(splits, itoken, tolower, word_tokenizer)
v = create_vocabulary(jobs)
# Warning message:
#   'create_vocabulary.list' is deprecated.
# Use 'create_vocabulary.itoken_parallel()' instead.
# See help("Deprecated") 

vectorizer = vocab_vectorizer(v)
jobs = lapply(splits, itoken, tolower, word_tokenizer)
tcm = create_tcm(jobs, vectorizer, skip_grams_window = 3L, skip_grams_window_context = "symmetric")
# Error in UseMethod("create_tcm") : 
#   no applicable method for 'create_tcm' applied to an object of class "list"

It looks like jobs is supposed to be something other than a list, but I can't seem to find how to create it otherwise.

sessionInfo()
# R version 3.5.1 (2018-07-02)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
# [1] LC_COLLATE=English_Israel.1252  LC_CTYPE=English_Israel.1252    LC_MONETARY=English_Israel.1252 LC_NUMERIC=C                   
# [5] LC_TIME=English_Israel.1252    
# 
# attached base packages:
# [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] text2vec_0.5.1    doParallel_1.0.14 iterators_1.0.10  foreach_1.4.4    
# 
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.0           lattice_0.20-35      codetools_0.2-15     digest_0.6.18        grid_3.5.1           R6_2.3.0             futile.options_1.0.1
# [8] formatR_1.5          RcppParallel_4.4.2   data.table_1.11.8    futile.logger_1.4.3  Matrix_1.2-14        lambda.r_1.2.3       tools_3.5.1         
# [15] mlapi_0.1.0          compiler_3.5.1      
dselivanov commented 5 years ago

Thanks for reporting. Unfortunately, I will not fix this: all high-level parallel computing will be dropped on Windows in the next release. Please consider using the serial version; it is not much slower than the parallel one on Windows.
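
For readers following along, the serial route suggested above might look like the sketch below. It assumes text2vec 0.5.1 (the version in the report) and simply reuses the calls from the original example with a single itoken iterator instead of a list of jobs; it is an illustration, not code posted by the maintainer.

```r
library(text2vec)
data("movie_review")

# One serial iterator over the reviews; no doParallel backend is registered.
it = itoken(movie_review$review, preprocessor = tolower, tokenizer = word_tokenizer)

v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)

# Same window settings as in the report above.
tcm = create_tcm(it, vectorizer, skip_grams_window = 3L,
                 skip_grams_window_context = "symmetric")
dim(tcm)
```

The same iterator is passed to both create_vocabulary and create_tcm, so no second lapply over splits is needed.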

DavidArenburg commented 5 years ago

OK, that's fine. I had glove$fit_transform crashing RStudio, so I thought I'd need to parallelise, but eventually setting n_chunks to a higher value solved the issue.
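
The comment does not say where n_chunks was set. A plausible reading, and only an assumption here, is the n_chunks argument of itoken, which controls how many pieces the input is split into. A minimal sketch under that assumption (the value 50 and the name it_chunked are arbitrary, chosen for illustration):

```r
library(text2vec)
data("movie_review")

# Assumption: n_chunks refers to itoken(); more chunks means smaller batches
# are tokenized and processed at a time. 50 is an arbitrary example value.
it_chunked = itoken(movie_review$review, preprocessor = tolower,
                    tokenizer = word_tokenizer, n_chunks = 50)

v_chunked = create_vocabulary(it_chunked)
nrow(v_chunked)
```

Whether this is what resolved the crash in the original case is not stated in the thread.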

Thanks for the package, by the way. You are doing a great job. Are you planning to add word2vec too, or are you leaving that to the wordVectors package?

dselivanov commented 5 years ago

GloVe and word2vec usually give very similar results, so I don't see much value in working on it.

RezaSadeghiWSU commented 5 years ago

I faced a similar issue on Ubuntu. Do you have any suggestions?

Regards, Reza

dselivanov commented 5 years ago

@RezaSadeghiWSU please provide a reproducible example; otherwise I can't help.

The following code works on my Ubuntu machine with text2vec 0.5.1:

library(text2vec, lib.loc = "~/temp/")
data("movie_review")

# set to number of cores on your machine
N_WORKERS = 4
if(require(doParallel)) registerDoParallel(N_WORKERS)
jobs = itoken_parallel(movie_review$review, tolower, word_tokenizer, n_chunks = N_WORKERS, ids = movie_review$id)
v = create_vocabulary(jobs)
vectorizer = vocab_vectorizer(v)
tcm = create_tcm(jobs, vectorizer, skip_grams_window = 3L, skip_grams_window_context = "symmetric")