dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Parallelization Issue #205

Closed leungi closed 6 years ago

leungi commented 7 years ago

Hi,

First off, I'm glad to have found this package! Kudos on the focus on speed and ease of use.

I tried parallelizing create_dtm and got output accompanied by an error message:

Error in if (msg.if.not.empty && is.list(dn) && length(dn) >= 2 && is.character(cn <- dn[[2]]) && : missing value where TRUE/FALSE needed

The dtm is mostly empty, and I initially suspected it had to do with my data. However, I got similar output when running the example case:

data("movie_review")

it = itoken_parallel(movie_review$review[1:100], n_chunks = N_WORKERS)

system.time(dtm <- create_dtm(it, hash_vectorizer(2**16), type = 'dgTMatrix'))

user system elapsed

0.04 0.11 2.30

dtm

100 x 65536 sparse Matrix of class "dgTMatrix"

Error in if (msg.if.not.empty && is.list(dn) && length(dn) >= 2 && is.character(cn <- dn[[2]]) && :

missing value where TRUE/FALSE needed

Look forward to your insights. Ivan

dselivanov commented 7 years ago

Hi. Thanks for reporting. Could you provide a fully reproducible example with the movie_review dataset? Also, I need to know your sessionInfo().

leungi commented 7 years ago

Thanks for the prompt response Dmitriy.

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3 dplyr_0.7.2 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3 tibble_1.3.4 ggplot2_2.2.1.9000
[10] tidyverse_1.1.1 data.table_1.10.4 text2vec_0.5.0

loaded via a namespace (and not attached):
[1] reshape2_1.4.2 splines_3.3.3 haven_1.0.0 lattice_0.20-34 colorspace_1.3-2 stats4_3.3.3 mgcv_1.8-17 rlang_0.1.1
[9] ModelMetrics_1.1.0 nloptr_1.0.4 foreign_0.8-67 glue_1.1.1 readxl_1.0.0 lambda.r_1.1.9 modelr_0.1.1 bindrcpp_0.2
[17] plyr_1.8.4 bindr_0.1 stringr_1.2.0 MatrixModels_0.4-1 cellranger_1.1.0 munsell_0.4.3 gtable_0.2.0 futile.logger_1.4.3
[25] rvest_0.3.2 codetools_0.2-15 psych_1.7.5 forcats_0.2.0 SparseM_1.76 caret_6.0-73 quantreg_5.29 pbkrtest_0.4-7
[33] broom_0.4.2 Rcpp_0.12.12 scales_0.4.1.9002 RcppParallel_4.3.20 jsonlite_1.5 lme4_1.1-12 mnormt_1.5-5 hms_0.3
[41] digest_0.6.12 stringi_1.1.5 grid_3.3.3 tools_3.3.3 magrittr_1.5 lazyeval_0.2.0 futile.options_1.0.0 car_2.1-4
[49] pkgconfig_2.0.1 MASS_7.3-45 Matrix_1.2-8 xml2_1.1.1 lubridate_1.6.0 assertthat_0.2.0 minqa_1.2.4 httr_1.2.1
[57] R6_2.2.2 compiler_3.3.3 nnet_7.3-12 nlme_3.1-131

data("movie_review")

it = itoken_parallel(movie_review$review[1:100], n_chunks = N_WORKERS)

system.time(dtm <- create_dtm(it, hash_vectorizer(2**16), type = 'dgTMatrix'))

user system elapsed

0.04 0.11 2.30

dtm

100 x 65536 sparse Matrix of class "dgTMatrix"

Error in if (msg.if.not.empty && is.list(dn) && length(dn) >= 2 && is.character(cn <- dn[[2]]) && :

missing value where TRUE/FALSE needed

Ivan
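For anyone trying to reproduce this: the snippet as posted never defines N_WORKERS or registers a parallel backend. A self-contained version might look like the sketch below. The worker count of 2 and the use of doParallel's registerDoParallel() are assumptions (doParallel does appear in the sessionInfo above), and the behaviour is that of text2vec 0.5.0 and may differ in later releases.

```r
library(text2vec)
library(doParallel)

# Assumption: two workers; the thread never states the value of N_WORKERS.
N_WORKERS <- 2
registerDoParallel(N_WORKERS)

data("movie_review")

# Split the first 100 reviews into one chunk per worker.
it <- itoken_parallel(movie_review$review[1:100], n_chunks = N_WORKERS)

# Build a hashed document-term matrix in triplet (dgTMatrix) form.
dtm <- create_dtm(it, hash_vectorizer(2^16), type = "dgTMatrix")

# Inspect the dimensions instead of printing the whole object,
# since printing is where the reported error appears.
dim(dtm)

# Release the implicitly created cluster.
stopImplicitCluster()
```

Calling dim(dtm) rather than printing dtm makes it possible to check whether the matrix itself was built, separately from whatever happens in the print method.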


dselivanov commented 6 years ago

Thanks for reporting. I'm very sorry that it took so long to fix. The issue was not related to parallel processing: there was a minor mistake where character(0) was used instead of NULL for empty column names in the dtm. This caused an error during printing, but did not affect anything else.
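The distinction between character(0) and NULL mentioned above can be shown in base R alone. The sketch below is only an illustration, not the actual text2vec patch, and normalize_names() is a hypothetical helper introduced here.

```r
# Two ways of saying "no column names".
absent <- NULL
empty  <- character(0)

is.null(absent)       # TRUE:  names are absent
is.null(empty)        # FALSE: a character vector that happens to be empty
is.character(empty)   # TRUE
length(empty)         # 0

# Hypothetical helper: map any zero-length names vector to NULL
# before storing it as a dimnames component.
normalize_names <- function(x) {
  if (length(x) == 0L) NULL else x
}

is.null(normalize_names(character(0)))   # TRUE
normalize_names(c("a", "b"))             # unchanged
```

Code that tests is.null() or is.character() on a dimnames component treats these two values differently, which is presumably how an empty character vector ended up on a code path meant for real column names.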