dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org
850 stars · 135 forks

collocations should work with itoken_parallel #294

Closed leungi closed 5 years ago

leungi commented 5 years ago

With reference to issue #194, it seems `Collocations` still doesn't work correctly with `itoken_parallel()`.

Reproducible example below.

library(text2vec)
data("movie_review")

preprocessor = function(x) {
  # keep letters/digits/whitespace; replace everything else with a space
  # (note: "\\s" is not a valid POSIX bracket-expression escape in R's
  # default regex engine, so [:space:] is used instead)
  gsub("[^[:alnum:][:space:]]", replacement = " ", tolower(x))
}
sample_ind = 1:100
tokens = word_tokenizer(preprocessor(movie_review$review[sample_ind]))

## non-parallel itoken()
it = itoken(tokens, ids = movie_review$id[sample_ind])
system.time(v <- create_vocabulary(it))
#>    user  system elapsed 
#>    0.11    0.00    0.11
v = prune_vocabulary(v, term_count_min = 5)

model = Collocations$new(collocation_count_min = 5, pmi_min = 5)
model$fit(it, n_iter = 2)
#> INFO [2018-12-26 14:46:31] iteration 1 - found 42 collocations
#> INFO [2018-12-26 14:46:31] iteration 2 - found 46 collocations
head(model$collocation_stat)
#>        prefix  suffix n_i n_j n_ij       pmi      lfmd    gensim       llr
#> 1:     jeroen   krabb   5   5    5 11.565197 -11.56520   0.00000  90.16219
#> 2:    special effects   8   7    7 10.887125 -11.27242 541.10714 115.48721
#> 3:     boogey     man   5  25    5  9.243269 -13.88713   0.00000  65.14207
#> 4: could_have    been   8  23    7  9.138369 -12.95607 161.01087  86.90628
#> 5:        hit     man  11  25    7  8.591193 -13.56835 110.18909  77.45678
#> 6:     sexual  scenes  13  19    6  8.523721 -14.08061  61.34008  64.37469

it2 = model$transform(it)
v2 = create_vocabulary(it2)
v2 = prune_vocabulary(v2, term_count_min = 5)
# check what phrases model has learned
head(setdiff(v2$term, v$term))
#> [1] "have_seen"   "don_t_care"  "i_mean"      "my_favorite" "better_than"
#> [6] "more_than"
## parallel itoken()
doParallel::registerDoParallel(2)
it_p = itoken_parallel(tokens, ids = movie_review$id[sample_ind])
system.time(v_p <- create_vocabulary(it_p))
#>    user  system elapsed 
#>    0.03    0.00    1.67
v_p = prune_vocabulary(v_p, term_count_min = 5)

model = Collocations$new(collocation_count_min = 5, pmi_min = 5)
model$fit(it_p, n_iter = 2)
#> INFO [2018-12-26 14:46:35] iteration 1 - found 42 collocations
#> Error in {: task 1 failed - "external pointer is not valid"
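Until this is resolved upstream, a possible workaround (a sketch only, based on the working sequential path shown above, not a confirmed fix from the maintainer) is to fit and transform the `Collocations` model on a sequential `itoken()` iterator, and reserve `itoken_parallel()` for steps that don't touch the fitted model. The error message ("external pointer is not valid") suggests the fitted model's C++ object cannot be shared with forked/spawned workers:

```r
library(text2vec)
data("movie_review")

sample_ind = 1:100
tokens = word_tokenizer(tolower(movie_review$review[sample_ind]))

# Sequential iterator: the collocation model fits and transforms without error
it_seq = itoken(tokens, ids = movie_review$id[sample_ind])

model = Collocations$new(collocation_count_min = 5, pmi_min = 5)
model$fit(it_seq, n_iter = 2)          # sequential fit avoids the failure
it_phrases = model$transform(it_seq)   # sequential transform as well

# Downstream steps that don't reference the model's external pointer
# can still use the (re-created) parallel iterator
v = create_vocabulary(it_phrases)
v = prune_vocabulary(v, term_count_min = 5)
```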
dselivanov commented 5 years ago

Thanks for reporting, and for the reproducible example. I believe this is the same issue as #293. Closing in favor of it.

leungi commented 5 years ago

@dselivanov, thanks for the prompt update, and apologies for the duplication.

May the new year bring more exciting things for #text2vec!