dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Collocations model error: Not compatible with STRSXP: [type=NULL] #314

Closed leungi closed 4 years ago

leungi commented 4 years ago

Reprex below; tested on R 3.6.2 (Windows) and R 3.5.1 (RHEL7), with the same outcome.

library(text2vec)

model = Collocations$new()

test_txt = c("i am living in a new apartment in new york city", 
             "new york is the same as new york city", 
             "san francisco is very expensive city", 
             "who claimed that model works?")

it = itoken(test_txt, n_chunks = 1, progressbar = FALSE)

it_phrases = model$transform(it)
#> Error in create_xptr_unordered_set(private$phrases): Not compatible with STRSXP: [type=NULL].

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.2 (2019-12-12)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/Chicago             
#>  date     2020-02-11                  
#> 
#> - Packages -------------------------------------------------------------------
#>  ! package        * version date       lib source        
#>    assertthat       0.2.1   2019-03-21 [1] CRAN (R 3.6.1)
#>    backports        1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
#>    callr            3.4.0   2019-12-09 [1] CRAN (R 3.6.2)
#>    cli              2.0.1   2020-01-08 [1] CRAN (R 3.6.2)
#>    codetools        0.2-16  2018-12-24 [1] CRAN (R 3.6.2)
#>    crayon           1.3.4   2017-09-16 [1] CRAN (R 3.6.1)
#>    data.table       1.12.8  2019-12-09 [1] CRAN (R 3.6.2)
#>    desc             1.2.0   2018-05-01 [1] CRAN (R 3.6.1)
#>    devtools         2.2.1   2019-09-24 [1] CRAN (R 3.6.1)
#>    digest           0.6.23  2019-11-23 [1] CRAN (R 3.6.2)
#>    ellipsis         0.3.0   2019-09-20 [1] CRAN (R 3.6.1)
#>    evaluate         0.14    2019-05-28 [1] CRAN (R 3.6.1)
#>    fansi            0.4.1   2020-01-08 [1] CRAN (R 3.6.2)
#>    foreach          1.4.7   2019-07-27 [1] CRAN (R 3.6.2)
#>    formatR          1.7     2019-06-11 [1] CRAN (R 3.6.2)
#>    fs               1.3.1   2019-05-06 [1] CRAN (R 3.6.1)
#>    futile.logger    1.4.3   2016-07-10 [1] CRAN (R 3.6.2)
#>    futile.options   1.0.1   2018-04-20 [1] CRAN (R 3.6.0)
#>    glue             1.3.1   2019-03-12 [1] CRAN (R 3.6.1)
#>    highr            0.8     2019-03-20 [1] CRAN (R 3.6.2)
#>    htmltools        0.4.0   2019-10-04 [1] CRAN (R 3.6.1)
#>    iterators        1.0.12  2019-07-26 [1] CRAN (R 3.6.2)
#>    knitr            1.27    2020-01-16 [1] CRAN (R 3.6.2)
#>    lambda.r         1.2.4   2019-09-18 [1] CRAN (R 3.6.2)
#>    lattice          0.20-38 2018-11-04 [1] CRAN (R 3.6.2)
#>    magrittr         1.5     2014-11-22 [1] CRAN (R 3.6.1)
#>    Matrix           1.2-18  2019-11-27 [1] CRAN (R 3.6.2)
#>    memoise          1.1.0   2017-04-21 [1] CRAN (R 3.6.1)
#>    mlapi            0.1.0   2017-12-17 [1] CRAN (R 3.6.2)
#>    pkgbuild         1.0.6   2019-10-09 [1] CRAN (R 3.6.1)
#>    pkgload          1.0.2   2018-10-29 [1] CRAN (R 3.6.1)
#>    prettyunits      1.1.1   2020-01-24 [1] CRAN (R 3.6.2)
#>    processx         3.4.1   2019-07-18 [1] CRAN (R 3.6.1)
#>    ps               1.3.0   2018-12-21 [1] CRAN (R 3.6.1)
#>    purrr            0.3.3   2019-10-18 [1] CRAN (R 3.6.1)
#>    R6               2.4.1   2019-11-12 [1] CRAN (R 3.6.2)
#>    Rcpp             1.0.3   2019-11-08 [1] CRAN (R 3.6.2)
#>  D RcppParallel     4.4.4   2019-09-27 [1] CRAN (R 3.6.2)
#>    remotes          2.1.0   2019-06-24 [1] CRAN (R 3.6.1)
#>    rlang            0.4.4   2020-01-28 [1] CRAN (R 3.6.2)
#>    rmarkdown        2.0     2019-12-12 [1] CRAN (R 3.6.2)
#>    rprojroot        1.3-2   2018-01-03 [1] CRAN (R 3.6.1)
#>    sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 3.6.1)
#>    stringi          1.4.5   2020-01-11 [1] CRAN (R 3.6.2)
#>    stringr          1.4.0   2019-02-10 [1] CRAN (R 3.6.1)
#>    testthat         2.3.1   2019-12-01 [1] CRAN (R 3.6.2)
#>    text2vec       * 0.5.1   2018-01-11 [1] CRAN (R 3.6.0)
#>    usethis          1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
#>    withr            2.1.2   2018-03-15 [1] CRAN (R 3.6.1)
#>    xfun             0.12    2020-01-13 [1] CRAN (R 3.6.2)
#>    yaml             2.2.0   2018-07-25 [1] CRAN (R 3.6.0)
#> 
#> [1] C:/Users/Public/Data/R/R-3.6.2/library
#> 
#>  D -- DLL MD5 mismatch, broken installation.

library(text2vec)
#> Warning: package 'text2vec' was built under R version 3.5.3
test_txt = c("i am living in a new apartment in new york city", 
             "new york is the same as new york city", 
             "san francisco is very expensive city", 
             "who claimed that model works?")
it = itoken(test_txt, n_chunks = 1, progressbar = FALSE)
model = Collocations$new()
it_phrases = model$transform(it)
#> Error in create_xptr_unordered_set(private$phrases): Not compatible with STRSXP: [type=NULL].

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Red Hat Enterprise Linux Server 7.3 (Maipo)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] text2vec_0.5.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.3           mlapi_0.1.0          knitr_1.27          
#>  [4] magrittr_1.5         lattice_0.20-35      R6_2.4.1            
#>  [7] rlang_0.4.2          foreach_1.4.7        stringr_1.4.0       
#> [10] highr_0.8            tools_3.5.1          grid_3.5.1          
#> [13] data.table_1.12.8    xfun_0.12            lambda.r_1.2.4      
#> [16] futile.logger_1.4.3  htmltools_0.4.0      iterators_1.0.12    
#> [19] yaml_2.2.0           RcppParallel_4.4.4   digest_0.6.23       
#> [22] Matrix_1.2-14        formatR_1.7          futile.options_1.0.1
#> [25] codetools_0.2-15     evaluate_0.14        rmarkdown_2.1       
#> [28] stringi_1.4.5        compiler_3.5.1

dselivanov commented 4 years ago

Please consult ?Collocations. You first need to train the model, and only then can you use the transform method. But thanks for reporting: we need to produce a more meaningful error when the transform method is called before the fit method.
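
For illustration, the fit-then-transform order described here could look like the sketch below. It reuses the corpus from the report. The `collocation_count_min` and `pmi_min` values are assumptions picked only because the corpus is tiny (the package defaults would likely find nothing in four sentences), and whether any phrases are actually learned depends on those thresholds.

```r
library(text2vec)

test_txt = c("i am living in a new apartment in new york city",
             "new york is the same as new york city",
             "san francisco is very expensive city",
             "who claimed that model works?")

it = itoken(test_txt, n_chunks = 1, progressbar = FALSE)

# Thresholds lowered for a four-sentence toy corpus (illustrative assumption).
model = Collocations$new(collocation_count_min = 2, pmi_min = 0)

# Train first: fitting is what gives the model phrases to work with.
model$fit(it, n_iter = 1)

# Only after fitting is transform() meaningful; it returns an iterator
# in which learned collocations are merged into single terms.
it_phrases = model$transform(it)

# Inspect the terms produced by the transformed iterator.
v_phrases = create_vocabulary(it_phrases)
print(v_phrases$term)
```

On a real corpus the thresholds would normally be much stricter, and `n_iter` can be raised so that longer phrases are built from previously merged ones.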

leungi commented 4 years ago

Apologies @dselivanov; I should've been more careful to realize that it's impossible to do anything without a fit object 😅

If I may, I can open a PR to edit the docs. Am I correct to assume this is where I should add the error message: https://github.com/dselivanov/text2vec/blob/13d0b0349051917bbbbe6a9c2ad3fbe1dd598b9e/R/model_Collocations.R#L242

dselivanov commented 4 years ago

I would really appreciate it if you could send a PR that adds code to check whether the model was fitted.
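
A minimal sketch of such a check, written in base R so it runs without the package. It assumes the unfitted state can be detected by the learned phrases still being `NULL`, which is what the `create_xptr_unordered_set(private$phrases)` error with `[type=NULL]` suggests. The helper name `assert_fitted` and the message wording are hypothetical and are not part of text2vec.

```r
# Hypothetical helper, not part of text2vec: fail early with a readable
# message when the learned phrases are still NULL (model never fitted).
assert_fitted = function(phrases) {
  if (is.null(phrases)) {
    stop("Collocations model is not fitted: call fit() before transform().",
         call. = FALSE)
  }
  invisible(TRUE)
}

# Unfitted state: the guard raises the descriptive error instead of the
# low-level "Not compatible with STRSXP" message.
msg = tryCatch(assert_fitted(NULL), error = function(e) conditionMessage(e))
print(msg)

# Fitted state (any character vector of phrases) passes silently.
assert_fitted(c("new_york", "san_francisco"))
```

Inside the class, a check like this would presumably sit at the top of `transform`, called on `private$phrases` before anything is handed to the C++ layer.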