bnosac / word2vec

Distributed Representations of Words using word2vec
Apache License 2.0

Training `word2vec` model fails on Fedora Linux #12

Open steffen-stell opened 1 year ago

steffen-stell commented 1 year ago

Training any word2vec() model fails on Fedora 37 with the binary from the iucar/cran COPR repository. I first reported the problem there, but the maintainer made it clear that this is a bug in the word2vec package itself. He has posted some initial findings in that issue.

jwijffels commented 1 year ago

Hello, thanks for the report.

Questions: I don't have a machine with Fedora. Is this reproducible on rhub, or how else can I reproduce it? I only have Windows and Ubuntu. Does this out-of-bounds error also happen when you do the standard data processing provided in the word2vec package (txt_clean_word2vec)?

steffen-stell commented 1 year ago

Thanks for the quick response.

> I don't have a machine with Fedora. Is this reproducible on rhub or how can I reproduce this - I only have Windows and Ubuntu.

I have no experience with r-hub, so I can't tell you whether you can reproduce it there. It should be fairly easy to reproduce in a Docker container or a VM. R on Fedora is not difficult to set up. A Dockerfile to build an image that reproduces this would look like this:

FROM fedora:37
RUN dnf update -y
RUN dnf install 'dnf-command(copr)' -y
RUN dnf copr enable iucar/cran -y
RUN dnf install R-CoprManager R-CRAN-quanteda R-CRAN-word2vec -y

> I didn't even know that you could pass a quanteda corpus to the function.

The corpus class objects from quanteda are S3 objects of type character. So to any function that is not an S3 generic with a method for corpus objects, such an object is just a character vector with a bunch of attributes. I originally encountered this problem with another corpus that was not a quanteda corpus object. I was just looking for a built-in dataset to make a quick reproducible example.
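The point about S3 objects can be illustrated in base R without quanteda. The sketch below uses a hypothetical stand-in object (`fake_corpus`) and a hypothetical helper (`count_chars`); neither is part of quanteda or word2vec, and the real corpus class carries different attributes.

```r
# Hypothetical stand-in for a corpus object: a character vector carrying a
# class attribute and extra metadata (not the real quanteda structure).
fake_corpus <- structure(
  c(doc1 = "first document", doc2 = "second document"),
  class = c("corpus", "character"),
  meta = list(source = "example")
)

# A plain function, not an S3 generic: it never looks at the class attribute.
count_chars <- function(txt) {
  stopifnot(is.character(txt))
  nchar(txt)
}

is.character(fake_corpus)   # the underlying type is still character
typeof(fake_corpus)
count_chars(fake_corpus)    # treated like a bare character vector
```

Because `count_chars()` has no method dispatch, the extra class and attributes are simply ignored, which is why passing such an object to a function that expects a character vector can work unnoticed.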

> Does this out of bounds happen when you do the standard data processing provided in the word2vec package (txt_clean_word2vec)

I've tried to do this with txt_clean_word2vec(), but it still fails.
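For reference, a check along these lines might look like the following minimal sketch. The data set (`brussels_reviews` from udpipe) and the training parameters are assumptions borrowed from the word2vec package documentation, not the exact code run for this report.

```r
# Sketch of a reproduction script; data set and parameters are assumptions
# taken from the word2vec documentation, not from this thread.
library(udpipe)
library(word2vec)

data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")$feedback

# Standard cleaning provided by word2vec, as suggested by the maintainer
x <- txt_clean_word2vec(x, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE)

# On affected systems, this training call is where the failure is reported
model <- word2vec(x = x, type = "cbow", dim = 15, iter = 20)
```

If training still fails after this cleaning step, the input text is unlikely to be the cause, which points toward the compiled code or the build environment.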

Draic commented 1 month ago

Trying to use the word2vec package on Arch Linux also does not work. I don't know whether it fails for the same reasons as reported above.

[Screenshot: linux-crash]

> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas.so.0.3;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] word2vec_0.4.0 udpipe_0.8.11 

loaded via a namespace (and not attached):
 [1] Matrix_1.7-0      miniUI_0.1.1.1    compiler_4.4.1    promises_1.3.0    Rcpp_1.0.13       stringr_1.5.1    
 [7] later_1.3.2       yaml_2.3.10       fastmap_1.2.0     lattice_0.22-6    mime_0.12         R6_2.5.1         
[13] knitr_1.48        htmlwidgets_1.6.4 profvis_0.3.8     shiny_1.8.1.1     rlang_1.1.4       cachem_1.1.0     
[19] stringi_1.8.4     httpuv_1.6.15     xfun_0.46         fs_1.6.4          pkgload_1.4.0     memoise_2.0.1    
[25] cli_3.6.3         magrittr_2.0.3    grid_4.4.1        digest_0.6.36     rstudioapi_0.16.0 xtable_1.8-4     
[31] remotes_2.5.0     devtools_2.4.5    lifecycle_1.0.4   vctrs_0.6.5       data.table_1.15.4 evaluate_0.24.0  
[37] glue_1.7.0        urlchecker_1.0.1  sessioninfo_1.2.2 pkgbuild_1.4.4    rmarkdown_2.27    purrr_1.0.2      
[43] tools_4.4.1       usethis_3.0.0     ellipsis_0.3.2    htmltools_0.5.8.1