Open steffen-stell opened 1 year ago
Hello, thanks for the report.
Questions:
I don't have a machine with Fedora. Is this reproducible on rhub or how can I reproduce this - I only have Windows and Ubuntu.
Does this out of bounds happen when you do the standard data processing provided in the word2vec package (txt_clean_word2vec)?
Thanks for the quick response.
I don't have a machine with Fedora. Is this reproducible on rhub or how can I reproduce this - I only have Windows and Ubuntu.
I have no experience with r-hub, so I can't tell you if you can reproduce it there. It should be fairly easy to reproduce in a Docker container or a VM. R on Fedora is not difficult to set up. A Dockerfile to build an image to reproduce this would look like this:
FROM fedora:37
RUN dnf update -y
RUN dnf install 'dnf-command(copr)' -y
RUN dnf copr enable iucar/cran -y
RUN dnf install R-CoprManager R-CRAN-quanteda R-CRAN-word2vec -y
I didn't even know that you could pass a quanteda corpus to the function.
The corpus class objects from quanteda are S3 objects of type character. So to any function that is not an S3 generic with a method for corpus objects, it is just a character vector with a bunch of attributes. I originally encountered this problem with another corpus. That was not a quanteda corpus object. I was just looking for a built-in dataset to make a quick reproducible example.
Does this out of bounds happen when you do the standard data processing provided in the word2vec package (txt_clean_word2vec)?
I've tried to do this with txt_clean_word2vec(), but it still fails.
Trying to use the word2vec package on Arch Linux also does not work. I don't know whether it fails for the same reasons as reported before.
> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS
Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas.so.0.3; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] word2vec_0.4.0 udpipe_0.8.11
loaded via a namespace (and not attached):
[1] Matrix_1.7-0 miniUI_0.1.1.1 compiler_4.4.1 promises_1.3.0 Rcpp_1.0.13 stringr_1.5.1
[7] later_1.3.2 yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6 mime_0.12 R6_2.5.1
[13] knitr_1.48 htmlwidgets_1.6.4 profvis_0.3.8 shiny_1.8.1.1 rlang_1.1.4 cachem_1.1.0
[19] stringi_1.8.4 httpuv_1.6.15 xfun_0.46 fs_1.6.4 pkgload_1.4.0 memoise_2.0.1
[25] cli_3.6.3 magrittr_2.0.3 grid_4.4.1 digest_0.6.36 rstudioapi_0.16.0 xtable_1.8-4
[31] remotes_2.5.0 devtools_2.4.5 lifecycle_1.0.4 vctrs_0.6.5 data.table_1.15.4 evaluate_0.24.0
[37] glue_1.7.0 urlchecker_1.0.1 sessioninfo_1.2.2 pkgbuild_1.4.4 rmarkdown_2.27 purrr_1.0.2
[43] tools_4.4.1 usethis_3.0.0 ellipsis_0.3.2 htmltools_0.5.8.1
Training any word2vec() model fails on Fedora 37 with the binary from the iucar/cran COPR repository. I first reported the problem there, but the maintainer makes clear that it is a bug in the word2vec package. He has posted some first insights in the issue.