bstewart / stm

An R Package for the Structural Topic Model

Errors in using searchK() from a quanteda dfm approach #198

Open jooyoungseo opened 5 years ago

jooyoungseo commented 5 years ago

Please check whether users can pass a quanteda dfm object to the searchK() function. There appears to be an issue:

library(stm)
#> stm v1.3.3 () successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com
library(quanteda)
#> Package version: 1.4.5
#> Parallel computing: 2 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View

gadarian_corpus <- corpus(gadarian, text_field = "open.ended.response")

gadarian_dfm <- dfm(gadarian_corpus, 
                     remove = stopwords("english"),
                     stem = TRUE)

meta <- docvars(gadarian_corpus)

set.seed(02138)
kresult <- searchK(documents = gadarian_dfm, K = c(5,10,15), prevalence = ~treatment + s(pid_rep), data = meta)
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

Created on 2019-05-17 by the reprex package (v0.3.0.9000)

Session info

``` r
devtools::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value
#>  version  R version 3.6.0 (2019-04-26)
#>  os       Windows 10 x64
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2019-05-17
#>
#> - Packages --------------------------------------------------------------
#>  ! package      * version    date       lib source
#>    assertthat     0.2.1      2019-03-21 [1] CRAN (R 3.6.0)
#>    backports      1.1.4      2019-04-10 [1] CRAN (R 3.6.0)
#>    callr          3.2.0      2019-03-15 [1] CRAN (R 3.6.0)
#>    cli            1.1.0      2019-03-19 [1] CRAN (R 3.6.0)
#>    colorspace     1.4-1      2019-03-18 [1] CRAN (R 3.6.0)
#>    crayon         1.3.4      2017-09-16 [1] CRAN (R 3.6.0)
#>    data.table     1.12.3     2019-05-15 [1] Github (Rdatatable/data.table@93f50f7)
#>    desc           1.2.0      2018-05-01 [1] CRAN (R 3.6.0)
#>    devtools       2.0.2.9000 2019-05-13 [1] Github (r-lib/devtools@92d32cb)
#>    digest         0.6.18     2018-10-10 [1] CRAN (R 3.6.0)
#>    dplyr          0.8.0.9014 2019-05-06 [1] Github (hadley/dplyr@9c6f59e)
#>    evaluate       0.13       2019-02-12 [1] CRAN (R 3.6.0)
#>    fastmatch      1.1-0      2017-01-28 [1] CRAN (R 3.6.0)
#>    fs             1.3.1      2019-05-06 [1] CRAN (R 3.6.0)
#>    ggplot2        3.1.1.9000 2019-05-17 [1] Github (hadley/ggplot2@1f6f0cb)
#>    glue           1.3.1.9000 2019-05-01 [1] Github (tidyverse/glue@ea0edcb)
#>    gtable         0.3.0      2019-03-25 [1] CRAN (R 3.6.0)
#>    highr          0.8        2019-03-20 [1] CRAN (R 3.6.0)
#>    htmltools      0.3.6      2017-04-28 [1] CRAN (R 3.6.0)
#>    ISOcodes       2019.04.22 2019-04-23 [1] CRAN (R 3.6.0)
#>    knitr          1.22.12    2019-05-17 [1] Github (yihui/knitr@f85bce0)
#>    lattice        0.20-38    2018-11-04 [1] CRAN (R 3.6.0)
#>    lazyeval       0.2.2      2019-03-15 [1] CRAN (R 3.6.0)
#>    lubridate      1.7.4.9000 2019-05-01 [1] Github (hadley/lubridate@99e2af3)
#>    magrittr       1.5        2014-11-22 [1] CRAN (R 3.6.0)
#>    Matrix         1.2-17     2019-03-22 [1] CRAN (R 3.6.0)
#>    memoise        1.1.0      2017-04-21 [1] CRAN (R 3.6.0)
#>    munsell        0.5.0      2018-06-12 [1] CRAN (R 3.6.0)
#>    pillar         1.4.0      2019-05-11 [1] CRAN (R 3.6.0)
#>    pkgbuild       1.0.3      2019-03-20 [1] CRAN (R 3.6.0)
#>    pkgconfig      2.0.2      2018-08-16 [1] CRAN (R 3.6.0)
#>    pkgload        1.0.2      2018-10-29 [1] CRAN (R 3.6.0)
#>    prettyunits    1.0.2      2015-07-13 [1] CRAN (R 3.6.0)
#>    processx       3.3.1      2019-05-08 [1] CRAN (R 3.6.0)
#>    ps             1.3.0      2018-12-21 [1] CRAN (R 3.6.0)
#>    purrr          0.3.2.9000 2019-04-27 [1] Github (hadley/purrr@25d84f7)
#>    quanteda     * 1.4.5      2019-05-13 [1] Github (quanteda/quanteda@29d9fd3)
#>    R6             2.4.0      2019-02-14 [1] CRAN (R 3.6.0)
#>    Rcpp           1.0.1.3    2019-04-27 [1] Github (RcppCore/Rcpp@6062d56)
#>  D RcppParallel   4.4.2      2018-12-11 [1] CRAN (R 3.6.0)
#>    remotes        2.0.4.9000 2019-05-13 [1] Github (r-lib/remotes@ba2f034)
#>    rlang          0.3.4.9003 2019-05-01 [1] Github (r-lib/rlang@6a232c0)
#>    rmarkdown      1.12.8     2019-05-15 [1] Github (rstudio/rmarkdown@62ab411)
#>    rprojroot      1.3-2      2018-01-03 [1] CRAN (R 3.6.0)
#>    scales         1.0.0      2018-08-09 [1] CRAN (R 3.6.0)
#>    sessioninfo    1.1.1      2018-11-05 [1] CRAN (R 3.6.0)
#>    SnowballC      0.6.0      2019-01-15 [1] CRAN (R 3.6.0)
#>    spacyr         1.1        2019-05-13 [1] Github (quanteda/spacyr@4d1373d)
#>    stm          * 1.3.3      2019-04-27 [1] Github (bstewart/stm@525b00c)
#>    stopwords      0.9.0      2017-12-14 [1] CRAN (R 3.6.0)
#>    stringi        1.4.3      2019-03-12 [1] CRAN (R 3.6.0)
#>    stringr        1.4.0.9000 2019-05-15 [1] Github (hadley/stringr@0b90f91)
#>    testthat       2.1.1      2019-04-23 [1] CRAN (R 3.6.0)
#>    tibble         2.1.1      2019-03-16 [1] CRAN (R 3.6.0)
#>    tidyselect     0.2.5.9000 2019-04-27 [1] Github (tidyverse/tidyselect@19150c0)
#>    usethis        1.5.0.9000 2019-05-13 [1] Github (r-lib/usethis@dced164)
#>    vctrs          0.1.0.9003 2019-05-17 [1] Github (r-lib/vctrs@cd0e31e)
#>    withr          2.1.2      2018-03-15 [1] CRAN (R 3.6.0)
#>    xfun           0.7        2019-05-14 [1] CRAN (R 3.6.0)
#>    yaml           2.2.0      2018-07-25 [1] CRAN (R 3.6.0)
#>    zeallot        0.1.0      2018-01-28 [1] CRAN (R 3.6.0)
#>
#> [1] C:/Program Files/R/R-3.6.0/library
#>
#>  D -- DLL MD5 mismatch, broken installation.
```

meier-flo commented 5 years ago

I ran into the same issue today when setting K = 0, as described on pages 12-13 of the vignette, to get a rough estimate of the number of topics.

Would be really cool if you could look into that and maybe fix it!

Best, Flo

panoptikum commented 5 years ago

me too, but I remember that it was not an issue in the past...

andyjslee commented 5 years ago

I'm having this issue, too! Would love it if it can be resolved.

HarrisonBP commented 4 years ago

same problem here.

paullemmens commented 3 years ago

Me too.

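The problem is that we all use a quanteda dfm as input. This conflicts with the default value of searchK()'s N parameter, which is simply computed from the length of the documents argument. That works with stm's internal documents-and-vocabulary format, where documents holds one element per document. It does not work for a quanteda::dfm(): as far as I can tell, the length of a dfm is its number of cells (documents times features) rather than its number of documents, so the default N ends up larger than the number of documents available to sample from.

In my case I had a fixed number of documents, so I simply guesstimated from example code what the approximate size of N would have been and entered that number manually into searchK().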
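
The length mismatch described above can be illustrated without stm or quanteda installed. This is only a minimal sketch: it assumes a dfm behaves like a sparse matrix from the Matrix package as far as length() is concerned, and the names toy_dfm, toy_docs, n_docs and n_feat are made up for the illustration.

```r
library(Matrix)

n_docs <- 50   # hypothetical number of documents
n_feat <- 200  # hypothetical number of features (vocabulary size)

# Stand-in for a dfm: a sparse documents-by-features matrix.
# Assumption: length() on a dfm behaves like length() on this matrix.
toy_dfm <- sparseMatrix(
  i = seq_len(n_docs),
  j = seq_len(n_docs),
  x = rep(1, n_docs),
  dims = c(n_docs, n_feat)
)

# Stand-in for stm's native format: a list with one element per document.
toy_docs <- vector("list", n_docs)

# The default N is floor(0.1 * length(documents)).
n_from_dfm  <- floor(0.1 * length(toy_dfm))   # based on n_docs * n_feat cells
n_from_list <- floor(0.1 * length(toy_docs))  # based on n_docs documents

# Sampling more documents than exist reproduces the reported error message.
msg <- tryCatch(
  {
    sample(seq_len(n_docs), n_from_dfm)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(c(n_from_dfm = n_from_dfm, n_from_list = n_from_list))
print(msg)
```

With the matrix stand-in, N is derived from the number of cells and exceeds the document count, whereas the list stand-in gives a tenth of the document count.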

tilloverlack commented 3 years ago

Can you briefly explain how you did the estimation? How did you arrive at the approximate size of N?

paullemmens commented 3 years ago

I think I looked up in the code how the parameter is set by default and verified the outcome with an example data set in stm format. That gave me sufficient clues to extrapolate to my own use case.

HTH!

pf5179vr commented 2 years ago

Hey guys, I followed @paullemmens's instructions (thanks a lot) and set searchK's N parameter to floor(0.1 * nrow(meta)), and it works. The default N is floor(0.1 * length(documents)), where length(documents) is meant to be the number of documents you have. In our case, that number is the number of rows of our meta, or you can simply plug in your number of documents directly. Good luck!
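
The workaround reported in this thread could be wrapped up roughly as follows. This is a sketch under stated assumptions, not an official fix: heldout_n() and search_k_dfm() are hypothetical helpers that are not part of stm, and the wrapper assumes that meta has one row per document in the dfm.

```r
# Hypothetical helper (not part of stm): size of the held-out set, computed
# from the document count instead of from length(documents).
heldout_n <- function(n_docs, share = 0.1) {
  stopifnot(n_docs > 0, share > 0, share < 1)
  floor(share * n_docs)
}

# Hypothetical wrapper (not part of stm): calls stm::searchK() with N set
# explicitly. Assumes `meta` has one row per document in `dfm`.
# Further arguments such as `prevalence` are passed through via `...`.
search_k_dfm <- function(dfm, meta, K, ...) {
  stm::searchK(
    documents = dfm,
    K = K,
    N = heldout_n(nrow(meta)),
    data = meta,
    ...
  )
}

# For example, 500 documents would give a held-out set of 50.
print(heldout_n(500))

# Usage with the objects from the original report (needs stm and quanteda):
# set.seed(02138)
# kresult <- search_k_dfm(gadarian_dfm, meta, K = c(5, 10, 15),
#                         prevalence = ~treatment + s(pid_rep))
```

Passing N explicitly avoids the default argument altogether, so the call no longer depends on what length() returns for the input object.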