bstewart / stm

An R Package for the Structural Topic Model

Error for plot.estimateEffect #16

Open cschwem2er opened 8 years ago

cschwem2er commented 8 years ago

Hi, when trying to plot effect estimates in a model with a content variable an error is raised:

# post_type: factor (4 levels)
# numdate: continuous numeric
# core: dummy

comments <- stm(out$documents, out$vocab, K = 50,
                prevalence =~ post_type + s(numdate) * core,
                content =~ core,
                data = out$meta,
                max.em.its = 150, seed = 1337, emtol = 1e-4,
                init.type = 'Spectral', verbose = TRUE)

prep <- estimateEffect(c(12) ~ post_type + s(numdate) * core, 
                       comments, metadata=out$meta) # takes 20 minutes

Error in names(cdata) <- covariate : object 'cdata' not found

From what I can tell this does not depend on the plot method:

plot.estimateEffect(prep, 'post_type')
Error in names(cdata) <- covariate : object 'cdata' not found
bstewart commented 8 years ago

I'm not sure what is happening here. Perhaps you could send me an example that reproduces the error?

cschwem2er commented 8 years ago

I'd really like to send you an example, but I'm afraid sharing this might be complicated. A compressed workspace with only the objects for the model stm, the metadata out and the effect estimates prep is > 2 GB in size; stm alone is listed with 1.2 GB. Is it possible to compress the model object somehow? I'd assume the memory consumption comes from very large matrices, which it should be possible to convert to sparse matrices?

bstewart commented 8 years ago

I have a guess for you to try. Convert the covariates you use to numerics via as.numeric()

out$meta$numdate <- as.numeric(out$meta$numdate)

etc.

bstewart commented 8 years ago

Did this work out? I'm looking into some kind of memory compression for the next release. Unfortunately nothing in these models is actually completely sparse besides Kappa, which isn't what takes up the majority of the space. So any compression would need a function that recalculates things in order to reconstruct the full version.

cschwem2er commented 8 years ago

Hi,

I just checked it and it indeed worked out! The content covariate was a binary variable and stored as integer. After converting it with as.numeric() everything was fine. I don't understand why the difference between integer and numeric is a problem, but maybe that's just one of the things a Python user will never understand about R ;-)
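For readers coming from Python, here is a small sketch of the distinction mentioned above. R reports whole numbers stored as integers with class "integer", while as.numeric() returns a double that R reports as class "numeric". The meta data frame and its columns are made up for illustration and are not the data from this issue; the sketch only shows the conversion, not why stm cared about the difference.

```r
# Hypothetical metadata; 'core' is a 0/1 dummy stored as integer.
meta <- data.frame(core = c(0L, 1L, 1L), numdate = c(10, 20, 30))

class(meta$core)     # "integer"
class(meta$numdate)  # "numeric"

# Convert every integer column to numeric (double) in one pass.
is_int <- vapply(meta, is.integer, logical(1))
meta[is_int] <- lapply(meta[is_int], as.numeric)

class(meta$core)     # "numeric"
```

The values themselves are unchanged; only the storage type differs.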

bstewart commented 8 years ago

I don't understand it either honestly.

cschwem2er commented 8 years ago

Just in case compression is still something you are trying to tackle in the next release, maybe you can also have a look at compression for estimateEffect objects. At the moment I'm trying to create a shiny app for STM inspections and need to estimate effects for all topics. Trying this with a model of 50 topics, 900k documents and a vocabulary of ~7000 terms is far too big:

> prep <- estimateEffect(c(1:50) ~ post_id + s(numdate, degree=2) * core, 
+                        p50, metadata=out$meta)
Error: cannot allocate vector of size 25.3 Gb

Let me know if I can help you somehow, although my R programming skills are limited.

bstewart commented 8 years ago

Wow. No kidding. That's pretty intense. I'm glad to see estimation works for you at that scale at all!

Yeah there are definitely ways to solve this problem but it involves rather different methods to solve the least squares problem. Even lm() will give you some trouble at that scale. I'll give it some thought.
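To put the numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes dense double-precision storage (8 bytes per entry) and that the "Gb" in R's error message means 2^30 bytes; dense_gib() is a hypothetical helper, and which object stm was actually trying to allocate is not stated in this thread.

```r
# Size in GiB of a dense double-precision matrix (8 bytes per entry).
dense_gib <- function(n_rows, n_cols) n_rows * n_cols * 8 / 1024^3

# A documents x topics matrix for 900k documents and 50 topics
# is only about a third of a GiB.
dense_gib(900000, 50)

# Number of doubles that fit in the 25.3 Gb vector from the error message
# (roughly 3.4 billion).
n_doubles <- 25.3 * 1024^3 / 8

# If (and this is only an assumption) that vector were a dense matrix with
# one row per document, it would have this many columns.
n_doubles / 900000
```

Under that assumption the failed allocation corresponds to several thousand columns per document, far more than the 50 topics alone would explain.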


tingleyd commented 8 years ago

Carsten, feel free to share any shiny stuff. I also have some undergrads exploring this. Dt


cschwem2er commented 8 years ago

Hi Dustin, for stm I don't have an app yet as the computational issues (and object sizes) are a big problem at the moment. But maybe you or your students might be interested in another app that I created to inspect parliamentary written questions: http://pathways.polsys.uni-bamberg.de:443/questions/ The app is used in a project about representation of Citizens of Immigrant Origin and seems to be very helpful for people who lack computational skills.

Just write me an e-mail if you want to know more (carsten.schwemmer@uni-bamberg.de).

Cheers, Carsten

cschwem2er commented 8 years ago

Hi Brandon,

I'm afraid the Error in names(cdata) <- covariate : object 'cdata' not found problem still is not fully solved. For a different model, even after converting integers to numeric, trying to plot effects raises the error. As the model/data are not large this time, I can provide you with reproduction material. An R workspace with objects for metadata, model and effects can be downloaded here. And here is the syntax:

# converting variables
posts$numdate <- as.numeric(posts$date_post - min(posts$date_post))
posts$type <- factor(posts$post_type, order=F)
levels(posts$type) <- list(link="link",
                             media=c("photo", "video"),
                             status=c("event", "status"))

# preprocessing (with quanteda)
Dfm <- dfm(posts$post_message,   
           ignoredFeatures = c(stopwords('german'), 
                               c('dass', 'wurde', 'wurden')),
           stem = T,
           removeTwitter=F,
           removeSeparators=T,
           removePunct=T,
           language='german')

processed <- convert(Dfm, to="stm", docvars = posts) 

out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     lower.thresh = 4) # 5095 terms, 3743 documents

# fitting the model
posts30 <- stm(out$documents, out$vocab, K = 30, 
               init.type='Spectral', verbose=T, emtol=1e-5,
               prevalence =~ type+ s(numdate, degree=3) , 
               data=out$meta)

# effects
prep30  <- estimateEffect(c(1:30) ~ type + s(numdate, degree=3) , posts30,
                          metadata=out$meta)

plot.estimateEffect(prep30, "type", model=posts30, method="pointestimate",
                    ci.level=.95, topics=c(7,12))

Error in names(cdata) <- covariate : object 'cdata' not found

plot.estimateEffect(prep30, "numdate", model=posts30, method="continuous",
                    ci.level=.95, topics=c(8))

Error in names(cdata) <- covariate : object 'cdata' not found
bstewart commented 8 years ago

Carsten,

Thanks! So this actually works for me. I load the workspace and run:

plot.estimateEffect(prep30, "type", model=posts30, method="pointestimate",
                    ci.level=.95, topics=c(7,12))

and everything works. This works on the current version on CRAN as well as the development version on Github.

Can you do a sessionInfo() and send me what versions of everything you are using? Since I can't replicate the error it would also be helpful if you could generate it and then run a traceback() and send those results as well.

cschwem2er commented 8 years ago

Ok, this is getting a bit strange. I prepared output from sessionInfo() and then noticed that there are a bunch of other packages loaded in the background. So I restarted R with an empty workspace, ran the code again and it worked without any issues. Could the problem arise from namespace conflicts between packages? I will try to replicate it once again and, if I'm able to, I will also provide you with debugging information.

bstewart commented 8 years ago

Thanks. That's super interesting. If we knew what function was causing the problem, I could try to make sure the proper version gets called by explicitly calling its namespace.

cschwem2er commented 8 years ago

Hi Brandon,

the cdata problem just appeared on my windows machine again. Here is the output from sessionInfo:

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] stm_1.1.4            stringr_1.0.0        Rtsne_0.11           LDAvis_0.3.2         scales_0.4.0        
 [6] shinydashboard_0.5.1 shinyBS_0.61         plotly_3.6.0         ggplot2_2.1.0        ggnet_0.1.0         
[11] igraph_1.0.1         dplyr_0.5.0          shiny_0.13.2        

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6        plyr_1.8.4         iterators_1.0.8    base64enc_0.1-3    viridis_0.3.4     
 [6] tools_3.3.0        digest_0.6.10      jsonlite_1.0       tibble_1.1         gtable_0.2.0      
[11] lattice_0.20-33    foreach_1.4.3      Matrix_1.2-6       DBI_0.4-1          yaml_2.1.13       
[16] lda_1.4.2          gridExtra_2.2.1    httr_1.2.1         htmlwidgets_0.7    glmnet_2.0-5      
[21] grid_3.3.0         R6_2.1.2           tidyr_0.5.1        magrittr_1.5       codetools_0.2-14  
[26] splines_3.3.0      matrixStats_0.50.2 htmltools_0.3.5    assertthat_0.1     mime_0.5          
[31] xtable_1.8-2       colorspace_1.2-6   httpuv_1.3.3       labeling_0.3       intergraph_2.0-2  
[36] stringi_1.1.1      network_1.13.0     munsell_0.4.3      slam_0.1-37        Cairo_1.5-9     

And this is the traceback():

2: produce_cmatrix(prep = x, covariate = covariate, method = method, 
       cov.value1 = cov.value1, cov.value2 = cov.value2, npoints = npoints, 
       moderator = moderator, moderator.value = moderator.value)
1: plot.estimateEffect(prep, "day", topics = 3, method = "continuous")
bstewart commented 8 years ago

Are you trying to run the same example you sent me before or is this a new example?

Looking through the code again I am pretty sure I can see the problem and can get it fixed in a release coming this month. The key is just making sure that the variables you are calling are typed appropriately. So if it's numeric, make sure it's stored as numeric, etc. I'm not super sure why loading other packages would matter for that, but this is my guess.

If you want to post a reproducible example, i'm happy to work on it.

cschwem2er commented 8 years ago

Thank you for investigating and nice to hear that a new release is coming soon :) The output from above is from another example. Please find a workspace for reproduction here. I think the problem comes from two packages: dplyr and/or plotly. As soon as I load either of these packages the cdata bug appears. If I just load stm or any other additional package the plot works without any issues.

bstewart commented 8 years ago

Oh wow. Okay, thanks for this. It's definitely dplyr. It changes the way subsetting with "[" works in data frames. That's irritating... I'll find a workaround before the next release. I'm really sorry about the inconvenience in the interim!

cschwem2er commented 8 years ago

No worries! I just wanted to mention again that the error is also raised for me if I only import plotly. This is a sessionInfo() before the import:

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stm_1.1.4      quanteda_0.9.8

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6        lattice_0.20-33    matrixStats_0.50.2 codetools_0.2-14   lda_1.4.2         
 [6] glmnet_2.0-5       foreach_1.4.3      slam_0.1-37        R6_2.1.2           chron_2.3-47      
[11] grid_3.3.0         magrittr_1.5       httr_1.2.1         stringi_1.1.1      data.table_1.9.6  
[16] ca_0.64            Matrix_1.2-6       splines_3.3.0      iterators_1.0.8    tools_3.3.0       
[21] stringr_1.0.0      parallel_3.3.0   

And this one is after the import:

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plotly_3.6.0   ggplot2_2.1.0  stm_1.1.4      quanteda_0.9.8

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6        ca_0.64            magrittr_1.5       splines_3.3.0      munsell_0.4.3     
 [6] colorspace_1.2-6   lattice_0.20-33    R6_2.1.2           foreach_1.4.3      stringr_1.0.0     
[11] httr_1.2.1         plyr_1.8.4         tools_3.3.0        parallel_3.3.0     grid_3.3.0        
[16] glmnet_2.0-5       data.table_1.9.6   gtable_0.2.0       htmltools_0.3.5    iterators_1.0.8   
[21] matrixStats_0.50.2 assertthat_0.1     digest_0.6.10      tibble_1.1         Matrix_1.2-6      
[26] gridExtra_2.2.1    lda_1.4.2          tidyr_0.5.1        viridis_0.3.4      htmlwidgets_0.7   
[31] base64enc_0.1-3    codetools_0.2-14   slam_0.1-37        stringi_1.1.1      scales_0.4.0      
[36] jsonlite_1.0       chron_2.3-47      

There's plyr .. =)

cschwem2er commented 8 years ago

Hi Brandon,

did you find a solution for the dplyr problem which is not yet visible on Github? And if not, can I help you somehow?

Thanks

bstewart commented 8 years ago

Would you mind reposting the test data? I just started looking at this again and I want to test my attempts at fixing but I'm having a hard time reproducing the error.

cschwem2er commented 7 years ago

My apologies for responding so late. Please find the test data from above here.

cschwem2er commented 7 years ago

Were you able to reproduce the error with my test data?

bstewart commented 7 years ago

Wow okay this took quite a while but I'm pretty confident I fixed this in Version 1.2 (currently in development branch, should be pulled into the master in the next few days).

For myself later if I need a reference: the issue is that dplyr makes everything a tibble, which essentially does drop=FALSE by default. We had a bunch of things that looked like data[,covariate] where we intended it to give a vector that we could do some calculation on. So I had to go in and change all of those to data[[covariate]] style subsetting operations.
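The subsetting difference described above can be reproduced with base R alone. This is a minimal sketch with a made-up data frame; passing drop = FALSE to a plain data.frame mimics what a tibble does by default when a single column is selected with "[".

```r
# Hypothetical data; 'covariate' holds a column name, as in data[, covariate].
df <- data.frame(core = c(0, 1, 1), numdate = c(10, 20, 30))
covariate <- "core"

v1 <- df[, covariate]                # base data.frame: drops to a plain vector
v2 <- df[, covariate, drop = FALSE]  # one-column data.frame (the tibble default)
v3 <- df[[covariate]]                # always the column itself

is.data.frame(v1)  # FALSE
is.data.frame(v2)  # TRUE
identical(v1, v3)  # TRUE
```

Because "[[" returns the column itself for both data frames and tibbles, code written with data[[covariate]] does not depend on which of the two it receives.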

cschwem2er commented 7 years ago

thanks for the update :) will 1.2 also be released on CRAN?

djacques7188 commented 5 years ago

I am still getting a similar error, but I have isolated the issue. I'll open a new issue with the suggested fix and link a PR.

holnburger commented 5 years ago

@djacques7188 is it already fixed? Because I'm running into this problem with the most recent version from github.

Error in produce_cmatrix(prep = x, covariate = covariate, method = method, : object 'cdata' not found

Even without loading dplyr (only stm) I get the cdata error.

djacques7188 commented 5 years ago

Notice in lines 42-52 of the produce_cmatrix() function that cdata is created using an if() statement that depends on the data type.

This section does not look for a logical type. I submitted a PR to fix this, but it has not been accepted yet.

The workaround is to change any variable in your data that is logical to a character, integer or numeric before you train the model. For the numeric case you can do this by multiplying the logical by 1.

Hope this helps.
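A minimal sketch of that workaround, using a made-up metadata column (treated is hypothetical and not from this thread). Any one of the three conversions is enough; they are shown side by side only for comparison.

```r
# Hypothetical metadata with a logical covariate.
meta <- data.frame(treated = c(TRUE, FALSE, TRUE))

meta$treated_num <- meta$treated * 1             # numeric 1/0
meta$treated_int <- as.integer(meta$treated)     # integer 1/0
meta$treated_chr <- as.character(meta$treated)   # "TRUE"/"FALSE"

# Inspect the resulting classes before fitting the model.
sapply(meta, class)
```

As noted above, the conversion has to happen before the model is trained, so that the converted column is the one stm sees.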

bstewart commented 5 years ago

Hey everyone, sorry for all this- I'll work on this today. I've been a bit behind with stm updates recently.

bstewart commented 5 years ago

Thanks again to @djacques7188 for fixing this. I should be getting back into STM development in the next couple of weeks but for now I pulled this into the master.

stevenjmorgan commented 5 years ago

Is this update in the latest CRAN release? I am receiving the same error message even after removing dplyr and plotly.

bstewart commented 5 years ago

It isn't- sorry. Obviously it will be in the next one, but for now I would use the github version!

stevenjmorgan commented 5 years ago

Not a problem! I pulled from github and it worked. Thank you!

jonneguyt commented 4 years ago

Just ran into this, pulled from github and still suffered from it. Then I realized I was working with a "date" variable, not a logical one. While it made me revisit my specification, I just wanted to let you know that the fix does not help users who have date variables in their specification.

katwag1 commented 4 years ago

Hello! I am trying to use plot.estimateEffect with document covariates for month and year. I keep getting the same error as above. I am trying to run:

plot.estimateEffect(prep, covariate = "monthyear3", method = "continuous", topics = 2, model = newspaper_stm, printlegend = FALSE)

Initially, my month metadata was in the format "February" "January" etc. and year was "2012" "2013" etc. Created another column in the metadata for my stm object by combining the two and then inserted monthyear as the covariate for the plot estimate effect command but this gave me the cdata object not found error.

monthyear = paste(dfm2stm$meta$month, dfm2stm$meta$year)
dfm2stm$meta$monthyear <- monthyear

So then I tried converting it into date format:

dfm2stm$meta$monthyear2 = as.yearmon(dfm2stm$meta$monthyear)
dfm2stm$meta$monthyear3 = as.Date(dfm2stm$meta$monthyear2, frac = 1)

The first line gives the format "Feb 2012" etc. and the second gives the last day of the month, e.g. 2012-02-29. Neither modification has gotten rid of the cdata object not found error.

Any suggestions on how I can modify my covariates for month and year to get rid of this issue? @jonneguyt have you possibly found a solution for date variables?

Any help would be greatly appreciated!

djacques7188 commented 4 years ago

@katwag1

The above example is not reproducible.

What are the prep and dfm2stm objects?

Did you retrain your estimateEffect object after changing the type?

katwag1 commented 4 years ago

@djacques7188 Apologies for not making it clearer. dfm2stm is my dfm converted into an stm object. prep is my output from estimateEffect(). My issue is that my yearmonth covariate is a date and not in numeric form.

Since posting, I have come across the tidystm package (https://github.com/mikajoh/tidystm) which has helped to some extent by converting the date into numeric form and using ggplot to plot the outcome of extract.estimateEffect. I figured it could be possible to re-label the numeric dates on the x-axis using a variation of scale_x_continuous(breaks=seq(), labels=c("",""))

The issue I still have, however, is that once my date is converted into numeric form (e.g. 201201 and 201512 for January 2012 and December 2015), a continuous plot also includes values such as 201252, which is not a month. I'm thinking there must be a way to specify scale_x_continuous(breaks=seq(), labels=c("","")) so that the expected topic proportion is plotted over the 2012-2020 time frame with a tick every four months along the x-axis, given that my yearmonth covariate is now in the form "201201, 201202, 201203...." etc.

jonneguyt commented 4 years ago

I believe I simply recoded the date variable as a numeric running from [1, to the maximum number of months since start]. If you don't do this, you'll indeed get weird jumps (201252 to 201301).

What you could do, is do exactly the same for your setting and then use the original date variable as the "label".

Granted, this is a bit of a hack and doesn't solve the underlying problem.
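The recoding described above could look roughly like this. It is a sketch with made-up dates; month_index() is a hypothetical helper, not part of stm or tidystm. It maps each date to the number of months since the first month in the data (starting at 1) and builds matching axis labels every four months.

```r
# Hypothetical document dates.
dates <- as.Date(c("2012-01-15", "2012-12-03", "2013-01-20", "2015-12-31"))

# Months since the earliest month in the data, starting at 1.
month_index <- function(d, origin = min(d)) {
  d <- as.POSIXlt(d)
  o <- as.POSIXlt(origin)
  as.numeric((d$year - o$year) * 12 + (d$mon - o$mon) + 1)
}

idx <- month_index(dates)  # 1 12 13 48: December 2012 and January 2013 are adjacent

# Tick positions every four months, with year-month labels for the axis.
breaks <- seq(1, max(idx), by = 4)
first_month <- as.Date(format(min(dates), "%Y-%m-01"))
labels <- format(seq(first_month, by = "4 months", length.out = length(breaks)),
                 "%Y-%m")
```

The numeric idx column is what would go into the model formula, while breaks and labels can be handed to whatever plotting function draws the x-axis.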

katwag1 commented 4 years ago

Hi @jonneguyt, thanks for the reply! This is definitely a much better way to think of it than converting to numeric in the form 201201. It's solved my issue!

aaronrudkin commented 4 years ago

I ran into this and was frustrated. In my case, the unexpected data type was datediff, a numeric type that has some custom S3 methods.

Inside produce_cmatrix, we see this code for constructing cdata (and related code later in the continuous case):

types <- lapply(prep$data, function(x) class(x)[1]) # Returns the class "datediff" for this column
...
if(types[covariateofinterest] == "character") ...
if(types[covariateofinterest] == "factor") ...
if(types[covariateofinterest] == "numeric" || types[covariateofinterest] == "integer") ...
names(cdata) <- covariate

The names(cdata) step will fail if the variable's type is not character, factor, numeric, or integer, but it'll fail in a nondescript way -- by erroring that cdata does not exist. These three steps can solve the problem once and for all:

  1. The initial type dispatch should not rely on classes, it should rely on functionality. It is not idiomatic R to check the item's first class because in R people use class names as decorators to add additional context-specific S3 methods to base types. Personally, I would simply try to coerce to numeric and see if it fails. At the very least, if you must use classes, check whether e.g. "numeric" is IN the vector of class names, not whether it's the first item. Class order is not important.

  2. Given the choice to write cdata in this way, it makes more sense to use the if-else if-else idiom and have a default case at the end. Even if you don't know what type of data the covariate of interest is, you can still attempt to build cdata in a generic fashion. If it errors, it errors, but at least try.

  3. If this code errors, the error should be clear to users: this error is happening because of an unexpected type of the covariate of interest that can't be used to construct the cdata matrix. One of the striking things about this thread is that it's actually not even clear to most users why the error is occurring.
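The three suggestions above can be sketched together as follows. covariate_kind() is a hypothetical helper written for illustration; it is not the code in produce_cmatrix(), and the decision to treat Date and difftime columns as numeric is an assumption of this sketch.

```r
# Hypothetical dispatch: test what the column can do, fall through to a
# default case, and fail with a message that names the real problem.
covariate_kind <- function(x, name = "covariate") {
  if (is.factor(x)) {
    "factor"
  } else if (is.character(x)) {
    "character"
  } else if (is.logical(x)) {
    "logical"
  } else if (is.numeric(x) || inherits(x, c("Date", "difftime"))) {
    # is.numeric() is TRUE for integers, doubles and numeric vectors that
    # carry extra S3 class names; dates are assumed to be coercible.
    "numeric"
  } else {
    stop(sprintf(
      "Covariate '%s' has unsupported class '%s'; convert it to numeric, factor, or character.",
      name, paste(class(x), collapse = "/")
    ), call. = FALSE)
  }
}

covariate_kind(1L)                                           # "numeric"
covariate_kind(structure(c(1.5, 2), class = c("myunits", "numeric")))  # "numeric"
covariate_kind(Sys.Date())                                   # "numeric"
```

With a check like this, an unexpected column type produces an error that mentions the covariate and its class rather than a missing internal object.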

So to anyone who got here, the issue is that stm wants your covariate of interest to very narrowly be character, factor, numeric, or integer for point estimates, and numeric or integer for continuous, and if it's not one of those things, you're going to get this nondescript error. So try converting to one of those types and see if it works.
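A quick way to check your own metadata before calling estimateEffect() might look like this. unsupported_covariates() is a hypothetical helper; the default list of supported classes is taken from the comment above rather than from the package source, and the metadata is made up.

```r
# Hypothetical pre-flight check: report columns whose first class is not
# one of the types named above.
unsupported_covariates <- function(meta,
                                   supported = c("character", "factor",
                                                 "numeric", "integer")) {
  first_class <- vapply(meta, function(x) class(x)[1], character(1))
  first_class[!first_class %in% supported]
}

# Made-up metadata with a Date and a logical column.
meta <- data.frame(day  = as.Date("2016-07-04") + 0:2,
                   flag = c(TRUE, FALSE, TRUE),
                   core = c(0, 1, 1))

before <- unsupported_covariates(meta)  # flags 'day' (Date) and 'flag' (logical)

# Convert the flagged columns, then check again.
meta$day  <- as.numeric(meta$day - min(meta$day))  # days since the first date
meta$flag <- as.numeric(meta$flag)
after <- unsupported_covariates(meta)              # nothing left to flag
```

As mentioned earlier in the thread, the estimateEffect object needs to be re-estimated after the types are changed.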