RyanHornby / csSampling

csSampling with multilevel models? #5

Open awcm0n opened 8 months ago

awcm0n commented 8 months ago

I'm interested in using the csSampling package to fit multilevel models to complex survey data, but I didn't succeed in fitting a simple random-intercept model. After the Stan model was fit, the process stalled without an error message. So my question is: Is there any guidance as to what types of models can and cannot be fit with the csSampling package?

mrdwill commented 7 months ago

Thanks for your patience. Could you provide more information about the model and the data dimensions? We've used this to fit simple random-intercept models before. Did you use the brms wrapper or a custom Stan model? If the Stan model fit, then the issue is in the post-processing. There are some data type conversions that could be inefficient for large numbers of samples/draws. I'll start looking into it, so any specifics you can provide would greatly help!

mrdwill commented 7 months ago

I think the crux of the issue is here: https://discourse.mc-stan.org/t/as-matrix-for-unconstrained-parameters/11528/2. The current cs_sampling function spends a lot of effort (nested for-loops) converting a list of constrained Stan parameters to a matrix of unconstrained parameters, row by row. There may be a workaround that reads the unconstrained values from a diagnostic file's CSV output.
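To make the row-by-row pattern concrete, here is a minimal base-R sketch. It is not the csSampling source and does not call Stan: `toy_unconstrain()` is a hypothetical stand-in for a per-draw transform (with rstan, that role would be played by `unconstrain_pars()`), and the parameter names `b` and `sigma` are made up for illustration. The point is only that rebuilding one parameter list per draw scales with draws times parameters, and a person-level random intercept adds one parameter per respondent.

```r
# Toy posterior draws (assumption: NOT produced by Stan or csSampling).
# One array per constrained parameter, with draws along the first dimension.
set.seed(1)
n_draws <- 200
draws <- list(
  b     = matrix(rnorm(n_draws * 2), nrow = n_draws),  # unbounded coefficients
  sigma = rexp(n_draws)                                 # positive scale
)

# Hypothetical per-draw transform: identity for b, log for sigma.
toy_unconstrain <- function(par_list) c(par_list$b, log(par_list$sigma))

# Row-by-row: rebuild a parameter list for each draw, then transform it.
upars_loop <- matrix(NA_real_, nrow = n_draws, ncol = 3)
for (i in seq_len(n_draws)) {
  draw_i <- list(b = draws$b[i, ], sigma = draws$sigma[i])
  upars_loop[i, ] <- toy_unconstrain(draw_i)
}

# Vectorized equivalent: transform whole columns in one pass.
upars_vec <- cbind(draws$b, log(draws$sigma))

# Both routes give the same matrix of unconstrained values.
all.equal(upars_loop, upars_vec)
```

With three parameters the loop is instant; the cost only becomes noticeable when each draw carries thousands of parameters, which is the situation a random intercept per respondent would create.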

awcm0n commented 7 months ago

Thanks for looking into the issue. I created a minimal example of what I'm trying to do. The code below loads a 2-year MEPS longitudinal data file in wide format and converts it to long format. The goal of the analysis is to estimate the change in k6sum from 2019 to 2020. Instead of including a person-level fixed effect, as economists are wont to do, I want to include a person-level random intercept, (1|dupersid), in the model. The Stan model fit, but the post-processing appears to be stuck.

if(!require("MEPS")) {
  library(devtools)
  install_github("e-mitchell/meps_r_pkg/MEPS")
}
library(tidyverse)
library(MEPS)
library(janitor)
library(survey)
library(srvyr)
library(csSampling)
library(brms)

# create long data set that contains a person's (dupersid) k6 score in 2019 and 2020 
dat <- read_MEPS(file = "h225") %>% # load panel data from MEPS
  clean_names() %>% 
  dplyr::select(dupersid, varpsu, varstr, lsaqwt, age=age2x, k6sum2, k6sum4) %>% 
  pivot_longer(cols = c(k6sum2, k6sum4), names_to = "k6round", values_to = "k6sum") %>% 
  mutate(year = ifelse(k6round=="k6sum2", 2019, 2020) |> as.factor()) %>% 
  dplyr::select(-k6round) %>% 
  mutate(across(where(is.numeric), \(x) as.numeric(x)))

# I want to analyse respondents 18 years and older. The weights are scaled by
# the mean weight of this subsample, so that mean is calculated first.

# subset of respondents 18 years and older with a non-missing k6sum and a positive weight
dat_stan <- dat %>% 
  filter(age>=18 & !is.na(k6sum) & lsaqwt>0) 

mwgt <- mean(dat_stan$lsaqwt)

# scale weights
dat$wgt <- dat$lsaqwt/mwgt
dat_stan$wgt <- dat_stan$lsaqwt/mwgt

# create the design object
dsgn <- dat %>% 
  as_survey_design(ids = varpsu, strata = varstr, weights = wgt, nest = TRUE) %>% 
  filter(age>=18 & !is.na(k6sum) & lsaqwt>0)

set.seed(12345)
model_formula <- formula("k6sum | weights(wgt) ~ year + (1 | dupersid)")

mod.brms <- cs_sampling_brms(svydes = dsgn,
                             brmsmod = brmsformula(model_formula, center = FALSE),
                             data = dat_stan,
                             family = gaussian(),
                             ctrl_stan = list(chains = 1, iter = 2000, warmup = 1000, thin = 1))
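
As a side note on the weight scaling step in the code above, the following small sketch shows what dividing by the mean does. The six weights are made up for illustration and are not MEPS values; `raw_wgt` and `scaled_wgt` are hypothetical names.

```r
# Made-up sampling weights for six respondents (illustration only).
raw_wgt <- c(1200, 850, 430, 2100, 975, 1645)

# Divide by the mean so that the scaled weights average 1.
scaled_wgt <- raw_wgt / mean(raw_wgt)

# The scaled weights then sum to the number of respondents, which keeps the
# weighted likelihood on the scale of the actual sample size.
all.equal(mean(scaled_wgt), 1)
all.equal(sum(scaled_wgt), length(raw_wgt))
```

Relative weights between respondents are unchanged by this step; only their overall scale is.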