ColbyStatSvyRsch / surveyCV

R package {surveyCV}: K-fold cross-validation for complex sample survey designs, and associated paper (https://doi.org/10.1002/sta4.454)
7 stars 1 forks source link

Bug: Unable to handle survey designs which have been updated or subsetted #3

Closed bschneidr closed 2 years ago

bschneidr commented 2 years ago

While trying out the package, I noticed that the function cv.svydesign() breaks down if the survey design object has been subsetted or been updated, for example by using subset(design_object, ...) or transform(design_object, ...).

suppressPackageStartupMessages({
  library(survey)
  library(surveyCV)
  library(splines)
})

# Load example data

  data(NSFG_data)

# Works for the original design
  NSFG.svydes <- svydesign(id = ~SECU, strata = ~strata, nest = TRUE,
                           weights = ~wgt, data = NSFG_data)

  cv.svydesign(formulae = c("income ~ ns(age, df = 2)",
                            "income ~ ns(age, df = 3)",
                            "income ~ ns(age, df = 4)"),
               design_object = NSFG.svydes, nfolds = 4)
#>           mean     SE
#> .Model_1 22611 754.35
#> .Model_2 22504 747.77
#> .Model_3 22762 766.08

# Breaks as soon as we add new variables

  updated_design <- svydesign(id = ~SECU, strata = ~strata, nest = TRUE,
                           weights = ~wgt, data = NSFG_data) |>
    transform(dummy_variable = 1)

  cv.svydesign(formulae = c("income ~ ns(age, df = 2)",
                            "income ~ ns(age, df = 3)",
                            "income ~ ns(age, df = 4)"),
               design_object = updated_design, nfolds = 4)
#> Error in cv.svydesign(formulae = c("income ~ ns(age, df = 2)", "income ~ ns(age, df = 3)", : length(ids.formula) == 2 is not TRUE

# Also breaks when we subset

  subsetted_design <- svydesign(id = ~SECU, strata = ~strata, nest = TRUE,
                                weights = ~wgt, data = NSFG_data) |>
    subset(CASEID != "70627")

  cv.svydesign(formulae = c("income ~ ns(age, df = 2)",
                            "income ~ ns(age, df = 3)",
                            "income ~ ns(age, df = 4)"),
               design_object = subsetted_design, nfolds = 4)
#> Error in cv.svydesign(formulae = c("income ~ ns(age, df = 2)", "income ~ ns(age, df = 3)", : length(ids.formula) == 2 is not TRUE

The reason this bug arises is because cv.svydesign() relies heavily upon extracting design information from the call element of the design object, design_object$call. This is risky because the call object is very fragile: it gets updated whenever subset() or transform() is used, and for design object with the srvyr package it looks completely different.

# The reason for this is that the `call` element of the design object is fragile

  NSFG.svydes$call
#> svydesign(id = ~SECU, strata = ~strata, nest = TRUE, weights = ~wgt, 
#>     data = NSFG_data)
  updated_design$call
#> update(`_data`, ...)
  subsetted_design$call
#> subset(svydesign(id = ~SECU, strata = ~strata, nest = TRUE, weights = ~wgt, 
#>     data = NSFG_data), CASEID != "70627")

I'm happy to submit a pull request to "robustify" the handling of survey design objects and add some unit tests for this. Would you be open to that?

Thanks,

Ben

civilstat commented 2 years ago

Good catch! Yes, such a pull request would be welcome. Thanks!

civilstat commented 2 years ago

Sorry for my delay in getting back to you! This looks great -- thanks again for all of this.

Before I close this issue, I wanted to check a few things:

  1. Are you sure about formula_vars <- Reduce(x = formula_vars, f = unique) in cv.svydesign.R?
    When I try it, I only get the variables from the 1st formula. Did you want something like formula_vars <- unique(unlist(formula_vars)) instead?
library(survey)
library(surveyCV)

# Last example from ?survey::svydesign
library(RSQLite)
design_object <- svydesign(id=~dnum, weights=~pw, fpc=~fpc,
  data="apiclus1",dbtype="SQLite", dbname=system.file("api.db",package="survey"))
inherits(design_object, "DBIsvydesign")

formulae = c("api00~ell",
  "api00~ell+meals",
  "api00~ell+meals+mobility")

# Based on new code in cv.svydesign.R
formula_vars <- lapply(formulae, function(formula_string) all.vars(as.formula(formula_string)))
Reduce(x = formula_vars, f = unique)
## [1] "api00" "ell" 
unique(unlist(formula_vars))
## [1] "api00"    "ell"      "meals"    "mobility"
  1. Inside cv.svydesign(), when it calls cv.svy(), why did you switch to nest=FALSE?
    Is it just because users have already created a svydesign, and so if there were a problem with nest=FALSE it would have already thrown an error earlier?
    ...
    I guess in other words: is there ever any reason NOT to use nest=TRUE, other than simply skipping the extra step of creating a new guaranteed-unique clusters-in-strata variable when you know you already have one?

  2. Related to (2), in your test suites when you're testing the stratified cluster design, why create apiclus1[['DNUM_BY_STYPE']] instead of just using dnum?
    Essentially, folds.svy() itself enforces nesting: it makes random folds from PSUs separately within each stratum, whether or not we've specified nest=TRUE earlier. Do you know of any situations where this wouldn't be the right thing to do?

civilstat commented 2 years ago
  1. In the pull request comments, you wrote:

I think if the package design was changed so that cv.svy() was a wrapper around cv.svydesign() rather than the other way around, it could more easily handle raking/post-stratification/calibration. But I think that's an issue for another time, as it's not completely clear whether/how calibration should be taken into account, as you mention in the paper.

Yes, definitely worth thinking about further. I wonder if that swap would also help with database-backed objects:
Right now, thanks to your updates we can take a DBIsvydesign -- but we end up turning it into a dataframe immediately, which might be problematic for large databases.
But if we used subset and transform on the original svydesign object to get training sets etc. (instead of subsetting the dataframe and creating new svydesign objects)... then if users gave us a DBIsvydesign, presumably R would be able to use survey's internal tricks and keep everything in the database as much as possible. Do you think that's worth pursuing?

civilstat commented 2 years ago

Closing for now, based on @bschneidr's May pull request, merged in with minor changes of my own. But I'll open a separate issue regarding the suggestion of

if the package design was changed so that cv.svy() was a wrapper around cv.svydesign()...