Closed: bschneidr closed this issue 2 years ago
Good catch! Yes, such a pull request would be welcome. Thanks!
Sorry for my delay in getting back to you! This looks great -- thanks again for all of this.
Before I close this issue, I wanted to check a few things:
1. In `cv.svydesign.R`, should the line

```r
formula_vars <- Reduce(x = formula_vars, f = unique)
```

be

```r
formula_vars <- unique(unlist(formula_vars))
```

instead?

```r
library(survey)
library(surveyCV)

# Last example from ?survey::svydesign
library(RSQLite)
design_object <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc,
                           data = "apiclus1", dbtype = "SQLite",
                           dbname = system.file("api.db", package = "survey"))
inherits(design_object, "DBIsvydesign")

formulae <- c("api00~ell",
              "api00~ell+meals",
              "api00~ell+meals+mobility")

# Based on new code in cv.svydesign.R
formula_vars <- lapply(formulae,
                       function(formula_string) all.vars(as.formula(formula_string)))

Reduce(x = formula_vars, f = unique)
## [1] "api00" "ell"

unique(unlist(formula_vars))
## [1] "api00"    "ell"      "meals"    "mobility"
```
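As I understand it (worth double-checking), the reason `Reduce()` misbehaves here is that `unique()`'s second positional argument is `incomparables`, not another vector to merge, so every list element after the first is silently consumed:

```r
# Reduce(f = unique, x = list(a, b)) evaluates unique(a, b).
# unique()'s signature is unique(x, incomparables, ...), so b is treated
# as 'incomparables' rather than merged in, and its new values are lost.
a <- c("api00", "ell")
b <- c("api00", "ell", "meals")

unique(a, b)                # b silently consumed as 'incomparables'
## [1] "api00" "ell"

unique(unlist(list(a, b)))  # flatten first, then deduplicate
## [1] "api00" "ell"   "meals"
```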
2. Inside `cv.svydesign()`, when it calls `cv.svy()`, why did you switch to `nest = FALSE`? Is it just because users have already created a `svydesign`, and so if there were a problem with `nest = FALSE` it would have already thrown an error earlier?

...

I guess in other words: is there ever any reason NOT to use `nest = TRUE`, other than simply skipping the extra step of creating a new guaranteed-unique clusters-in-strata variable when you know you already have one?
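For context, here is a minimal toy-data sketch (not from the package) of what `nest = TRUE` does when PSU labels repeat across strata:

```r
library(survey)

# Toy data: the PSU label "1" appears in both stratum "a" and stratum "b",
# but these are really two different PSUs.
df <- data.frame(
  strat = rep(c("a", "b"), each = 4),
  psu   = rep(c(1, 2, 1, 2), each = 2),
  w     = 1,
  y     = rnorm(8)
)

# Without nest = TRUE, svydesign() refuses, because the same PSU label
# appears to cross strata:
try(svydesign(id = ~psu, strata = ~strat, weights = ~w, data = df))

# With nest = TRUE, PSU labels are treated as nested within strata:
des <- svydesign(id = ~psu, strata = ~strat, weights = ~w, data = df,
                 nest = TRUE)
```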
3. Related to (2), in your test suites when you're testing the stratified cluster design, why create `apiclus1[['DNUM_BY_STYPE']]` instead of just using `dnum`?
Essentially, `folds.svy()` itself enforces nesting: it makes random folds from PSUs separately within each stratum, whether or not we've specified `nest = TRUE` earlier. Do you know of any situations where this wouldn't be the right thing to do?
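To make sure we're describing the same behavior, here's a rough sketch of "random folds from PSUs separately within each stratum" (a hypothetical helper, not the actual `folds.svy()` code):

```r
# Hypothetical sketch (not surveyCV's implementation): assign each PSU to
# one of K folds, randomizing separately within each stratum, so that
# every observation in a PSU lands in the same fold.
assign_folds <- function(strata, psu, K) {
  fold <- integer(length(psu))
  for (s in unique(strata)) {
    in_s <- strata == s
    psus <- unique(psu[in_s])
    # Spread this stratum's PSUs as evenly as possible across the K folds
    f <- sample(rep(seq_len(K), length.out = length(psus)))
    fold[in_s] <- f[match(psu[in_s], psus)]
  }
  fold
}

strata <- rep(c("a", "b"), each = 6)
psu    <- rep(1:6, each = 2)   # PSUs 1-3 in stratum a, 4-6 in stratum b
folds  <- assign_folds(strata, psu, K = 3)
```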
I think if the package design were changed so that `cv.svy()` was a wrapper around `cv.svydesign()` rather than the other way around, it could more easily handle raking/post-stratification/calibration. But I think that's an issue for another time, as it's not completely clear whether or how calibration should be taken into account, as you mention in the paper.
Yes, definitely worth thinking about further. I wonder if that swap would also help with database-backed objects: right now, thanks to your updates, we can take a `DBIsvydesign`, but we end up turning it into a data frame immediately, which might be problematic for large databases. But if we used `subset` and `transform` on the original `svydesign` object to get training sets etc. (instead of subsetting the data frame and creating new `svydesign` objects), then if users gave us a `DBIsvydesign`, presumably R would be able to use `survey`'s internal tricks and keep everything in the database as much as possible. Do you think that's worth pursuing?
Closing for now, based on @bschneidr's May pull request, merged in with minor changes of my own. But I'll open a separate issue regarding the suggestion of

> if the package design was changed so that `cv.svy()` was a wrapper around `cv.svydesign()` ...
While trying out the package, I noticed that the function `cv.svydesign()` breaks down if the survey design object has been subsetted or updated, for example by using `subset(design_object, ...)` or `transform(design_object, ...)`.

The reason this bug arises is that `cv.svydesign()` relies heavily on extracting design information from the `call` element of the design object, `design_object$call`. This is risky because the `call` object is very fragile: it gets updated whenever `subset()` or `transform()` is used, and for design objects created with the `srvyr` package it looks completely different.

I'm happy to submit a pull request to "robustify" the handling of survey design objects and add some unit tests for this. Would you be open to that?
Thanks,
Ben