ColbyStatSvyRsch / surveyCV

R package {surveyCV}: K-fold cross-validation for complex sample survey designs, and associated paper (https://doi.org/10.1002/sta4.454)

Robust svydesign handling #4

Closed bschneidr closed 2 years ago

bschneidr commented 2 years ago

This pull request changes how the functions folds.svydesign() and cv.svydesign() extract design information from a survey.design object. The functions are now much safer: they work with designs that have been subsetted or otherwise updated, and they are compatible with survey designs from the 'srvyr' package. This resolves Issue #3.

Details on 'robustifying' the `svydesign` functions:

> Instead of using the `design_object$call` element, which is fragile for reasons discussed in #3, each design variable is pulled from the appropriate part of the design object (`design_object$strata`, `design_object$cluster`, `design_object$fpc`). The weights are computed as `1/design_object$prob`, which is what the S3 method `weights(design_object)` does and is how the 'survey' package stores weights in general.
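To illustrate the idea, here is a minimal sketch of reading first-stage design information directly from a design object, using the `apiclus1` example data that ships with the 'survey' package. The helper name `extract_design_info()` and the shape of its return value are hypothetical and are not the code in this pull request; the sketch only assumes the `$cluster`, `$strata`, `$fpc`, and `$prob` components described above.

```r
library(survey)

# Example data from the 'survey' package: a one-stage cluster sample of schools.
data(api)
design_object <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# A subsetted design: the kind of updated object that a call-based lookup handles badly.
design_subset <- subset(design_object, stype == "E")

# Hypothetical helper (not part of surveyCV): pull first-stage design
# information from the components of the design object itself.
extract_design_info <- function(design_object) {
  popsize <- design_object$fpc$popsize
  list(
    cluster_ids = design_object$cluster[[1]],   # first-stage sampling units
    strata_ids  = design_object$strata[[1]],    # first-stage strata
    fpc         = if (is.null(popsize)) NULL else popsize[, 1],
    weights     = 1 / design_object$prob        # same as weights(design_object)
  )
}

info <- extract_design_info(design_subset)

# The extracted vectors line up with the rows remaining after subsetting.
length(info$weights) == nrow(design_subset$variables)
isTRUE(all.equal(unname(info$weights), unname(weights(design_subset))))
```

Because everything is read from the object's current state, the same helper should apply to the original design and to the subsetted one without any knowledge of how either was created.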

I also added informative warnings when the user supplies a survey design object with multiple stages of sampling, which warn the user that only first-stage sampling units and first-stage strata are used for cross-validation. In addition, I added an informative error message if the user attempts to supply a replicate design object or accidentally supplies a data frame instead of a survey design object.
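As a rough sketch of what such checks could look like (the function name `check_design_input()` and the message wording are hypothetical, not the text used in this pull request), the class of the input and the number of columns in `design_object$cluster` are enough to distinguish the cases described above:

```r
library(survey)

# Hypothetical validation helper (not part of surveyCV); messages are illustrative.
check_design_input <- function(design_object) {
  # Replicate-weight designs from the 'survey' package have class "svyrep.design".
  if (inherits(design_object, "svyrep.design")) {
    stop("Replicate design objects are not supported; supply a survey.design object.")
  }
  # Catches data frames and anything else that is not a survey design.
  if (!inherits(design_object, "survey.design")) {
    stop("Expected a survey.design object, e.g. one created by survey::svydesign().")
  }
  # One column of cluster IDs per sampling stage.
  if (ncol(design_object$cluster) > 1) {
    warning("Multistage design: only first-stage sampling units and strata are used.")
  }
  invisible(TRUE)
}

data(api)
one_stage <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
two_stage <- svydesign(id = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2)

check_design_input(one_stage)                       # passes silently
try(check_design_input(apiclus1))                   # error: a data frame was supplied
try(check_design_input(as.svrepdesign(one_stage)))  # error: replicate design
check_design_input(two_stage)                       # warning: multistage design
```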

As an aside, one thing this update doesn't improve is the handling of raked/post-stratified/calibrated survey design objects. This is because the cv.svy() function doesn't know how to handle calibration, and cv.svydesign() is essentially just a wrapper for cv.svy(). I think if the package design were changed so that cv.svy() was a wrapper around cv.svydesign() rather than the other way around, it could more easily handle raking/post-stratification/calibration. But I think that's an issue for another time, since it's not completely clear whether or how calibration should be taken into account, as you mention in the paper.

civilstat commented 2 years ago

Thanks! What you describe sounds good -- I should have time on Monday or Tuesday to look over the details and merge it in.