ColbyStatSvyRsch / surveyCV

R package {surveyCV}: K-fold cross-validation for complex sample survey designs, and associated paper (https://doi.org/10.1002/sta4.454)

Issue Encountered with folds.svy Function in surveyCV Package #7

Open beyound88 opened 7 months ago

beyound88 commented 7 months ago

Dear surveyCV package author,

I am reaching out for help with an issue I encountered while using the surveyCV package for cross-validation in my survey data analysis. I am using R version 4.3.2 and surveyCV version 0.2.0 on Windows 11.

When attempting to use the folds.svy function for stratified sampling of my survey data, I get the following error message:

Error in folds.svy(Data = Data, nfolds = nfolds, strataID = strataID, : not all nfolds <= table(foldID.clus$strat) are TRUE

This happens when I try to partition my dataset into 5 folds for cross-validation. Below is a simplified code example that reproduces the error:

NHANES_design <- svydesign( data = demo.train, ids = ~sdmvpsu, strata = ~sdmvstra, nest = TRUE, weights = ~nhs_wt, survey.lonely.psu = "adjust")

table(demo.train$sdmvpsu)

   1    2    3
3578 3499  107

tmal1 <- svyglm(y ~ x + age + sex + eth + marital + PIR + edu, NHANES_design, family = quasibinomial())

cv.svyglm(tmal1, nfolds = 5)

Error in folds.svy(Data = Data, nfolds = nfolds, strataID = strataID, : not all nfolds <= table(foldID.clus$strat) are TRUE.

cv.svyglm(tmal1, nfolds = 2)

            mean   SE
.Model_1 0.45621 0.01

So the error arises when I request 5 folds, but when I request 2 folds the cross-validation runs smoothly.
I have attempted to check my dataset for any apparent issues and consulted the documentation, but have not found a way to resolve this issue. I am unsure if it is a problem with my data or if there are some limitations in the function itself that I am not aware of. I greatly appreciate the effort you have put into developing this package; it has been immensely helpful for my research. I would be grateful if you could assist me in resolving this issue or guide me on how to proceed further.
Thank you very much for your time and assistance!

civilstat commented 7 months ago

Hello, thank you for reaching out!

The error message says:
not all nfolds <= table(foldID.clus$strat) are TRUE

So your dataset probably has too few unique clusters/PSUs within each stratum. It looks like you are using NHANES data, for which sdmvpsu seems to give only 2 PSU IDs nested within each stratum. Does that sound right?
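One way to check this is to count the distinct PSU IDs inside each stratum. The following is a minimal base-R sketch on made-up toy data; the column names sdmvstra and sdmvpsu mirror the ones in this thread, but the values are invented for illustration. Because PSU IDs are reused across strata when nest = TRUE, the count has to be done within strata rather than on the PSU column alone.

```r
# Toy data (invented values): three strata, PSU IDs nested within stratum.
toy <- data.frame(
  sdmvstra = rep(c(101, 102, 103), each = 6),
  sdmvpsu  = c(1, 1, 1, 2, 2, 2,
               1, 1, 2, 2, 2, 2,
               1, 1, 2, 2, 3, 3)
)

# Number of distinct PSUs within each stratum.
psu_per_stratum <- tapply(toy$sdmvpsu, toy$sdmvstra,
                          function(p) length(unique(p)))
print(psu_per_stratum)

# The largest number of folds that every stratum can support.
max_nfolds <- min(psu_per_stratum)
print(max_nfolds)
```

On real data, replacing toy with the analysis data frame (demo.train in the post above) should show whether any stratum has fewer PSUs than the requested nfolds.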

In order to account for the survey design properly, (1) all units from the same PSU must end up in the same fold, and (2) folds are created independently in each stratum. So surveyCV will not let you request more CV folds than the smallest number of PSUs per stratum. In your case, if any stratum has fewer than 5 PSUs in it, you cannot create 5 folds in that stratum, so you cannot create 5 folds overall.
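The two constraints above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the surveyCV implementation: assign_folds is a hypothetical helper, and the toy data and its column names (stratum, psu) are invented.

```r
set.seed(1)

# Toy data (invented): 2 strata, each with 3 PSUs of 2 units.
toy <- data.frame(
  stratum = rep(c("A", "B"), each = 6),
  psu     = rep(c(1, 1, 2, 2, 3, 3), times = 2)
)

# Hypothetical helper: assigns a fold ID to every row of `data`.
assign_folds <- function(data, nfolds) {
  fold <- integer(nrow(data))
  # (2) Folds are created independently within each stratum.
  for (s in unique(data$stratum)) {
    in_s <- data$stratum == s
    psus <- unique(data$psu[in_s])
    # A stratum cannot supply more folds than it has PSUs.
    stopifnot(nfolds <= length(psus))
    # Randomly spread the fold labels 1..nfolds over this stratum's PSUs.
    psu_fold <- sample(rep_len(seq_len(nfolds), length(psus)))
    # (1) Every unit inherits the fold of its PSU.
    fold[in_s] <- psu_fold[match(data$psu[in_s], psus)]
  }
  fold
}

toy$fold <- assign_folds(toy, nfolds = 3)
print(unique(toy))
```

With 3 PSUs per stratum, asking this sketch for nfolds = 4 stops with an error at the stopifnot() check, which is the same kind of failure reported in this issue.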

But if your data does have 5 or more PSUs in every stratum, then something else is wrong -- if so, please let me know and we can dig into it further.

beyound88 commented 7 months ago

Thank you very much for your swift and insightful response. Your explanation of the error message and how it relates to the number of primary sampling units (PSUs) within each stratum was incredibly helpful. I can confirm that I am indeed working with NHANES data, and your assumption about the PSU structure within the data is correct.

Understanding that the limitation comes from the requirement that all units from the same PSU must end up in the same fold, and that folds need to be created independently in each stratum, clarifies the situation significantly. I now recognize that attempting to create more folds than the smallest number of PSUs in any stratum is the root of the problem I encountered. The NHANES dataset I am working with does not have enough PSUs in each stratum to allow for the number of folds I initially intended to create. This matches the scenario you described and highlights the critical role of survey design considerations in statistical analysis.

I appreciate the clarity and depth of your guidance. It is evident now that adhering to the survey's design constraints is paramount for ensuring the validity of any analysis performed. Thank you once again for your invaluable support and for the work you do in maintaining the surveyCV package. I look forward to any future opportunities to learn from your expertise.