ColbyStatSvyRsch / surveyCV

R package {surveyCV}: K-fold cross-validation for complex sample survey designs, and associated paper (https://doi.org/10.1002/sta4.454)
7 stars 1 forks source link

`Error in appendfolds(Data = Data, nfolds = nfolds, strataID = strataID, : nfolds <= table(foldID.clus$strat) are not all TRUE` & `Error: $ operator is invalid for atomic vectors` when running `cv.svyglm` #2

Closed themichjam closed 2 years ago

themichjam commented 2 years ago

I'm having an issue when running cv.svyglm, the error Error: $ operator is invalid for atomic vectors. Apologies if this is simple, but I can not seem to figure it out. Reproducible example below.


#reprex example
# load 
library(survey)
library(srvyr)
library(tidymodels)
library(surveyCV)
data(api)

# stratified sample survey design
dstrata <- apistrat %>%
  as_survey_design(strata = stype,
                   weights = pw)

# svyglm
dglm <- svyglm(awards ~ meals,
               design = dstrata,
               na.action = na.omit,
               family = quasibinomial
)

# cv.svyglm
cv.svyglm(glm_object = dglm, nfolds = 4)
themichjam commented 2 years ago

I'm also getting the error Error in appendfolds(Data = Data, nfolds = nfolds, strataID = strataID, : nfolds <= table(foldID.clus$strat) are not all TRUE when running cv.survey on a subset of my actual data that mirrors the repro example above (except my data is majority factor class, and the vars within the svyglm are factor). I know there's not much (if any) info when googling this particular error, so any insight at all would be appreciated.

civilstat commented 2 years ago

Thanks for letting me know! I see now that our package doesn't correctly handle logistic regression with a factor response.

It may take me a few days to find the time to fix that. But meanwhile, could you try changing your response variable to 0/1s? In your first example, you could try apistrat$awards01 <- as.numeric(apistrat$awards == "Yes") before you create the svydesign object and see if it works then.

(If that doesn't fix it, let me know. I have not tested this package with srvyr yet and I haven't tackled how to handle missing values yet, so those might be possible culprits.)

civilstat commented 2 years ago

For the second issue, with appendfolds(), it seems that your data might have fewer than nfolds sampling units in some of your strata. Each cross-validation fold ought to include data from every stratum, in order to mimic the survey design appropriately. So if, say, you have a small stratum with only 3 responses, you can't use more than 3-fold CV.

If that's the problem, but you want to use more folds, you could consider collapsing several small strata together into larger "pseudo-strata." (Apologies if you're already familiar with this! But in case you're not, imagine that your survey design stratified people into 10-year age groups, but you got very few respondents in the 90-and-older stratum -- too few to get stable variance estimates or to run CV or whatever. Then it might be reasonable to collapse them with the 80-89 stratum, forming a single "80-and-older" pseudo-stratum. Then your variance estimates or CV won't be exactly right, but since you couldn't compute the right ones anyway, this might be a decent approximation.)

themichjam commented 2 years ago

apistrat$awards01 <- as.numeric(apistrat$awards == "Yes")

Yes, your fix worked for this issue! Reprex below

## reprex example
library(survey)
library(srvyr)
library(tidymodels)
library(surveyCV)
library(gtsummary)
library(report)
data(api)

# civilstat fix
apistrat$awards01 <- as.numeric(apistrat$awards == "Yes")

# stratified sample
dstrata <- svydesign(
  id = ~1,
  strata = ~stype,
  weights = ~pw,
  data = apistrat,
  fpc = ~fpc)

dglm <- svyglm(awards01 ~ meals,
               design = dstrata,
               na.action = na.omit,
               family = quasibinomial
)

cv.svyglm(glm_object = dglm, nfolds = 4)

> cv.svyglm(glm_object = dglm, nfolds = 4)
            mean     SE
.Model_1 0.66831 0.0208
themichjam commented 2 years ago

Error in appendfolds(Data = Data, nfolds = nfolds, strataID = strataID, : nfolds <= table(foldID.clus$strat) are not all TRUE

For this error, I applied the above fix of making the response var 0/1, and then played with the nfolds = command. Results were:

# 4 folds
cv.svyglm(glm_object = dglm, nfolds = 4)
Error in appendfolds(Data = Data, nfolds = nfolds, strataID = strataID,  : 
  nfolds <= table(foldID.clus$strat) are not all TRUE

# 3 folds
cv.svyglm(glm_object = dglm, nfolds = 3)
Error in appendfolds(Data = Data, nfolds = nfolds, strataID = strataID,  : 
  nfolds <= table(foldID.clus$strat) are not all TRUE

# 2 folds
cv.svyglm(glm_object = dglm, nfolds = 2)
Error in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1],  : 
  Stratum (1) has only one PSU at stage 1

# 1 fold
cv.svyglm(glm_object = dglm, nfolds = 1)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

I'm very much a beginner with this package/area so I really appreciate your input!

civilstat commented 2 years ago

No problem!

For the appendfolds error, the message with 2 folds says Stratum (1) has only one PSU at stage 1.
PSU = primary sampling unit = either a respondent, or a cluster of respondents if you used cluster sampling within strata.

So if you run the equivalent of table(apistrat$stype) on your own data, I suspect you'll see that one or more of your strata has exactly one PSU.
(Or, if your survey design took cluster samples within strata, you might have to crosstabulate the stratum IDs and cluster IDs to see this: something like table(yourdata$strataID, yourdata$clusterID) should show that at least one stratum contains respondents from only one cluster.)

If that's true, then this is not a bug with the package -- it's an intentional limitation to make sure that every stratum has enough sampling units. It has to be up to the data analyst to decide how to collapse small strata together into pseudo-strata with several PSUs each, using what you know about the context / meaning of the strata.

If the data come from someone else, their documentation might give advice about how to collapse strata. And if it's a publicly available dataset and you don't mind sharing the link, I'd be happy to take a look myself. It would be good for our package to give an example of how to address this issue and how to form pseudo-strata.

themichjam commented 2 years ago

table(apistrat$stype)

The data isn't open, but is safeguarded. It's available through the UK Data Service (https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=6379), but if you're happy to, I can add you to the existing access I have (and this would make access much quicker). I think this data is a good example to work with as it's a large complex survey.

civilstat commented 2 years ago

Thanks for the offer to add me to the data access. But since it's not already public, I wouldn't be able to include an open source extract of this data in the package documentation or vignettes. I will find another example for package purposes.

The documentation (https://doc.ukdataservice.ac.uk/doc/6379/mrdoc/pdf/6379_apms_2007_dataset_documentation.pdf) looks a bit confusing! I don't know why they call the stratification variable CLUSTER -- is this UK terminology? What I would think of as survey design clustering, the PSUs, are in the variable called AREA. (Plus there are other variables CLUSTER1 etc that have to do with latent classes, not the survey design; and DVSTRAT which seems to be for post-stratification but not needed for accounting for the survey design???; and some more variables for combining the 2007 and 2014 datasets...)

I imagine you are running something like as_survey_design(ids = AREA, strata = CLUSTER, weights = WT_INTS1), and possibly also with nest = TRUE if the values of AREA are not distinct across values of CLUSTER.
Without seeing the data, I imagine that in most of the strata (CLUSTER) there should be several PSUs (AREA), and each PSU has several respondents. But based on your error message, there must be at least one value of CLUSTER for which there is only one value of AREA. Does that sound right?

themichjam commented 2 years ago

Thanks for the offer to add me to the data access. But since it's not already public, I wouldn't be able to include an open source extract of this data in the package documentation or vignettes. I will find another example for package purposes.

The documentation (https://doc.ukdataservice.ac.uk/doc/6379/mrdoc/pdf/6379_apms_2007_dataset_documentation.pdf) looks a bit confusing! I don't know why they call the stratification variable CLUSTER -- is this UK terminology? What I would think of as survey design clustering, the PSUs, are in the variable called AREA. (Plus there are other variables CLUSTER1 etc that have to do with latent classes, not the survey design; and DVSTRAT which seems to be for post-stratification but not needed for accounting for the survey design???; and some more variables for combining the 2007 and 2014 datasets...)

I imagine you are running something like as_survey_design(ids = AREA, strata = CLUSTER, weights = WT_INTS1), and possibly also with nest = TRUE if the values of AREA are not distinct across values of CLUSTER. Without seeing the data, I imagine that in most of the strata (CLUSTER) there should be several PSUs (AREA), and each PSU has several respondents. But based on your error message, there must be at least one value of CLUSTER for which there is only one value of AREA. Does that sound right?

Yes, the documentation is definitely confusing! I'm glad it's not just me who thinks so. The naming choices are also not what you would expect to see everyday in the UK - they've just added another layer of confusion for some reason.

ANd yes, you're also right about svydesign(data = t07s, strata = ~cluster, id = ~area, nest = TRUE, weights = ~wt_ints1)

So as for the error message, after running table(yourdata$strataID, yourdata$clusterID) the results look like this example col 597 x row 1 is a 0, but col 601 by row 100 is 16. And most of that table has 0 for an answer.

(apologies for the non reprex example)

Does that match up with your thoughts on the error message?

civilstat commented 2 years ago

Hmm. That looks like you've subset the data to just 16 respondents, and they are all in the same PSU and in the same stratum. Is that what you meant to do?

If so, CV may not be the right tool for this job. As a first approximation, maybe it's appropriate to treat those 16 respondents as a simple random sample from that PSU and that stratum, and just use simple logistic regression? Or svyglm with just the survey weights? If you really think CV is the right tool for just these 16 respondents, you could ignore clusters and strata when you create the svydesign object, and only give it the survey weights: svydesign(data = t07s, id = ~1, weights = ~wt_ints1).

But if you expected to have more than 16 people, double check your earlier steps, and let me know if the problem still persists after you recover the missing rows of data.

themichjam commented 2 years ago
cex1

So I ran table(subset_apms_07$cluster, subset_apms_07$area) on the actual data and I've included a non-disclosive screenshot of the table output. The only thing done to subset_apms_07 was I reduced it to the variables of interest needed for analysis (so 200 vars instead of the original 2000, but same # of observations).

And creating a svydesign object without the strata or cluster vars, and only the weights, did work! - but would this be considered unsound? ignoring the strata and cluster vars in the svydesign?

civilstat commented 2 years ago

I would not create a svydesign object without the strata or cluster vars UNLESS you're intentionally analyzing just a small subset of observations who are ALL within the same stratum and cluster. But it sounds like you aren't intending to do that.

So let's double-check: Are you sure you still have all the observations? Does nrow(subset_apms_07) show the number of rows/respondents/observations that you expect to have?

If so, are you sure you need nest=TRUE?
In your screenshot, it seems the cluster labels do NOT repeat across strata. For example, let's say your strata are postcodes 1,2,3... and in each stratum your clusters are villages.
If the villages had cluster-labels 101,102,103... in stratum 1, and then different villages reused the same cluster-labels 101,102,103... in stratum 2, you should set nest=TRUE .
But if the villages have cluster-labels 101,102,103 in stratum 1, and new cluster-labels 201,202,203 in stratum 2, you should set nest=FALSE.

One way to check this could be by(data = subset_apms_07$area, INDICES = subset_apms_07$cluster, FUN = function(x) sort(unique(x))). Do different clusters have the same area values, or entirely different area values?

civilstat commented 2 years ago

@themichjam Sorry I didn't realize until today that you posted a thumbs-up. I hope this means that nest=FALSE or one of the other suggestions resolved your appendfolds error! If not, you're welcome to email me directly. As far as I can tell, the appendfolds error seems to be arising due to nuances of working with this particular dataset, rather than due to a bug in surveyCV itself.

As for the first error you originally mentioned, I've fixed that in the March 1st 2022 update to v0.1.2 (on GitHub but not yet on CRAN). The updated package should be able to handle factor responses in a logistic regression without any troubles now.

themichjam commented 2 years ago

Thanks for all your help! The original error has indeed been sorted by the new update, and the second does seem to be an issue with the data I'm working with rather then appendfolds itself. I'm happy to email further comments. It's been interesting for sure!