ColbyStatSvyRsch / surveyCV

R package {surveyCV}: K-fold cross-validation for complex sample survey designs, and associated paper (https://doi.org/10.1002/sta4.454)

cv.svyglm() issue with design id length? #1

Status: Closed (closed by amsutton 2 years ago)

amsutton commented 2 years ago

Hi there,

I'm a student learning to use both the survey package and your package (thank you, this is great!), so forgive me if this question is basic. I am attempting to call cv.svyglm() on a svyglm object whose design is specified as:

design <- survey::svydesign(id=~1, data=dat, weights= ~weight)

So, it has weights, which your examples do not, but I hope that doesn't matter.

The svyglm object is specified as:

m21 <- svyglm(mask ~ wave + ppreg9 + local_trust + state_trust + federal_trust + prop_white + prop_black + prop_other + age + (urban + race_ethnicity + employment)^2 -1, design=design, family=quasibinomial(link="logit"))

When I call surveyCV::cv.svyglm(m21, nfolds = 10) I receive the following error:

Error in cv.svydesign(design_object = design_object, formulae = formulae, : length(ids.formula) == 2 is not TRUE

Can you suggest how I might be able to proceed? I am so thrilled to have found a solution for cv using survey objects, and I'd love to be able to continue to use it!

Thanks for your attention!

civilstat commented 2 years ago

Thanks for the feedback! Would you mind trying it again, but naming the argument "ids" instead of "id" when you first run svydesign()?

If that works, we'll update the code to work with either "ids" or "id". If not, let me know and we can dig deeper to figure out the problem.

amsutton commented 2 years ago

Thanks for the help! I truly appreciate it.

I ran: survey::svydesign(ids=~1, data=dat, weights= ~weight)

and then successfully ran my svyglm(), assigned to m21, but when I ran surveyCV::cv.svyglm(m21, nfolds = 10) again, I received this error:

Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  :
  NA/NaN/Inf in 'y'
In addition: Warning messages:
1: In Ops.factor(y, mu) : ‘-’ not meaningful for factors
2: In Ops.factor(eta, offset) : ‘-’ not meaningful for factors
3: In Ops.factor(y, mu) : ‘-’ not meaningful for factors

So I guess this is an error on my end? Maybe it dislikes the quasibinomial family, which I chose because it avoids the non-integer successes warning?

civilstat commented 2 years ago

Hmm. Those error and warning messages suggest that your response variable (mask?) contains NA, NaN, or Inf values, and/or that it is a factor variable instead of a 0/1 variable. Is that true?
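A quick way to check both possibilities is sketched below in base R. The data frame here is a made-up stand-in (only the column name mask comes from this thread), so substitute your own dat:

```r
# Hypothetical stand-in data; only the column name `mask` comes from the thread.
dat <- data.frame(mask = factor(c("no", "yes", "yes", NA, "no")))

# Is the response stored as a factor rather than as numeric 0/1?
is_factor <- is.factor(dat$mask)

# How many responses are missing?
n_missing <- sum(is.na(dat$mask))

# Inspect the storage type and the factor levels.
str(dat$mask)
levels(dat$mask)
```

If is_factor is TRUE or n_missing is greater than zero, that would be consistent with the messages above.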

On the other hand, if that's true, you should be getting errors earlier when you run svyglm(), before you even get to cv.svyglm(). Could you post a reproducible example, including a small (fake) dataset, so that I can replicate the error and figure out what's causing it? http://adv-r.had.co.nz/Reproducibility.html

Finally, also try cv.svyglm(..., method="logistic"). Eventually our code will automatically detect whether it's linear or logistic, but we haven't implemented that yet.

amsutton commented 2 years ago

Ah, I see the error. Previously it hadn't mattered that my outcome was a factor, but now that I've made sure the outcome is simply a binary variable, it works. Thank you for being patient with me!
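For anyone hitting the same error, one way to recode a two-level factor as a 0/1 variable is sketched below. The thread does not show how the recoding was actually done, so the level names ("no", "yes") and the new column name mask01 are assumptions for illustration only:

```r
# Hypothetical stand-in data; level names and `mask01` are illustrative assumptions.
dat <- data.frame(mask = factor(c("no", "yes", "yes", "no", NA)))

# Compare against the level that should count as a "success", then convert
# the logical result to integer 0/1. Missing values stay NA.
# Note: as.numeric(dat$mask) would return the internal codes 1/2, not 0/1.
dat$mask01 <- as.integer(dat$mask == "yes")

# Cross-check the recoding against the original factor.
table(dat$mask, dat$mask01, useNA = "ifany")
```

The recoded column would then replace the factor on the left-hand side of the model formula before refitting with svyglm().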

civilstat commented 2 years ago

No problem. I'm glad to hear it works now! Keep us posted if you have other feedback about the package or you think of features you might need that aren't implemented yet.

civilstat commented 2 years ago

@amsutton As of March 1, 2022, the latest package version v0.1.2 (on GitHub but not yet on CRAN) now allows logistic regression to use factor response variables. Thanks again for bringing up this issue!