gergness / srvyr

R package to add 'dplyr'-like Syntax for Summary Statistics of Survey Data

`as_survey_twophase` loses data variables when using method "simple" #175

Closed: apeterson91 closed this issue 1 month ago

apeterson91 commented 3 months ago

The title and reprex below are hopefully self-explanatory. I haven't had time yet to dig into why this happens.

library(survey)
#> Loading required package: grid
#> Loading required package: Matrix
#> Loading required package: survival
#> 
#> Attaching package: 'survey'
#> The following object is masked from 'package:graphics':
#> 
#>     dotchart
library(srvyr)
#> 
#> Attaching package: 'srvyr'
#> The following object is masked from 'package:stats':
#> 
#>     filter

## two-phase simple random sampling.
data(pbc, package = "survival")
pbc$randomized <- with(pbc, !is.na(trt) & trt > 0)
pbc$id <- 1:nrow(pbc)
d2pbc <- twophase(
  id = list(~id, ~id), data = pbc, subset = ~randomized,
  # small edit from Lumley's first twophase example
  method = "simple"
)
svymean(~bili, d2pbc)
#>        mean     SE
#> bili 3.2561 0.2564

## srvyr example

# method 'full'

pbc %>%
  as_survey_twophase(
    id = list(id, id),
    subset = randomized,
    method = "full" # default option
  ) %>% 
  summarize(
    mean = srvyr::survey_mean(bili)
  )
#> # A tibble: 1 × 2
#>    mean mean_se
#>   <dbl>   <dbl>
#> 1  3.26   0.256

# method 'simple'
pbc %>%
  as_survey_twophase(
    id = list(id, id),
    subset = randomized,
    method = "simple"
  ) %>% 
  summarize(
    mean = srvyr::survey_mean(bili)
  )
#> Error in `dplyr::summarise()`:
#> ℹ In argument: `mean = srvyr::survey_mean(bili)`.
#> Caused by error:
#> ! object 'bili' not found

## Doesn't appear to have any "data" variables

pbc %>%
  as_survey_twophase(
    id = list(id, id),
    subset = randomized,
    method = "simple"
  )
#> Two-phase design: Called via srvyr
#> Phase 1:
#> Independent Sampling design (with replacement)
#> svydesign(ids = ~id)
#> Phase 2:
#> Independent Sampling design
#> svydesign(ids = ~id, fpc = `*phase1*`)
#> Sampling variables:
#>  - ids: `~id`
#>  - subset: randomized
#> Data variables: ()

Created on 2024-07-20 with reprex v2.1.0

bschneidr commented 2 months ago

Thanks for this bug report. I'll look into this as I'm currently digging into the two-phase design internals anyway for other work.

bschneidr commented 1 month ago

Sorry it took a while for me to get to this. The reproducible example was great, and much appreciated.

@apeterson91, would you mind installing the version of srvyr with the proposed change and replying here to confirm whether it fixes the issue in your actual data application?

You can install from: remotes::install_github("bschneidr/srvyr@twophase-simple-method-fix")
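
One possible way to check the fix after installing is to rerun the failing pipeline from the reprex and compare it with the estimate from the survey package. This is only a sketch: the names `reference` and `result` are made up for illustration, and the `tryCatch()` wrapper is there just so the script reports the error instead of stopping on versions that still have the bug.

```r
# One-time install of the branch named in this thread (needs the 'remotes' package):
# remotes::install_github("bschneidr/srvyr@twophase-simple-method-fix")

library(survey)
library(srvyr)

# Same data preparation as in the reprex above
data(pbc, package = "survival")
pbc$randomized <- with(pbc, !is.na(trt) & trt > 0)
pbc$id <- 1:nrow(pbc)

# Reference estimate computed with the survey package directly
d2pbc <- twophase(
  id = list(~id, ~id), data = pbc, subset = ~randomized,
  method = "simple"
)
reference <- svymean(~bili, d2pbc)

# Same design through srvyr; on affected versions this step fails with
# "object 'bili' not found", so the error is captured rather than thrown
result <- tryCatch(
  pbc %>%
    as_survey_twophase(
      id = list(id, id),
      subset = randomized,
      method = "simple"
    ) %>%
    summarize(mean = survey_mean(bili)),
  error = function(e) e
)

if (inherits(result, "error")) {
  message("Still failing: ", conditionMessage(result))
} else {
  # Prints TRUE when the srvyr point estimate matches the survey estimate
  print(isTRUE(all.equal(result$mean, unname(coef(reference)))))
}
```

If the comparison prints `TRUE`, the "simple" method is keeping the data variables and giving the same point estimate as `svymean()`.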

apeterson91 commented 1 month ago

Works for me! I'll mark this as closed.