Functions for creating replicate designs?

bschneidr commented 2 years ago

Some ideas for replicate designs to support, not already supported by the 'survey' package.

Successive Differences Replication (SDR)
Bootstraps
- Rescaled: Beaumont/Emond Extension of Rao-Wu bootstrap: https://www.mdpi.com/2571-905X/5/2/19
- Generalized bootstrap for surveys (Beaumont & Patak 2012 / Fay 1989)
- Options for Horvitz-Thompson, Sen-Yates-Grundy, SD1/SD2
- Pseudo-population
- Bayesian bootstrap?

To create these, could use an interface such as the following:

as_svrep_design(
  design = my_design_object,
  method = successive_differences(
    cycles = 3, n_replicates = 50
  )
)

This way, if the user wants more details on a specific replication method, they can look at a function specific to that method (e.g., by calling ?successive_differences() or ?pseudo_pop_boot()). The actual replicate creation could be handled by helpers such as create_success_differences_reps().
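One way this interface could be wired up is with small S3 "method specification" objects. The following is only a rough sketch: create_replicate_factors() and the "replication_method" class are hypothetical names, a plain data frame stands in for a real survey design object, and the placeholder method returns a matrix of ones rather than real successive-difference factors.

```r
# Hypothetical constructor: records the options for one replication method.
successive_differences <- function(cycles = 1, n_replicates = 50) {
  stopifnot(cycles >= 1, n_replicates >= 1)
  structure(
    list(cycles = cycles, n_replicates = n_replicates),
    class = c("successive_differences", "replication_method")
  )
}

# Hypothetical generic: one S3 method per replication technique.
create_replicate_factors <- function(method, n_units) {
  UseMethod("create_replicate_factors")
}

create_replicate_factors.successive_differences <- function(method, n_units) {
  # Placeholder only: real SDR factors would be derived from a Hadamard matrix.
  matrix(1, nrow = n_units, ncol = method$n_replicates)
}

# Sketch of the top-level function: dispatches on the method object.
as_svrep_design <- function(design, method) {
  stopifnot(inherits(method, "replication_method"))
  factors <- create_replicate_factors(method, n_units = nrow(design))
  list(data = design, method = method, replicate_factors = factors)
}

# Toy usage, with a data frame standing in for a design object.
toy_design <- data.frame(y = c(3, 5, 8), weight = c(10, 10, 20))
rep_design <- as_svrep_design(
  design = toy_design,
  method = successive_differences(cycles = 3, n_replicates = 50)
)
dim(rep_design$replicate_factors)
```

With this layout, adding a new replication method would only require a new constructor and a matching create_replicate_factors() method, and each constructor gets its own help page.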

bschneidr commented 2 years ago

Would also be good to include a vignette on choosing among different replication methods, and choosing the number of bootstrap replicates to use.

bschneidr commented 2 years ago

For generalized bootstrap based on Beaumont & Patak (2012), basic code is super simple:

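# Requires the MASS package for mvrnorm().
# B:     number of bootstrap replicates
# Sigma: symmetric n x n matrix of the target variance estimator's quadratic form
gen_boot_factors <- function(B, Sigma) {

  n <- nrow(Sigma)

  if (!isSymmetric(Sigma)) {
    stop("`Sigma` must be a symmetric matrix.")
  }

  # Draw B vectors of adjustment factors with mean 1 and covariance Sigma,
  # then transpose so that rows are units and columns are replicates
  replicate_factors <- t(
    MASS::mvrnorm(n = B,
                  mu = rep(1, times = n),
                  Sigma = Sigma,
                  empirical = FALSE)
  )

  # If any factor is negative, shrink all factors toward 1
  # so that the smallest factor becomes zero
  if (any(replicate_factors < 0)) {
    rescaling_constant <- max(1 - replicate_factors)
    rescaled_replicate_factors <- (replicate_factors + (rescaling_constant - 1)) / rescaling_constant
  } else {
    rescaling_constant <- 1
    rescaled_replicate_factors <- replicate_factors
  }
  # Store the rescaling constant so variance estimates can be scaled back up
  attr(rescaled_replicate_factors, 'tau') <- rescaling_constant

  return(rescaled_replicate_factors)
}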
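To illustrate how the 'tau' rescaling constant would enter a variance estimate, here is a small self-contained check. It is a sketch under assumptions, not package code: it uses base R only (an eigendecomposition in place of MASS::mvrnorm()), made-up weighted values, and a with-replacement-style quadratic form chosen purely for illustration. The replicate-based estimate, multiplied by tau^2, should land close to the target quadratic form.

```r
set.seed(2023)

# Made-up weighted values w_i * y_i for n = 5 sampled units
wtd_y <- c(10, 12, 9, 15, 11)
n <- length(wtd_y)

# Illustrative quadratic form: Sigma = n/(n-1) * (I - J/n)
Sigma <- (n / (n - 1)) * (diag(n) - matrix(1 / n, n, n))
target <- as.numeric(t(wtd_y) %*% Sigma %*% wtd_y)

# Draw factors with mean 1 and covariance Sigma (n rows, B columns).
# Sigma is singular, so use an eigendecomposition rather than chol().
B <- 5000
eig <- eigen(Sigma, symmetric = TRUE)
root <- eig$vectors %*% diag(sqrt(pmax(eig$values, 0)))
factors <- 1 + root %*% matrix(rnorm(n * B), nrow = n)

# Same rescaling rule as in gen_boot_factors()
tau <- if (any(factors < 0)) max(1 - factors) else 1
rescaled <- (factors + (tau - 1)) / tau

# Replicate totals, then the variance estimate scaled back up by tau^2
rep_totals <- colSums(rescaled * wtd_y)
var_est <- tau^2 * mean((rep_totals - sum(wtd_y))^2)

c(target = target, estimate = var_est, tau = tau)
```

Rescaling shrinks every replicate estimate's deviation from the full-sample estimate by a factor of tau, which is why the sum of squared deviations has to be multiplied by tau^2.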

bschneidr commented 1 year ago

All of the bootstrap methods are looking good. Test coverage is back to 92%.

Only two issues to iron out:

(To-Do) When Poisson sampling is used to model nonresponse in a later stage of a multistage survey, it's unclear how the data should represent this. It would be nice not to have to include records for nonrespondents, but if those records are left out, the bootstrap adjustments for the actual sampling stage won't be correct. So for Poisson sampling, perhaps make_rwyb_bootstrap_weights() needs a new argument, such as inclusion_indicator. It doesn't seem like svydesign() can accommodate all the information needed when a stage of the survey has nonresponse. Maybe a better interface would be something like the following:

specify_design(
    sampling_stage(type = "PPSWOR", prob = "PSU_PROB", id = "PSU_ID", stratum = "FIRST_STAGE_STRATUM"),
    sampling_stage(type = "Poisson", prob = "PSU_RESP_PROB", id = "PSU_ID", response_indicator = "IS_RESPONDENT"),
    sampling_stage(type = "SRSWOR", prob = "SSU_PROB", id = "SSU_ID", stratum = "SECOND_STAGE_STRATUM")
)
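As a rough sketch of how such an interface might be backed, the constructors could simply record and validate column names. Everything here is hypothetical: sampling_stage(), specify_design(), required_columns(), and the rule that response_indicator is only allowed for Poisson stages are assumptions made for illustration, not existing functions or behavior in 'svrep' or 'survey'.

```r
# Hypothetical constructor: stores the column names describing one stage.
sampling_stage <- function(type, prob, id, stratum = NULL,
                           response_indicator = NULL) {
  type <- match.arg(type, c("PPSWOR", "Poisson", "SRSWOR"))
  if (!is.null(response_indicator) && type != "Poisson") {
    stop("`response_indicator` is only meaningful for Poisson stages.")
  }
  structure(
    list(type = type, prob = prob, id = id, stratum = stratum,
         response_indicator = response_indicator),
    class = "sampling_stage"
  )
}

# Hypothetical constructor: collects the stages in order.
specify_design <- function(...) {
  stages <- list(...)
  is_stage <- vapply(stages, inherits, logical(1), what = "sampling_stage")
  stopifnot(length(stages) >= 1, all(is_stage))
  structure(list(stages = stages), class = "design_spec")
}

# Hypothetical helper: the data columns a specification refers to.
required_columns <- function(spec) {
  cols <- lapply(spec$stages, function(s) {
    c(s$prob, s$id, s$stratum, s$response_indicator)
  })
  unique(unlist(cols))
}

spec <- specify_design(
  sampling_stage(type = "PPSWOR", prob = "PSU_PROB", id = "PSU_ID",
                 stratum = "FIRST_STAGE_STRATUM"),
  sampling_stage(type = "Poisson", prob = "PSU_RESP_PROB", id = "PSU_ID",
                 response_indicator = "IS_RESPONDENT"),
  sampling_stage(type = "SRSWOR", prob = "SSU_PROB", id = "SSU_ID",
                 stratum = "SECOND_STAGE_STRATUM")
)
required_columns(spec)
```

Keeping the response indicator on the Poisson stage itself would let a weight-creation function tell sampled nonrespondents apart from units that were never sampled, which is the information the paragraph above says is missing.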

(Done) Need to ensure that make_rwyb_bootstrap_weights() works correctly for three or more stages. This means going through the Beaumont & Emond (2022) paper and making sure the package correctly implements its section on multistage sampling.

bschneidr commented 1 year ago

Some future updates that would be nice for the bootstrap methods:

Accommodate general multi-phase designs, with each phase's sampling allowed to be any of the designs currently supported for one-phase designs.
For generalized bootstrap of PPS designs, support methods that use approximate joint probabilities.
For generalized bootstrap, allow mixing of PPS, SRS, and systematic sampling at different stages.

Some other replication methods worth supporting:

Fay's generalized replication method, with optional balancing and optional sampling of replicates.
Successive differences replication?
