bschneidr / svrep

Tools for Creating, Updating, and Analyzing Survey Replicate Weights
https://bschneidr.github.io/svrep/
GNU General Public License v3.0
6 stars 1 forks source link

Bug: Conversion to RWYB bootstrap doesn't work if weights+fpc are specified #15

Closed bschneidr closed 1 year ago

bschneidr commented 1 year ago

Bug description

The bootstrap conversion with as_bootstrap_design() fails for survey design objects specified a certain way.

library(survey)
library(svrep)

data('mu284', package = 'survey')

# Manually specify weights
design_1 <- svydesign(
  data = mu284 |> transform(wt = (n1/5)*(n2/3)),
  ids = ~ id1 + id2,
  fpc = ~ n1  +  n2,
  weights = ~ wt
)

# Let 'survey' package automatically figure out weights
design_2 <- svydesign(
  data = mu284,
  ids = ~ id1 + id2,
  fpc = ~ n1  +  n2
)

# Convert to bootstrap
svrep::as_bootstrap_design(design_2) # Works
#> Call: svrep::as_bootstrap_design(design_2)
#> Survey bootstrap with 500 replicates.

svrep::as_bootstrap_design(design_1) # Fails
#> Error in make_rwyb_bootstrap_weights(num_replicates = replicates, samp_unit_ids = samp_unit_ids_by_stage, : `samp_unit_sel_probs` must have the same number of rows and columns as `samp_unit_ids`.

Created on 2022-12-13 with reprex v2.0.2

Bug explanation

Apparently the creation of the "allprob" element of a survey design object depends on whether the user uses the argument weights when specifying the design. If the user doesn't specify weights, then the 'survey' package automatically figures out the weights based on sampling fractions at each stage, and these are then used to construct allprob. But if the user specifies weights, then allprob is simply the inverse of the weights.

# Inspect structure of 'allprob'
design_1$allprob
#>              wt
#> 19.1 0.06000000
#> 45.1 0.03750000
#> 47.1 0.06000000
#> 50.1 0.03333333
#> 31.1 0.04285714
#> 19.2 0.06000000
#> 45.2 0.03750000
#> 47.2 0.06000000
#> 50.2 0.03333333
#> 31.2 0.04285714
#> 19.3 0.06000000
#> 45.3 0.03750000
#> 47.3 0.06000000
#> 50.3 0.03333333
#> 31.3 0.04285714
design_2$allprob
#>       n1        n2
#> 19.1 0.1 0.6000000
#> 45.1 0.1 0.3750000
#> 47.1 0.1 0.6000000
#> 50.1 0.1 0.3333333
#> 31.1 0.1 0.4285714
#> 19.2 0.1 0.6000000
#> 45.2 0.1 0.3750000
#> 47.2 0.1 0.6000000
#> 50.2 0.1 0.3333333
#> 31.2 0.1 0.4285714
#> 19.3 0.1 0.6000000
#> 45.3 0.1 0.3750000
#> 47.3 0.1 0.6000000
#> 50.3 0.1 0.3333333
#> 31.3 0.1 0.4285714
petulla commented 1 year ago

Related: It seems the fpc is not applied if weights is supplied.

data(api, package = "survey")

apiclus1$fpc <- 18
dclus1 <- svydesign(data = apiclus1,
                    id = ~dnum, weights = ~pw, fpc=~fpc)

# Create replicate-weights survey design
orig_rep_design <- as_bootstrap_design(dclus1, replicates = 5000,
                                       type = "Rao-Wu-Yue-Beaumont")

reps <- withReplicates(orig_rep_design, function(w, data) {
    weighted.mean(data$api99, w)
}, return.replicates=T)

r <- reps$replicates
print(quantile(r, 0.025))
print(quantile(r, 0.975))
confint(svymean(~api99, orig_rep_design), df=degf(orig_rep_design))

boots <- as.svrepdesign(dclus1, type="bootstrap",replicates=5000)
reps <- withReplicates(boots, function(w, data) {
    weighted.mean(data$api99, w)
}, return.replicates=T)

r <- reps$replicates
print(quantile(r, 0.025))
print(quantile(r, 0.975))
confint(svymean(~api99, boots), df=degf(boots))
> print(quantile(r, 0.025))
    2.5% 
560.2901 
> print(quantile(r, 0.975))
  97.5% 
651.425 
> confint(svymean(~api99, orig_rep_design), df=degf(orig_rep_design))
         2.5 %   97.5 %
api99 556.2206 657.7357
> 
> boots <- as.svrepdesign(dclus1, type="bootstrap",replicates=5000)
> reps <- withReplicates(boots, function(w, data) {
+     weighted.mean(data$api99, w)
+ }, return.replicates=T)
> 
> r <- reps$replicates
> print(quantile(r, 0.025))
    2.5% 
599.1006 
> print(quantile(r, 0.975))
   97.5% 
637.5235 
> confint(svymean(~api99, boots), df=degf(boots))
         2.5 %   97.5 %
api99 584.6227 629.3336
bschneidr commented 1 year ago

Hi @petulla,

Thanks for providing the reproducible example.

After reviewing this report, I don't think there's an issue with the 'svrep' code. Rather, I think the input data in this example is simply invalid.

In the example provided, the weights are inconsistent with the fpc value supplied. The weights used for constructing a bootstrap replicate should be the base sampling weights: that is, the weight should be the inverse of the inclusion probability. In the apiclus1 dataset, this should be 757/15, which is the number of population clusters divided by the number of sample clusters. However, Dr. Lumley notes in the documentation for the 'survey' package that the weights pw in the apiclus1 dataset are incorrect (see the "Details" section here.) And then when you set the fpc to 18 without updating the weights accordingly, that produced another inconsistency between the weights and the finite population corrections.

In other words, the reason you're getting unexpected results is that you've supplied invalid sampling weights or an invalid fpc, depending on which input one might arbitrarily decide is the definitive input to use for determining sampling probabilities. I think it makes sense to treat the sampling weights as the definitive input for determining sampling probabilities, since that works for both simple random sampling and unequal probability sampling, and that's what weights are supposed to represent.

Does that make sense and resolve things?

petulla commented 1 year ago

Indeed, makes sense! Thanks for the reply. I somewhere along the way got confused on the psu population count. As you mentioned, Lumely explains that they are incorrectly labeled in the dataset (and so are the weights, accordingly).

Running with appropriate values gives good results.

data(api, package = "survey")
apiclus1$fpc <- 757
apiclus1$pw <- 757/15
dclus1 <- svydesign(data = apiclus1,
                    id = ~dnum, weights = ~pw, fpc=~fpc)

# Create replicate-weights survey design
orig_rep_design <- as_bootstrap_design(dclus1, replicates = 5000,
                                       type = "Rao-Wu-Yue-Beaumont")

reps <- withReplicates(orig_rep_design, function(w, data) {
    weighted.mean(data$api99, w)
}, return.replicates=T)

r <- reps$replicates
print(quantile(r, 0.025))
print(quantile(r, 0.975))
confint(svymean(~api99, orig_rep_design), df=degf(orig_rep_design))

boots <- as.svrepdesign(dclus1, type="bootstrap",replicates=5000)
reps <- withReplicates(boots, function(w, data) {
    weighted.mean(data$api99, w)
}, return.replicates=T)

r <- reps$replicates
print(quantile(r, 0.025))
print(quantile(r, 0.975))
confint(svymean(~api99, boots), df=degf(boots))
    2.5% 
559.8802 
   97.5% 
651.7246 
         2.5 %   97.5 %
api99 555.2295 658.7268
   2.5% 
562.291 
   97.5% 
648.8981 
         2.5 %   97.5 %
api99 555.8275 658.1288