bschneidr / svrep

Tools for Creating, Updating, and Analyzing Survey Replicate Weights
https://bschneidr.github.io/svrep/
GNU General Public License v3.0
7 stars 1 forks source link

Feature Suggestion: Improve feasibility of generalized bootstrap for large datasets #20

Open bschneidr opened 1 year ago

bschneidr commented 1 year ago

Right now, the creation of the generalized bootstrap replicate factors requires a lot of computer memory, because:

  1. The algorithm forms the entire quadratic form matrix of the target variance estimator, which is $n \times n$. Even with a survey of only 1,000 cases, that results in a large matrix.
  2. Sampling from the multivariate normal distribution for $B$ replicates results in a $n \times B$ matrix, where $B$ can be quite large.

A couple ideas for making this more feasible are:

  1. Form bootstrap adjustment factors separately by stratum. The covariances of units in different strata are all zero, so we can get correct results by generating adjustment factors separately for each stratum. This means that instead of creating one big $n \times n$ covariance matrix, we can generate $H$ covariance matrices of size $n_h \times n_h$. Within each of the separate strata $H$, the created columns of replicate weights should be shuffled to ensure independence.
    • Update: (1) is done where possible.
  2. Use quadratic forms whose dimension matches the rank. For example, if a the dataset has 10 rows but three clusters, form a $3 \times 3$ quadratic form, generate adjustment factors for three clusters, and then expand the matrix of replicates to the within-cluster units.
    • Update: (2) is done.
  3. Use an alternative R package that is better able to simulate from high-dimensional covariance matrices. The "mvnfast" package might work well. https://cran.r-project.org/web/packages/mvnfast/mvnfast.pdf
    • Update: May adopt (3) at some point.
bschneidr commented 4 months ago

Unfortunately "mvnfast" has no unit tests and the minimal documentation has a lot of typos, so I don't think it's a good idea to rely on it.