bschneidr / fastsurvey

A fork of the `survey` R package, using {Rcpp}
8 stars 0 forks source link

Disagreement between base R and C++ with `survey.lonely.psu = 'adjust'` and `survey.adjust.domain.lonely = TRUE` #10

Open bschneidr opened 1 year ago

bschneidr commented 1 year ago

The unit tests for domain lonely PSUs (i.e., lonely PSUs from subsetting) are showing disagreement between base R and C++ when using survey.lonely.psu = 'adjust' and survey.adjust.domain.lonely = TRUE.

I'll follow up here with an explanation of what the two functions are doing differently.

   survey.lonely.psu survey.adjust.domain.lonely one.stage results_match
1          certainty                       FALSE      TRUE          TRUE
2             remove                       FALSE      TRUE          TRUE
3            average                       FALSE      TRUE          TRUE
4             adjust                       FALSE      TRUE         FALSE
5          certainty                        TRUE      TRUE          TRUE
6             remove                        TRUE      TRUE          TRUE
7            average                        TRUE      TRUE          TRUE
8             adjust                        TRUE      TRUE         FALSE
9          certainty                       FALSE     FALSE          TRUE
10            remove                       FALSE     FALSE          TRUE
11           average                       FALSE     FALSE          TRUE
12            adjust                       FALSE     FALSE         FALSE
13         certainty                        TRUE     FALSE          TRUE
14            remove                        TRUE     FALSE          TRUE
15           average                        TRUE     FALSE          TRUE
16            adjust                        TRUE     FALSE         FALSE
bschneidr commented 1 year ago

The R and the C++ implementations of survey.lonely.psu = "adjust" for domain-lonely PSUs differ in terms of the values they use for the PSUs that are not part of the current subset. The base R implementation makes the variance contribution of the non-domain PSUs equal to zero, while the C++ implementation gives the non-domain PSUs a non-zero variance contribution.

Formulas

Suppose we have a stratum $h$ with $nh$ sampled PSUs, but only one PSU with any observations in the current domain. Denote $y{hi}$ the PSU total for PSU $i$ in stratum $h$. And denote $\bar{Y}^{\star}$ the PSU average from across all strata (i.e., $\hat{Y} / n$).

For convenience, we'll index the lonely PSU as $hi=h1$.

Here's what base R multistage() is doing:

$$ v_{base} \left( \hat{Y_h} \right) = \frac{n_h}{nh - 1} \left[ (y{h1} - Y^{\star})^2 + (n_h - 1)(0 - 0)^2 \right] $$

And here's what the C++ multistage() is doing:

$$ v_{C++} \left( \hat{Y_h} \right) = \frac{n_h}{nh - 1} \left[ (y{h1} - Y^{\star})^2 + (n_h - 1)(0 - \bar{Y}^{\star})^2 \right] $$

Which Formula Is Preferable?

For an influence function (e.g., in svymean(), svyglm(), etc.) it doesn't really matter which of the two above options is used, since $\bar{Y}^{\star}$ should be zero. So the choice of formula only really matters for estimating a population total.

For a domain population total, I think neither of the formulas above is preferable. Instead, it's better to just use the usual variance estimator, where each PSU is centered around the mean PSU from the stratum, since that provides an unbiased variance estimate for a domain total. For a stratum with only a single observation in the domain of interest, the usual variance estimator can be written as:

$$ v \left( \hat{Y_h} \right) = \frac{n_h}{nh - 1} \left[ (y{h1} - \bar{Y_h})^2 + (n_h - 1)(0 - \bar{Y_h})^2 \right] $$

What to Do?

I'm still figuring this out. I'll update this GitHub issue when I have a recommendation.

One approach would be to allow multistage()/multistage_rcpp() to know whether they're being used for estimating equations or for totals, so that it can avoid performing the domain lonely PSU adjustments if it's being used for totals.

Another approach would be to throw a warning in svytotal() when they've set the option survey.lonely.psu = 'adjust' and survey.adjust.domain.lonely = TRUE, and then just pick one of the formulas above to use in both multistage() and multistage_rcpp().

colin-peterson commented 1 year ago

I would suggest producing the same estimates, under default behavior, as does the original survey package, and saving the alternative formulations for some other (new?) survey.lonely.psu options. FWIW, the "adjust" option for survey will produce the same standard errors as Stata under their corresponding "centered" option, so they seem to have come to similar conclusions.

Thanks for your work on this package.

bschneidr commented 1 year ago

Hi @colin-peterson, thanks for weighing in. Have you read the closely related blog post on the bugs in Stata and the survey package?

https://www.practicalsignificance.com/posts/bugs-with-singleton-strata/

The current behavior of the official releases for both ‘survey’ and Stata has very undesirable behavior for estimating the variance of population totals, in part because nobody has really been clear on what the recentering methods are supposed to do for totals.

colin-peterson commented 1 year ago

Thanks for drawing my attention to the blog post.

I appreciate the workup. You may be right that both survey and Stata are flawed here. It's unfortunate that the logic is not better documented and justified. My sense is that the number of people who have thought deeply about this is low. I would still suggest though that providing facility to at least reproduce their behavior, if the user desires, would be useful nonetheless. Replicating published results which use this behavior would be one example.

As a design choice, I think if the goal of fastsurvey is mainly to be a faster (backend) version of survey, then I'd go a step further and suggest still that having the same default behavior would be appropriate. Users who try switching from survey to fastsurvey (as I did) may not think to compare results with the original package, and unknowingly move forward with results built on an alternatively opinionated variance estimation. One way to address this may be to add a warning or message when the adjust option is used, which references the issue and suggests one or more alternatives. If the maintainers of survey and Stata are alerted to this and subsequently change their behavior, default or otherwise, then fastsurvey could follow suit.

Your blog post asks whom to contact on the Stata end. I think Nick Cox is likely the right person. He's quite active on both SO and the statalist forums. I would suggest making a post on one of those, if you could. I would be curious to hear his thoughts on this and would like to follow along.

bschneidr commented 1 year ago

I appreciate those points and your suggestion about contacting Nick Cox- that’s definitely helpful. Some context that might be helpful is that this GitHub repo is really intended to help Thomas incorporate the faster C++ functions directly into the ‘survey’ package and for me to develop the C++ code and unit testing for it. Thomas has mostly finished updating survey to use the C++ code and accompanying tests working on adding this in. The reason I opened this ticket is because we have many unit tests to ensure the C++ variance estimation functions match their base R counterparts, and this is the only area where the unit tests have identified an inconsistency. This issue with domain lonely PSUs is really the last lingering issue before the C++ integration is settled.

Relatedly, Thomas also updated the development version of ‘survey’ to fix the bug I wrote about in that post. When the update to ‘survey’ goes on CRAN, there will be a new package option that allows the user to toggle between using C++ or base R under the hood, but with 95% of the time users sticking with the new default C++ computations.

Hopefully that context addresses the concerns you’re expressing about reproducibility and results matching.

colin-peterson commented 1 year ago

Thank you for the context. I'm really glad to hear the original package is being updated with the new backend.

Do you know if it will be possible to reproduce the existing/bugged adjust values while still benefitting from the C++ speed and memory improvements? That might be useful in some contexts.

bschneidr commented 1 year ago

The behavior for everything but totals is not affected at all by using C++ or by using the fixed version of the survey package that is coming out: you will still get the same results for svymean(), svyglm(), etc.

However, when there are lonely PSUs, you will get different variance estimates for totals from svytotal(). There will be no way to reproduce older incorrect variance estimates for totals while using C++. I'd suggest that that's fine, because the existing/bugged adjust behavior for svytotal() is pretty badly incorrect (though nowhere near as wildly incorrect as Stata).

bschneidr commented 1 year ago

Just noting feedback from Thomas here so it's easier for me to keep track of:

I think the best resolution for the remaining issue is to warn in svytotal() -- as you say, the whole concept of 'adjust' isn't really right for totals. There's a bit of an issue in that some functions use svytotal() on influence functions (including the svyVGAM package) because it's simpler than calling svyrecvar() or twophasevar() as appropriate, but that's fixable.

Following up: I think what you have in the C++ code is preferable to the R code. Note that in multistage sampling without replacement $Y^*$ won't necessarily be near zero, because this is all recursive, so $Y^*$ will be an average over all the stage-2 strata within the current stage-1 stratum.

These are all good points. I'll make some suggested updates to a few places to implement these changes and send it Thomas for review.

bschneidr commented 1 year ago

The updates in 913c5f5 and 774b11ba handles the recentering more elegantly, I think, and in a way that resolves these mismatches.

The idea is that when onestage() passes the data x on to onestrat(), the object x includes an attribute named "recentering" that indicates how the cluster totals should be recentered. Then the recentering happens within onestrat(), after the cluster totals have been calculated and after rows of zeroes have been added to the matrix of cluster totals. This is cleaner I think and also causes the formula used in R to match the formula in C++.

If the user doesn't specify lonely.psu = "adjust", then the value of attr(x, "recentering") is simply 0.

bschneidr commented 1 year ago

I'll update Thomas to let him know the two functions match now, and check how to go about handling the svytotal() warning.

colin-peterson commented 9 months ago

Sorry to necro - I hope this is an OK place to post this. Is it still your finding that the bugged behavior in the current CRAN version of survey, and in Stata, only applies to svytotal() and not svyglm(), svymean(), etc? If I'm reading the literature correctly (e.g., here) I understand that the recommended handling of domain lonely PSUs in the context of, say, a regression model with some missingness in a relevant variable, is to ignore options(survey.lonely.psu) and options(survey.adjust.domain.lonely) and instead use survey's subset argument (or Stata's equivalent subpop). However, it seems that you still get the same SEs as if you pre-filtered the dataset and used options(survey.lonely.psu = 'adjust').

With filter + adjust:

library(tidyverse)
library(broom)
library(survey) # most recent CRAN version (may 2023)

data(api)
filtered <- apiclus1 %>% 
  filter(dnum != 413) %>%
  slice(-31) # from tests/lonely.psu.R

options(survey.lonely.psu = 'adjust')
svyglm(growth ~ mobility, design = 
         svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = filtered)) %>%
  tidy %>% pull(std.error)
#> [1] 3.0112312 0.1362315

Stata equivalent:

library(RStata)
options("RStata.StataPath" = "C:/Programs/Stata18/StataMP-64")
options("RStata.StataVersion" = 18)

stata(
  paste("svyset [weight=pw], strata (dnum) singleunit(centered)",
        "svy: reg growth mobility",
        "regsave, tstat pval ci",
        sep = "\n"
  ), data.out = TRUE, data.in = filtered, stata.echo = FALSE
) %>% pull(stderr) %>% rev
#> [1] 3.0112312 0.1362315

Unfiltered, with missingess added to column:

unfiltered <- apiclus1 %>%
  filter(dnum != 413) %>%
  # instead of removing row 31, we introduce column missingness for it
  mutate(growth = if_else(row_number() == 31, NA_real_, growth))

svyglm(growth ~ mobility, design = 
         svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = unfiltered)) %>%
  tidy %>% pull(std.error)
#> [1] 3.0112312 0.1362315

# lonely PSU option is ignored:
options(survey.lonely.psu = 'certainty')
svyglm(growth ~ mobility, design = 
         svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = unfiltered)) %>%
  tidy %>% pull(std.error)
#> [1] 3.0112312 0.1362315

# with subset argument:
svyglm(growth ~ mobility, subset = !is.na(growth), design = 
         svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = unfiltered)) %>%
  tidy %>% pull(std.error)
#> [1] 3.0112312 0.1362315

Stata equivalent:

stata(
  paste("svyset [weight=pw], strata (dnum)",
        "svy, subpop(if growth != .): reg growth mobility",
        "regsave, tstat pval ci",
        sep = "\n"
  ), data.out = TRUE, data.in = unfiltered, stata.echo = FALSE
) %>% pull(stderr) %>% rev
#> [1] 3.0112312 0.1362315

Different result with options(survey.lonely.psu = 'adjust') and options(survey.adjust.domain.lonely = TRUE):

options(survey.lonely.psu = 'adjust')
options(survey.adjust.domain.lonely = TRUE)
svyglm(growth ~ mobility, subset = !is.na(growth), design = 
         svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = unfiltered)) %>%
  tidy %>% pull(std.error)
#> Warning in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1],
#> : Stratum (406) has only one PSU at stage 1
#> Warning in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1],
#> : Stratum (406) has only one PSU at stage 1

#> Warning in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1],
#> : Stratum (406) has only one PSU at stage 1
#> [1] 3.0189884 0.1363199

Am I right in assuming that the subset/subpop behavior above is correct, and unaffected by the bug you've found with adjust? I'm puzzled because, if the normal recommendation is to avoid pre-filtering data, and to use the subset/subpop arguments instead, we seem to get the same thing as if we did do the filtering and used adjust/centered, at least for these regression model estimates.

bschneidr commented 9 months ago

Here's some useful background that might be helpful in making sense of this. With complex survey data, proper subsetting (or dropping observations with missing values) is handled a special way: the weights for filtered-out observations are just set to 0, and then the analysis software chugs along and does its variance computations keeping all the observations but using a weight of 0 for the "filtered-out" observations. Here's a short StackExchange answer (based on a nice blog post from Thomas Lumley) on that point:

https://stats.stackexchange.com/a/550867/94994

That's how domain estimation works: you set the weights to zero for everything outside of the domain. Unless you explicitly set the option options(survey.adjust.domain.lonely = TRUE), the 'survey' package doesn't make any adjustments for domain lonely PSUs (i.e., situations where a stratum has 2+ PSUs but only one of those PSUs has any nonzero weights).

That's why the first two code examples you provided give the same results.

library(tidyverse)
library(broom)
library(survey) # most recent CRAN version (may 2023)

data(api)

options(survey.lonely.psu = 'adjust')

filtered <- apiclus1 %>% 
  filter(dnum != 413) %>%
  slice(-31) # from tests/lonely.psu.R

unfiltered <- apiclus1 %>%
  filter(dnum != 413) %>%
  # instead of removing row 31, we introduce column missingness for it
  mutate(growth = if_else(row_number() == 31, NA_real_, growth))

filtered |>
  svydesign(data = _, id = ~1, weights = ~pw, strata = ~dnum) |>
  svyglm(growth ~ mobility, design = _) |>
  SE()
#> (Intercept)    mobility 
#>   3.0112312   0.1362315

unfiltered |>
  svydesign(data = _, id = ~1, weights = ~pw, strata = ~dnum) |>
  svyglm(growth ~ mobility, design = _) |>
  SE()
#> (Intercept)    mobility 
#>   3.0112312   0.1362315
bschneidr commented 9 months ago

Independently of the above, which I think addresses the main concern you had, the bug I mentioned in that one blog post applies in either of the following scenarios:

  1. You're estimating a population total
  2. You're estimating pretty much anything (total, mean, regression coefficient, etc.), are accounting for multiple stages of sampling (e.g., by specifying id = ~ PSU + SSU), and the lonely PSU occurs in the second stage.

The first scenario is much more common and the error will tend to be much worse in that scenario. The second stage is way less of a problem but there is still an unwanted bug.

The good news for R anyway is that the bug is fixed in the upcoming version of the 'survey' package, available on the R-Forge repository that Thomas Lumley uses: https://r-forge.r-project.org/scm/?group_id=1788 or in a GitHub mirror of that repository: https://github.com/bschneidr/r-forge-survey-mirror

If you want to install the fixed version, you can use the following:

install.packages("survey", repos="http://R-Forge.R-project.org")
colin-peterson commented 9 months ago

Thanks very much for your responses! They're very helpful. I took the models above and ran a number of different configurations, including the different versions of 'survey', to help me grock the bigger picture. I've condensed column names to make things fit but hopefully it's still interpretable. se.glm here is the SE of the $mobility coefficient in svylglm(growth ~ mobility) from before, and se.total is the SE of svytotal(~growth). If $svy.subset is TRUE then svyglm() gets the subset argument as before and svytotal() gets its svydesign wrapped with subset(), as Lumley does in his blog post.

    svy.lonely.psu svy.adj.dmn.lonely svy.subset prefiltered se.glm_4.2.1 se.glm_4.3 se.total_4.2.1 se.total_4.3
  1 fail           FALSE              TRUE       FALSE              0.136      0.136         13057.       13057.
  2 fail           FALSE              FALSE      TRUE              NA         NA                NA           NA 
  3 fail           FALSE              FALSE      FALSE              0.136      0.136         13390.          NA 
  4 fail           TRUE               TRUE       FALSE              0.136      0.136         13057.       13057.
  5 fail           TRUE               FALSE      TRUE              NA         NA                NA           NA 
  6 fail           TRUE               FALSE      FALSE              0.136      0.136         13390.          NA 
  7 certainty      FALSE              TRUE       FALSE              0.136      0.136         13057.       13057.
  8 certainty      FALSE              FALSE      TRUE               0.136      0.053         12903.        8014.
  9 certainty      FALSE              FALSE      FALSE              0.136      0.136         13390.          NA 
 10 certainty      TRUE               TRUE       FALSE              0.136      0.136         13057.       13057.
 11 certainty      TRUE               FALSE      TRUE               0.136      0.053         12903.        8014.
 12 certainty      TRUE               FALSE      FALSE              0.136      0.136         13390.          NA 
 13 remove         FALSE              TRUE       FALSE              0.136      0.136         13057.       13057.
 14 remove         FALSE              FALSE      TRUE               0.136      0.053         12903.        8014.
 15 remove         FALSE              FALSE      FALSE              0.136      0.136         13390.          NA 
 16 remove         TRUE               TRUE       FALSE              0.136      0.136         13057.       13057.
 17 remove         TRUE               FALSE      TRUE               0.136      0.053         12903.        8014.
 18 remove         TRUE               FALSE      FALSE              0.136      0.136         13390.          NA 
 19 average        FALSE              TRUE       FALSE              0.136      0.136         13057.       13057.
 20 average        FALSE              FALSE      TRUE               0.141      0.055         13390.        8317.
 21 average        FALSE              FALSE      FALSE              0.136      0.136         13390.          NA 
 22 average        TRUE               TRUE       FALSE              0.141      0.055         13390.        8317.
 23 average        TRUE               FALSE      TRUE               0.141      0.055         13390.        8317.
 24 average        TRUE               FALSE      FALSE              0.141      0.055         13390.          NA 
 25 adjust         FALSE              TRUE       FALSE              0.136      0.136         13057.       13057.
 26 adjust         FALSE              FALSE      TRUE               0.136      0.136         13057.       12928.
 27 adjust         FALSE              FALSE      FALSE              0.136      0.136         13390.          NA 
 28 adjust         TRUE               TRUE       FALSE              0.136      0.136         13209.       13066.
 29 adjust         TRUE               FALSE      TRUE               0.136      0.136         13057.       12928.
 30 adjust         TRUE               FALSE      FALSE              0.136      0.136         13390.          NA
code to reproduce table ``` r library(tidyverse) library(withr) oldsurveylib <- paste0(tempdir(), "\\old") newsurveylib <- paste0(tempdir(), "\\new") dir.create(oldsurveylib) dir.create(newsurveylib) install.packages("survey", lib = oldsurveylib) # CRAN, 4.2.1 install.packages("survey", repos="http://R-Forge.R-project.org", lib = newsurveylib) # 4.3 with_package("survey", lib.loc = oldsurveylib, { data(api) }) filtered <- apiclus1 %>% filter(dnum != 413) %>% slice(-31) # from tests/lonely.psu.R unfiltered <- apiclus1 %>% filter(dnum != 413) %>% # instead of removing row 31, we introduce column missingness for it mutate(growth = if_else(row_number() == 31, NA_real_, growth)) results <- tibble(.rows = 1) %>% mutate(survey.lonely.psu = list(c("fail", "certainty", 'remove', 'average', 'adjust')), survey.adjust.domain.lonely = list(c(FALSE, TRUE)), survey.subset = list(c(TRUE, FALSE)), data.prefiltered = list(c(TRUE, FALSE)), version = list(c("4.2.1", "4.3")) ) %>% reduce(names(select(., where(is.list))), .init = ., ~ unnest(.x, all_of(.y))) %>% # don't bother survey subsetting if using prefiltered data filter(!(survey.subset & data.prefiltered)) %>% mutate( SEs = pmap(., ~with(list(...), { lib.loc <- if(version == "4.2.1") oldsurveylib else newsurveylib dat <- if(data.prefiltered) filtered else unfiltered with_package("survey", lib.loc = lib.loc, { options(survey.adjust.domain.lonely = survey.adjust.domain.lonely) options(survey.lonely.psu = survey.lonely.psu) se <- tryCatch({ svyglm(growth ~ mobility, if(survey.subset) subset = !is.na(growth), design = svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = dat)) %>% SE %>% .[2] }, error = function(e) NA_real_) svytot <- tryCatch({ svytotal(~growth, if(survey.subset) { subset( svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = dat), !is.na(growth) ) } else { svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = dat) } ) %>% SE %>% as.numeric }, error = function(e) NA_real_) list( se.svyglm = se, se.svytotal = svytot ) }) })) ) %>% unnest_wider(SEs) %>% mutate(se.svyglm = round(se.svyglm, 3)) %>% pivot_wider(names_from = version, values_from = c(se.svyglm, se.svytotal)) results %>% print(n = 30) ```

Many thanks to you and Dr Lumley for addressing the bug. While the changes from 4.2.1 to 4.3 (at least the ones I've thought through) are helpful, I admit I still have some residual confusion.

If I can trouble you with a few follow up questions: if I'm following you correctly, it seems as though without the subset argument, 'survey' has no way of distinguishing domain-lonely PSUs from design-lonely PSUs, in which case it assumes all lonely PSUs are domain-lonely. Then if survey.adjust.domain.lonely is left FALSE, no adjustment is performed for them, and hence we get the same results for the two models previously discussed (rows 25 and 26 in the table).

If I have that right, then I'm wondering (1) why (incorrectly) pre-filtering the data impacts svytotal() -- as you'd expect, as the blog post discusses, and as does occur in 4.3 contrasting rows 25 and 26 -- but does not impact the variance estimation in svyglm(). In other words, svyglm() doesn't seem to make use of the full design information that you lose if you pre-filter. It may be that I just need to get a better understanding of the delta method as applied to svyglm(). But I'm also wondering why (2), if no adjustment is being made for domain-lonely PSUs, we can still see differences between "adjust" and "average" even when survey.adjust.domain.lonely is FALSE, as in row 20 compared to row 26. It might be that I'm misunderstanding how domain-lonely PSUs are recognized and treated.

I appreciate your patience with me! A big thanks also for remaking the backend. It's a tremendous help -- hard to overstate how much it changes the usability of 'survey' with large models.

bschneidr commented 9 months ago

The subset argument of svyglm() is actually just equivalent to using the subset() function on the design object before passing it to svyglm(). You might find it helpful to take a look at the underlying R code for svyglm() to see how it handles that. But essentially it's equivalent to the following code:

unfiltered |>
  svydesign(data = _, id = ~1, weights = ~pw, strata = ~dnum) |>
  subset(!is.na(growth)) |>
  svyglm(growth ~ mobility, design = _) |>
  SE()

The survey package learns about design-lonely PSUs by keeping track of the strata sample sizes recorded at the time the design was created. When the design is created, the survey package counts the number of PSUs in the sample data, separately for each stratum, and records those in design_object$fpc$sampsize. You can see that here:

unfiltered |>
  svydesign(data = _, id = ~1, weights = ~pw, strata = ~dnum) |>
  getElement("fpc") |>
  getElement("sampsize")

For example, here's the C++ code that identifies design-lonely PSUs:

https://github.com/bschneidr/fastsurvey/blob/774b11bad5a7f7d2b10a2ddf157fd16d19b2cd06/src/arma_multistage.cpp#L92-L102

In contrast, the survey package learns about domain-lonely PSUs by counting the number of PSUs in the subset of data being analyzed. Here's the C++ code that identifies the domain-lonely PSUs:

https://github.com/bschneidr/fastsurvey/blob/774b11bad5a7f7d2b10a2ddf157fd16d19b2cd06/src/arma_multistage.cpp#L106-L116

One thing that you might find helpful for seeing what is being done is to call debug() on a low-level function like survey:::multistage to see what data is being passed to it in the different scenarios you're considering. For example, you'll see that incorrectly pre-filtering the data changes the design_object$fpc$sampsize object being passed to survey:::multistage_rcpp() or survey:::multistage().

bschneidr commented 9 months ago

For this part:

if no adjustment is being made for domain-lonely PSUs, we can still see differences between "adjust" and "average" even when survey.adjust.domain.lonely is FALSE, as in row 20 compared to row 26. It might be that I'm misunderstanding how domain-lonely PSUs are recognized and treated.

The pre-filtering means that you're getting incorrect values for design_object$fpc$sampsize, leading R to think that it has a design-lonely PSU rather than a domain-lonely PSU.

bschneidr commented 9 months ago

For this part:

If I have that right, then I'm wondering (1) why (incorrectly) pre-filtering the data impacts svytotal() -- as you'd expect, as the blog post discusses, and as does occur in 4.3 contrasting rows 25 and 26 -- but does not impact the variance estimation in svyglm(). In other words, svyglm() doesn't seem to make use of the full design information that you lose if you pre-filter. It may be that I just need to get a better understanding of the delta method as applied to svyglm().

There are four key observations which, taken together, should resolve this question:

  1. When you remove an observation from the data, this is implemented in survey software by setting the weights for removed observations equal to 0.
  2. For any statistic other than totals, the weighted sum of the influence functions equals 0. So the full-sample mean, the grand mean of PSUs, etc. should all equal 0.
  3. The delta method works by estimating the sampling variance-covariance of the influence functions.
  4. The option survey.lonely.psu = "adjust" handles lonely PSUs by recentering the lonely PSU's value around the grand mean of PSUs.

Those four things together explain why you would, correctly, get the exact same variance estimate in row 25 in that table (subsetting without pre-filtering) as in row 26 (pre-filtering and using survey.lonely.psu = 'adjust')