Open bschneidr opened 1 year ago
The R and the C++ implementations of survey.lonely.psu = "adjust"
for domain-lonely PSUs differ in terms of the values they use for the PSUs that are not part of the current subset. The base R implementation makes the variance contribution of the non-domain PSUs equal to zero, while the C++ implementation gives the non-domain PSUs a non-zero variance contribution.
Suppose we have a stratum $h$ with $nh$ sampled PSUs, but only one PSU with any observations in the current domain. Denote $y{hi}$ the PSU total for PSU $i$ in stratum $h$. And denote $\bar{Y}^{\star}$ the PSU average from across all strata (i.e., $\hat{Y} / n$).
For convenience, we'll index the lonely PSU as $hi=h1$.
Here's what base R multistage()
is doing:
$$ v_{base} \left( \hat{Y_h} \right) = \frac{n_h}{nh - 1} \left[ (y{h1} - Y^{\star})^2 + (n_h - 1)(0 - 0)^2 \right] $$
And here's what the C++ multistage()
is doing:
$$ v_{C++} \left( \hat{Y_h} \right) = \frac{n_h}{nh - 1} \left[ (y{h1} - Y^{\star})^2 + (n_h - 1)(0 - \bar{Y}^{\star})^2 \right] $$
For an influence function (e.g., in svymean()
, svyglm()
, etc.) it doesn't really matter which of the two above options is used, since $\bar{Y}^{\star}$ should be zero. So the choice of formula only really matters for estimating a population total.
For a domain population total, I think neither of the formulas above is preferable. Instead, it's better to just use the usual variance estimator, where each PSU is centered around the mean PSU from the stratum, since that provides an unbiased variance estimate for a domain total. For a stratum with only a single observation in the domain of interest, the usual variance estimator can be written as:
$$ v \left( \hat{Y_h} \right) = \frac{n_h}{nh - 1} \left[ (y{h1} - \bar{Y_h})^2 + (n_h - 1)(0 - \bar{Y_h})^2 \right] $$
I'm still figuring this out. I'll update this GitHub issue when I have a recommendation.
One approach would be to allow multistage()
/multistage_rcpp()
to know whether they're being used for estimating equations or for totals, so that it can avoid performing the domain lonely PSU adjustments if it's being used for totals.
Another approach would be to throw a warning in svytotal()
when they've set the option survey.lonely.psu = 'adjust'
and survey.adjust.domain.lonely = TRUE
, and then just pick one of the formulas above to use in both multistage()
and multistage_rcpp()
.
I would suggest producing the same estimates, under default behavior, as does the original survey
package, and saving the alternative formulations for some other (new?) survey.lonely.psu
options. FWIW, the "adjust" option for survey
will produce the same standard errors as Stata under their corresponding "centered" option, so they seem to have come to similar conclusions.
Thanks for your work on this package.
Hi @colin-peterson, thanks for weighing in. Have you read the closely related blog post on the bugs in Stata and the survey package?
https://www.practicalsignificance.com/posts/bugs-with-singleton-strata/
The current behavior of the official releases for both ‘survey’ and Stata has very undesirable behavior for estimating the variance of population totals, in part because nobody has really been clear on what the recentering methods are supposed to do for totals.
Thanks for drawing my attention to the blog post.
I appreciate the workup. You may be right that both survey
and Stata are flawed here. It's unfortunate that the logic is not better documented and justified. My sense is that the number of people who have thought deeply about this is low. I would still suggest though that providing facility to at least reproduce their behavior, if the user desires, would be useful nonetheless. Replicating published results which use this behavior would be one example.
As a design choice, I think if the goal of fastsurvey
is mainly to be a faster (backend) version of survey
, then I'd go a step further and suggest still that having the same default behavior would be appropriate. Users who try switching from survey
to fastsurvey
(as I did) may not think to compare results with the original package, and unknowingly move forward with results built on an alternatively opinionated variance estimation. One way to address this may be to add a warning or message when the adjust
option is used, which references the issue and suggests one or more alternatives. If the maintainers of survey
and Stata are alerted to this and subsequently change their behavior, default or otherwise, then fastsurvey
could follow suit.
Your blog post asks whom to contact on the Stata end. I think Nick Cox is likely the right person. He's quite active on both SO and the statalist forums. I would suggest making a post on one of those, if you could. I would be curious to hear his thoughts on this and would like to follow along.
I appreciate those points and your suggestion about contacting Nick Cox- that’s definitely helpful. Some context that might be helpful is that this GitHub repo is really intended to help Thomas incorporate the faster C++ functions directly into the ‘survey’ package and for me to develop the C++ code and unit testing for it. Thomas has mostly finished updating survey to use the C++ code and accompanying tests working on adding this in. The reason I opened this ticket is because we have many unit tests to ensure the C++ variance estimation functions match their base R counterparts, and this is the only area where the unit tests have identified an inconsistency. This issue with domain lonely PSUs is really the last lingering issue before the C++ integration is settled.
Relatedly, Thomas also updated the development version of ‘survey’ to fix the bug I wrote about in that post. When the update to ‘survey’ goes on CRAN, there will be a new package option that allows the user to toggle between using C++ or base R under the hood, but with 95% of the time users sticking with the new default C++ computations.
Hopefully that context addresses the concerns you’re expressing about reproducibility and results matching.
Thank you for the context. I'm really glad to hear the original package is being updated with the new backend.
Do you know if it will be possible to reproduce the existing/bugged adjust
values while still benefitting from the C++ speed and memory improvements? That might be useful in some contexts.
The behavior for everything but totals is not affected at all by using C++ or by using the fixed version of the survey package that is coming out: you will still get the same results for svymean()
, svyglm()
, etc.
However, when there are lonely PSUs, you will get different variance estimates for totals from svytotal()
. There will be no way to reproduce older incorrect variance estimates for totals while using C++. I'd suggest that that's fine, because the existing/bugged adjust
behavior for svytotal()
is pretty badly incorrect (though nowhere near as wildly incorrect as Stata).
Just noting feedback from Thomas here so it's easier for me to keep track of:
I think the best resolution for the remaining issue is to warn in svytotal() -- as you say, the whole concept of 'adjust' isn't really right for totals. There's a bit of an issue in that some functions use svytotal() on influence functions (including the svyVGAM package) because it's simpler than calling svyrecvar() or twophasevar() as appropriate, but that's fixable.
Following up: I think what you have in the C++ code is preferable to the R code. Note that in multistage sampling without replacement $Y^*$ won't necessarily be near zero, because this is all recursive, so $Y^*$ will be an average over all the stage-2 strata within the current stage-1 stratum.
These are all good points. I'll make some suggested updates to a few places to implement these changes and send it Thomas for review.
The updates in 913c5f5 and 774b11ba handles the recentering more elegantly, I think, and in a way that resolves these mismatches.
The idea is that when onestage()
passes the data x
on to onestrat()
, the object x
includes an attribute named "recentering"
that indicates how the cluster totals should be recentered. Then the recentering happens within onestrat()
, after the cluster totals have been calculated and after rows of zeroes have been added to the matrix of cluster totals. This is cleaner I think and also causes the formula used in R to match the formula in C++.
If the user doesn't specify lonely.psu = "adjust"
, then the value of attr(x, "recentering")
is simply 0
.
I'll update Thomas to let him know the two functions match now, and check how to go about handling the svytotal()
warning.
Sorry to necro - I hope this is an OK place to post this. Is it still your finding that the bugged behavior in the current CRAN version of survey
, and in Stata, only applies to svytotal()
and not svyglm()
, svymean()
, etc? If I'm reading the literature correctly (e.g., here) I understand that the recommended handling of domain lonely PSUs in the context of, say, a regression model with some missingness in a relevant variable, is to ignore options(survey.lonely.psu)
and options(survey.adjust.domain.lonely)
and instead use survey
's subset
argument (or Stata's equivalent subpop
). However, it seems that you still get the same SEs as if you pre-filtered the dataset and used options(survey.lonely.psu = 'adjust')
.
With filter + adjust
:
library(tidyverse)
library(broom)
library(survey) # most recent CRAN version (may 2023)
data(api)
filtered <- apiclus1 %>%
filter(dnum != 413) %>%
slice(-31) # from tests/lonely.psu.R
options(survey.lonely.psu = 'adjust')
svyglm(growth ~ mobility, design =
svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = filtered)) %>%
tidy %>% pull(std.error)
#> [1] 3.0112312 0.1362315
Stata equivalent:
library(RStata)
options("RStata.StataPath" = "C:/Programs/Stata18/StataMP-64")
options("RStata.StataVersion" = 18)
stata(
paste("svyset [weight=pw], strata (dnum) singleunit(centered)",
"svy: reg growth mobility",
"regsave, tstat pval ci",
sep = "\n"
), data.out = TRUE, data.in = filtered, stata.echo = FALSE
) %>% pull(stderr) %>% rev
#> [1] 3.0112312 0.1362315
Unfiltered, with missingess added to column:
unfiltered <- apiclus1 %>%
filter(dnum != 413) %>%
# instead of removing row 31, we introduce column missingness for it
mutate(growth = if_else(row_number() == 31, NA_real_, growth))
svyglm(growth ~ mobility, design =
svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = unfiltered)) %>%
tidy %>% pull(std.error)
#> [1] 3.0112312 0.1362315
# lonely PSU option is ignored:
options(survey.lonely.psu = 'certainty')
svyglm(growth ~ mobility, design =
svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = unfiltered)) %>%
tidy %>% pull(std.error)
#> [1] 3.0112312 0.1362315
# with subset argument:
svyglm(growth ~ mobility, subset = !is.na(growth), design =
svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = unfiltered)) %>%
tidy %>% pull(std.error)
#> [1] 3.0112312 0.1362315
Stata equivalent:
stata(
paste("svyset [weight=pw], strata (dnum)",
"svy, subpop(if growth != .): reg growth mobility",
"regsave, tstat pval ci",
sep = "\n"
), data.out = TRUE, data.in = unfiltered, stata.echo = FALSE
) %>% pull(stderr) %>% rev
#> [1] 3.0112312 0.1362315
Different result with options(survey.lonely.psu = 'adjust')
and options(survey.adjust.domain.lonely = TRUE)
:
options(survey.lonely.psu = 'adjust')
options(survey.adjust.domain.lonely = TRUE)
svyglm(growth ~ mobility, subset = !is.na(growth), design =
svydesign(id = ~1, weights = ~pw, strata = ~dnum, data = unfiltered)) %>%
tidy %>% pull(std.error)
#> Warning in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1],
#> : Stratum (406) has only one PSU at stage 1
#> Warning in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1],
#> : Stratum (406) has only one PSU at stage 1
#> Warning in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1],
#> : Stratum (406) has only one PSU at stage 1
#> [1] 3.0189884 0.1363199
Am I right in assuming that the subset
/subpop
behavior above is correct, and unaffected by the bug you've found with adjust
? I'm puzzled because, if the normal recommendation is to avoid pre-filtering data, and to use the subset
/subpop
arguments instead, we seem to get the same thing as if we did do the filtering and used adjust
/centered
, at least for these regression model estimates.
Here's some useful background that might be helpful in making sense of this. With complex survey data, proper subsetting (or dropping observations with missing values) is handled a special way: the weights for filtered-out observations are just set to 0, and then the analysis software chugs along and does its variance computations keeping all the observations but using a weight of 0 for the "filtered-out" observations. Here's a short StackExchange answer (based on a nice blog post from Thomas Lumley) on that point:
https://stats.stackexchange.com/a/550867/94994
That's how domain estimation works: you set the weights to zero for everything outside of the domain. Unless you explicitly set the option options(survey.adjust.domain.lonely = TRUE)
, the 'survey' package doesn't make any adjustments for domain lonely PSUs (i.e., situations where a stratum has 2+ PSUs but only one of those PSUs has any nonzero weights).
That's why the first two code examples you provided give the same results.
library(tidyverse)
library(broom)
library(survey) # most recent CRAN version (may 2023)
data(api)
options(survey.lonely.psu = 'adjust')
filtered <- apiclus1 %>%
filter(dnum != 413) %>%
slice(-31) # from tests/lonely.psu.R
unfiltered <- apiclus1 %>%
filter(dnum != 413) %>%
# instead of removing row 31, we introduce column missingness for it
mutate(growth = if_else(row_number() == 31, NA_real_, growth))
filtered |>
svydesign(data = _, id = ~1, weights = ~pw, strata = ~dnum) |>
svyglm(growth ~ mobility, design = _) |>
SE()
#> (Intercept) mobility
#> 3.0112312 0.1362315
unfiltered |>
svydesign(data = _, id = ~1, weights = ~pw, strata = ~dnum) |>
svyglm(growth ~ mobility, design = _) |>
SE()
#> (Intercept) mobility
#> 3.0112312 0.1362315
Independently of the above, which I think addresses the main concern you had, the bug I mentioned in that one blog post applies in either of the following scenarios:
id = ~ PSU + SSU
), and the lonely PSU occurs in the second stage.The first scenario is much more common and the error will tend to be much worse in that scenario. The second stage is way less of a problem but there is still an unwanted bug.
The good news for R anyway is that the bug is fixed in the upcoming version of the 'survey' package, available on the R-Forge repository that Thomas Lumley uses: https://r-forge.r-project.org/scm/?group_id=1788 or in a GitHub mirror of that repository: https://github.com/bschneidr/r-forge-survey-mirror
If you want to install the fixed version, you can use the following:
install.packages("survey", repos="http://R-Forge.R-project.org")
Thanks very much for your responses! They're very helpful. I took the models above and ran a number of different configurations, including the different versions of 'survey', to help me grock the bigger picture. I've condensed column names to make things fit but hopefully it's still interpretable. se.glm
here is the SE of the $mobility
coefficient in svylglm(growth ~ mobility)
from before, and se.total
is the SE of svytotal(~growth)
. If $svy.subset
is TRUE
then svyglm()
gets the subset
argument as before and svytotal()
gets its svydesign
wrapped with subset()
, as Lumley does in his blog post.
svy.lonely.psu svy.adj.dmn.lonely svy.subset prefiltered se.glm_4.2.1 se.glm_4.3 se.total_4.2.1 se.total_4.3
1 fail FALSE TRUE FALSE 0.136 0.136 13057. 13057.
2 fail FALSE FALSE TRUE NA NA NA NA
3 fail FALSE FALSE FALSE 0.136 0.136 13390. NA
4 fail TRUE TRUE FALSE 0.136 0.136 13057. 13057.
5 fail TRUE FALSE TRUE NA NA NA NA
6 fail TRUE FALSE FALSE 0.136 0.136 13390. NA
7 certainty FALSE TRUE FALSE 0.136 0.136 13057. 13057.
8 certainty FALSE FALSE TRUE 0.136 0.053 12903. 8014.
9 certainty FALSE FALSE FALSE 0.136 0.136 13390. NA
10 certainty TRUE TRUE FALSE 0.136 0.136 13057. 13057.
11 certainty TRUE FALSE TRUE 0.136 0.053 12903. 8014.
12 certainty TRUE FALSE FALSE 0.136 0.136 13390. NA
13 remove FALSE TRUE FALSE 0.136 0.136 13057. 13057.
14 remove FALSE FALSE TRUE 0.136 0.053 12903. 8014.
15 remove FALSE FALSE FALSE 0.136 0.136 13390. NA
16 remove TRUE TRUE FALSE 0.136 0.136 13057. 13057.
17 remove TRUE FALSE TRUE 0.136 0.053 12903. 8014.
18 remove TRUE FALSE FALSE 0.136 0.136 13390. NA
19 average FALSE TRUE FALSE 0.136 0.136 13057. 13057.
20 average FALSE FALSE TRUE 0.141 0.055 13390. 8317.
21 average FALSE FALSE FALSE 0.136 0.136 13390. NA
22 average TRUE TRUE FALSE 0.141 0.055 13390. 8317.
23 average TRUE FALSE TRUE 0.141 0.055 13390. 8317.
24 average TRUE FALSE FALSE 0.141 0.055 13390. NA
25 adjust FALSE TRUE FALSE 0.136 0.136 13057. 13057.
26 adjust FALSE FALSE TRUE 0.136 0.136 13057. 12928.
27 adjust FALSE FALSE FALSE 0.136 0.136 13390. NA
28 adjust TRUE TRUE FALSE 0.136 0.136 13209. 13066.
29 adjust TRUE FALSE TRUE 0.136 0.136 13057. 12928.
30 adjust TRUE FALSE FALSE 0.136 0.136 13390. NA
Many thanks to you and Dr Lumley for addressing the bug. While the changes from 4.2.1 to 4.3 (at least the ones I've thought through) are helpful, I admit I still have some residual confusion.
If I can trouble you with a few follow up questions: if I'm following you correctly, it seems as though without the subset
argument, 'survey' has no way of distinguishing domain-lonely PSUs from design-lonely PSUs, in which case it assumes all lonely PSUs are domain-lonely. Then if survey.adjust.domain.lonely
is left FALSE
, no adjustment is performed for them, and hence we get the same results for the two models previously discussed (rows 25 and 26 in the table).
If I have that right, then I'm wondering (1) why (incorrectly) pre-filtering the data impacts svytotal()
-- as you'd expect, as the blog post discusses, and as does occur in 4.3 contrasting rows 25 and 26 -- but does not impact the variance estimation in svyglm()
. In other words, svyglm()
doesn't seem to make use of the full design information that you lose if you pre-filter. It may be that I just need to get a better understanding of the delta method as applied to svyglm()
. But I'm also wondering why (2), if no adjustment is being made for domain-lonely PSUs, we can still see differences between "adjust" and "average" even when survey.adjust.domain.lonely
is FALSE
, as in row 20 compared to row 26. It might be that I'm misunderstanding how domain-lonely PSUs are recognized and treated.
I appreciate your patience with me! A big thanks also for remaking the backend. It's a tremendous help -- hard to overstate how much it changes the usability of 'survey' with large models.
The subset
argument of svyglm()
is actually just equivalent to using the subset()
function on the design object before passing it to svyglm()
. You might find it helpful to take a look at the underlying R code for svyglm()
to see how it handles that. But essentially it's equivalent to the following code:
unfiltered |>
svydesign(data = _, id = ~1, weights = ~pw, strata = ~dnum) |>
subset(!is.na(growth)) |>
svyglm(growth ~ mobility, design = _) |>
SE()
The survey package learns about design-lonely PSUs by keeping track of the strata sample sizes recorded at the time the design was created. When the design is created, the survey package counts the number of PSUs in the sample data, separately for each stratum, and records those in design_object$fpc$sampsize
. You can see that here:
unfiltered |>
svydesign(data = _, id = ~1, weights = ~pw, strata = ~dnum) |>
getElement("fpc") |>
getElement("sampsize")
For example, here's the C++ code that identifies design-lonely PSUs:
In contrast, the survey package learns about domain-lonely PSUs by counting the number of PSUs in the subset of data being analyzed. Here's the C++ code that identifies the domain-lonely PSUs:
One thing that you might find helpful for seeing what is being done is to call debug()
on a low-level function like survey:::multistage
to see what data is being passed to it in the different scenarios you're considering. For example, you'll see that incorrectly pre-filtering the data changes the design_object$fpc$sampsize
object being passed to survey:::multistage_rcpp()
or survey:::multistage()
.
For this part:
if no adjustment is being made for domain-lonely PSUs, we can still see differences between "adjust" and "average" even when survey.adjust.domain.lonely is FALSE, as in row 20 compared to row 26. It might be that I'm misunderstanding how domain-lonely PSUs are recognized and treated.
The pre-filtering means that you're getting incorrect values for design_object$fpc$sampsize
, leading R to think that it has a design-lonely PSU rather than a domain-lonely PSU.
For this part:
If I have that right, then I'm wondering (1) why (incorrectly) pre-filtering the data impacts svytotal() -- as you'd expect, as the blog post discusses, and as does occur in 4.3 contrasting rows 25 and 26 -- but does not impact the variance estimation in svyglm(). In other words, svyglm() doesn't seem to make use of the full design information that you lose if you pre-filter. It may be that I just need to get a better understanding of the delta method as applied to svyglm().
There are four key observations which, taken together, should resolve this question:
survey.lonely.psu = "adjust"
handles lonely PSUs by recentering the lonely PSU's value around the grand mean of PSUs.Those four things together explain why you would, correctly, get the exact same variance estimate in row 25 in that table (subsetting without pre-filtering) as in row 26 (pre-filtering and using survey.lonely.psu = 'adjust'
)
The unit tests for domain lonely PSUs (i.e., lonely PSUs from subsetting) are showing disagreement between base R and C++ when using
survey.lonely.psu = 'adjust'
andsurvey.adjust.domain.lonely = TRUE
.I'll follow up here with an explanation of what the two functions are doing differently.