acoppock / FEDAI

Helper functions and data for the FEDAI textbook
MIT License

Covariate Selection: include imbalanced covariates? #5

Open YinghuiZhouu opened 2 months ago

YinghuiZhouu commented 2 months ago

Mutz et al. (2019)

Main focus:

Determine whether there is statistical evidence to support the argument: “Covariates may be included because their distribution is unbalanced across treatment groups.” (Mutz et al., 2019, p. 34)

Main argument:

Theorem

Let $C(\cdot, \cdot)$ denote empirical covariance and $V(X) = C(X, X)$ the empirical variance. The estimators

$$ \hat{\beta}_0 = \frac{C(X, Y)}{V(X)} $$

$$ \hat{\beta} = \frac{C(X, Y)V(Z) - C(X, Z)C(Y, Z)}{V(X)V(Z) - C(X, Z)^2} $$

are the result of regressing $Y$ on $X$ alone or on both $X$ and $Z$, respectively.
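To make the two formulas concrete, here is a minimal Python sketch that computes both estimators directly from empirical covariances. The function names and the tiny noise-free data set are hypothetical, chosen only so the arithmetic is easy to check by hand; this is not the R code used elsewhere in this thread.

```python
from statistics import fmean

def cov(u, v):
    """Empirical covariance C(u, v), using 1/N normalization."""
    mu_u, mu_v = fmean(u), fmean(v)
    return fmean([(a - mu_u) * (b - mu_v) for a, b in zip(u, v)])

def beta_unadjusted(x, y):
    """Slope from regressing y on x alone: C(X, Y) / V(X)."""
    return cov(x, y) / cov(x, x)

def beta_adjusted(x, y, z):
    """Coefficient on x from regressing y on both x and z."""
    num = cov(x, y) * cov(z, z) - cov(x, z) * cov(y, z)
    den = cov(x, x) * cov(z, z) - cov(x, z) ** 2
    return num / den

# Hypothetical data: binary treatment x, binary covariate z that is
# slightly imbalanced across arms, and a noise-free outcome
# y = 1 + 2x + 3z (so beta = 2 and gamma = 3).
x = [0, 0, 0, 0, 1, 1, 1, 1]
z = [0, 1, 0, 1, 1, 1, 0, 1]
y = [1 + 2 * xi + 3 * zi for xi, zi in zip(x, z)]

b0 = beta_unadjusted(x, y)    # about 2.75: absorbs gamma * C(X, Z) / V(X)
b1 = beta_adjusted(x, y, z)   # about 2.0: recovers beta in this sample
```

In this particular sample the unadjusted estimate is off by $\gamma\, C(X, Z) / V(X) = 3 \times 0.25$, which is the conditional bias from the chance imbalance in $Z$; averaged over randomizations, that bias vanishes.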

Suppose that the truth is given by the linear model

$$ Y = \mu + \beta X + \gamma Z + \theta \xi, $$

where $\xi_j$ are independent mean zero unit variance noise terms. Despite the omission of $Z$ from the first model, both $\hat{\beta}_0$ and $\hat{\beta}$ are unconditionally unbiased estimators of $\beta$. The variance of $Y$ explained by $Z$ is $\gamma^2E(Z^2)$, while the variance unexplained by either $X$ or $Z$ is the variance $\theta^2$ of the noise term. The ratio

$$ T := \frac{\gamma^2 V(Z)}{\theta^2} $$

is a parameter of the model measuring the portion of variance of $Y$ explained by $Z$. The following result is proved in the online appendix.

Theorem 1. Given the $X$ and $Z$ variables, let $r^2 = \frac{C(X, Z)^2}{V(X)V(Z)}$. Then $$ E_{\ast}(\hat{\beta} - \beta)^2 \leq E_{\ast}(\hat{\beta}_0 - \beta)^2 $$ if and only if $T > \frac{1}{N(1 - r^2)}$, where $N$ is the sample size and $E_{\ast}$ denotes conditional expectation with respect to the $X$ and $Z$ variables.


“$r^2$ is known before treatment”

“This result says that one should be less inclined to include $Z$, not more, if you see that $Z$ is imbalanced ($r^2$ is large)”

An intuitive explanation is that collinearity between the treatment and the covariate introduces uncertainty as to which is responsible for any variation predicted in the dependent variable. The threshold, it should be noted, is small, and minimizing the variance of the estimator is far from the only goal. Theorem 1 does not dictate what variables to include, but it does imply that a failed balance test is not a good reason for including a variable in the analysis.
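The collinearity intuition can be checked numerically. The sketch below is my own working from standard OLS results under the linear model above, not something taken from the paper's appendix: conditional on $X$ and $Z$, the unadjusted estimator has squared bias $\gamma^2 C(X, Z)^2 / V(X)^2$ plus variance $\theta^2 / (N V(X))$, while the adjusted estimator is conditionally unbiased with variance $\theta^2 / (N V(X)(1 - r^2))$. Dividing both by $\theta^2 / V(X)$ leaves expressions in $T$, $N$, and $r^2$ only. The function name and the example values are hypothetical.

```python
def conditional_mse(T, N, r2):
    """Conditional MSEs of the two estimators, in units of theta^2 / V(X).

    T  : ratio gamma^2 V(Z) / theta^2
    N  : sample size
    r2 : squared empirical correlation between X and Z (0 <= r2 < 1)
    """
    if not 0 <= r2 < 1:
        raise ValueError("r2 must lie in [0, 1)")
    # Omitting Z: squared omitted-variable bias plus sampling variance.
    mse_unadjusted = T * r2 + 1 / N
    # Including Z: no conditional bias, but variance inflated by collinearity.
    mse_adjusted = 1 / (N * (1 - r2))
    return mse_unadjusted, mse_adjusted

# Same weak covariate (T = 0.015) and sample size, two levels of imbalance.
mild_unadj, mild_adj = conditional_mse(T=0.015, N=100, r2=0.1)
severe_unadj, severe_adj = conditional_mse(T=0.015, N=100, r2=0.5)

adjust_helps_when_mild = mild_adj < mild_unadj        # True
adjust_helps_when_severe = severe_adj < severe_unadj  # False
```

With these assumed values, adjusting for $Z$ lowers the conditional MSE when the imbalance is mild and raises it when the imbalance is severe, which matches the point that a failed balance test makes inclusion harder to justify on efficiency grounds, not easier.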

Theorem 2. Consider two estimators of a treatment effect, one the simple difference of means estimator and one post-stratified by a covariate taking the values 0 and 1. The post-stratified estimator will have a lower MSE than the simple estimator if and only if the ratio of variance in the dependent variable predicted by the covariate to the unpredicted variance is greater than

$$ \frac{1}{n} \cdot \frac{[ab(a + b) + cd(c + d)](a + b)(c + d)}{abcd(a + b + c + d)}. $$

Here, $a$, $b$, $c$, and $d$ are the respective sizes of the groups of controls with covariate value 0, controls with covariate value 1, treated subjects with covariate value 0, and treated subjects with covariate value 1.

The threshold is somewhat complicated, but it is minimized when the group sizes are equal, and it always grows when the distribution of the covariate within a treatment group becomes more imbalanced. This means that a failed balance test is not a good reason to perform a stratified analysis. If a balance test fails, the threshold for stratification to improve efficiency increases rather than decreases.
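A quick way to see how the threshold moves with imbalance is to evaluate it for a few cell counts. The Python sketch below does this with exact fractions. The function name is hypothetical, and because the excerpt above does not define $n$, it is passed in explicitly; the examples assume $n = a + b + c + d$, which is my assumption rather than something stated in the quoted theorem.

```python
from fractions import Fraction

def stratification_threshold(a, b, c, d, n):
    """Threshold from Theorem 2 for the four cell sizes.

    a, b : controls with covariate value 0 and 1
    c, d : treated subjects with covariate value 0 and 1
    n    : the n in the formula (not defined in the excerpt)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be non-empty")
    num = (a * b * (a + b) + c * d * (c + d)) * (a + b) * (c + d)
    den = a * b * c * d * (a + b + c + d)
    return Fraction(num, den) / n

# ASSUMPTION: n is the total sample size, n = a + b + c + d = 100.
balanced = stratification_threshold(25, 25, 25, 25, n=100)    # 1/25
imbalanced = stratification_threshold(40, 10, 10, 40, n=100)  # 1/16
```

Under that assumption, moving from perfectly balanced cells to a 40/10 versus 10/40 split raises the threshold from 0.04 to 0.0625, so the covariate must predict more of the outcome before stratifying pays off.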

Adam Zelizer's R program

Setup:

Main results

YinghuiZhouu commented 2 months ago

New direction to show (in Chapter 4): Don:

As I understand:

YinghuiZhouu commented 2 months ago

Jiawei: I would first consider what constitutes a better test for imbalance, or what might be a more meaningful null hypothesis for the imbalance test (similar to the parallel trend pre-test in Difference-in-Differences analysis). I personally agree that the null hypothesis should be that covariates are imbalanced rather than balanced, as argued in this paper: https://onlinelibrary.wiley.com/doi/full/10.1111/ajps.12387.

YinghuiZhouu commented 2 months ago

To-do from meeting May 10th: