Supporting `subset = treatment == 1`

josherrickson commented 1 year ago

We currently get NA's if a user attempts to call lmitt.formula with subset = z == 1 style subsetting; e.g. fitting a model only on the treated.

@benthestatistician Is this a workflow we want to support? (For reference, this is coming up in my efforts to coerce current flexida work to support my dissertation models.)

If yes: We need to figure out how to support this. The model we fit should likely be different, or at least what we report needs to be different.
If no: A better error? We have checks in places to ensure the treatment variable is not constant (I believe at the point of Design creation); should we add an informative error during lmitt.formula after subsetting?

The other coefficients in the model appear consistent:

> data(simdata)
> des <- obs_design(z ~ cluster(cid1, cid2), data = simdata)
> mod <- lmitt(y ~ as.factor(o), data = simdata,  design = des, subset = z == 1)
> mod
`z._as.factor(o)1` `z._as.factor(o)2` `z._as.factor(o)3` `z._as.factor(o)4` 
                NA                 NA                 NA                 NA 
> mod$coeff
       (Intercept)    `as.factor(o)2`    `as.factor(o)3`    `as.factor(o)4` 
        0.06778879         0.04866071                 NA         0.07043610 
`z._as.factor(o)1` `z._as.factor(o)2` `z._as.factor(o)3` `z._as.factor(o)4` 
                NA                 NA                 NA                 NA 
> summary(mod)

Call:
lmitt(y ~ as.factor(o), data = simdata, design = des, subset = z == 
    1)

Treatment Effects: (5 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
`z._as.factor(o)1`       NA         NA      NA       NA
`z._as.factor(o)2`       NA         NA      NA       NA
`z._as.factor(o)3`       NA         NA      NA       NA
`z._as.factor(o)4`       NA         NA      NA       NA

> mod2 <- lm(y ~ as.factor(o), data = simdata, subset = z == 1)
> mod2$coeff
  (Intercept) as.factor(o)2 as.factor(o)4 
   0.06778879    0.04866071    0.07043610

If we add an offset model, we get an additional error:

> camod <- lm(y ~ x, data = simdata, subset = z == 0)
> summary(mod)
Error in chol2inv(x$qr$qr) : 
  element (4, 4) is zero, so the inverse cannot be computed

(Note that as.lmitt doesn't work right either;

> as.lmitt(mod2)
Error in as.lmitt(mod2) : 
  `assigned()` or its aliases are not found in the model formula. `assigned()` needs to be found in place of the treatment variable name. The `lmitt()` function may be used to avoid explicitly indicating `assigned()`.

because we obviously can't include the treatment variable in such a model.)

benthestatistician commented 1 year ago

My view is that we should not attempt to support this usage, despite its having served us nicely in @josherrickson's dissertation project. Instead, I think we should route users to the nearest approximation that we currently support.

In Josh's dissertation, the covariance sample C equals the subset of the experimental/quasiexperimental sample Q that was assigned to control. In this context, if the covariance model is an lm or glm then its (response) residuals sum to 0 over the control group. In consequence, the treatment effect and treatment-moderator interaction that lmitt.f() gives you without the subset = z == 1 are the same as the intercept and moderator main effect that lm(<...>, subset = z == 1) would have given you. So I think we should be directing users to lmitt.f() w/o subset=.
If in this setup the covariance model is anything other than a glm, say a robust or somehow penalized glm, then those control group residuals probably sum to something a little different from 0, and the equivalence of (1) above fails. But then also the coefficients that better bear interpretation as treatment effects are the treatment and treatment-moderator interaction coefficients that would be reported by lmitt.f() without the subset = z == 1.
Similarly, in case where C is a superset of the subset of Q was assigned to control, the equivalence noted in (1) also fails; and again it's the treatment and treatment-moderator interaction coefficients reported by lmitt.f() with no subset = modifier that I think we want to be giving people.

benthestatistician commented 1 year ago

PS: In answer the "if no" question in the issue statement: Yes I'd certainly support an informative error in this circumstance. However, I'm inclined to say it's not something that merits a lot of coding attention unto itself. (If we stay on message, in package documentation, presentations and papers, about the lmitt() without subset= being the way to estimate treatment interactions with a prognostic score, then I see it as unlikely that users would stumble down the lm(<...>, subset = z == 1) path of their own accord.) These views are less congealed than my view on the main yes/no question.

Perhaps the info being assembled for #119 and/or PR #127 could supply raw materials for a useful error message. Those subprojects call for creating tables of numbers of clusters in which different subgroups have representation (when we've been asked to estimate subgroup effects). These counts may be being broken out by block. Whether or not that is the case, if they're being broken out by assignment to treatment versus control, then the tables could be used to determine whether the user has given a subgroup= specification that effectively precludes treatment-control comparisons. While #119 was about subgroups, I imagine that Josh W's code enhancements to address it would be extendable to provide similar tables even when there is no moderator variable, or a continuous moderator. (If they aren't already doing that, which they may be; this may be right there in the PR, but I'm not at liberty to check right at the moment.)

jwasserman2 commented 1 year ago

The code in PR #127 checks these group counts within the vcovDA() call, but we could move these checks to the lmitt() call and perform them for other columns as well. In fact, my response to your comment on that PR @benthestatistician somewhat suggests something along those lines. Since we don't show the coefficients for the subgroups and those are the ones that won't be NA'd by lm() calls, we could run the checks here after the model's been fit and replace the NA'd subgroup x treatment effects with their respective subgroup coefficients while also producing a warning about it.

josherrickson commented 1 year ago

I think between the support for continuous moderators in #81 and Josh's work, I'm good to close this.

benbhansen-stats / propertee

Supporting `subset = treatment == 1` #128