benbhansen-stats / propertee

Prognostic Regression Offsets with Propagation of ERrors, for Treatment Effect Estimation (IES R305D210029).
https://benbhansen-stats.github.io/propertee/

Adding `adopters()` to `lmitt.formula`'s formula #57

Closed: josherrickson closed this 2 years ago

josherrickson commented 2 years ago

Following up on #40, in that thread we settled on the following:

  1. If the user provides adopters() on the RHS of the formula for lmitt.formula(), do not amend it.
  2. If the user does not provide adopters(), replace each variable x on the RHS with x:adopters().

E.g. if the user calls:

lmitt(y ~ g + h, data = mydata, design = des)

then the model estimated will be

lm(y ~ g:adopters() + h:adopters(), data = mydata, design = des)

This thread is to discuss whether this is the desired approach and any issues surrounding this modification of the formula.
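For concreteness, here is a minimal sketch of the rule-2 rewrite. This is illustrative only, not propertee's implementation: amend_rhs is a hypothetical helper, and it sidesteps the intercept-only edge cases raised below.

## Hypothetical helper: interact each RHS term with adopters() unless the
## user already supplied adopters() somewhere on the RHS (rule 1).
amend_rhs <- function(fmla) {
  rhs_terms <- attr(terms(fmla), "term.labels")
  if (any(grepl("adopters()", rhs_terms, fixed = TRUE))) {
    return(fmla)                                  # rule 1: leave untouched
  }
  reformulate(paste0(rhs_terms, ":adopters()"),   # rule 2: g -> g:adopters()
              response = fmla[[2]])
}

amend_rhs(y ~ g + h)
## y ~ g:adopters() + h:adopters()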

josherrickson commented 2 years ago
  1. How should 1 and 0 be handled in the formula? Does y ~ 1 get replaced with y ~ 1:adopters() (thus forcing an intercept of 0) or y ~ adopters()? What if the user explicitly includes the 1, as in y ~ 1 + x: does this become y ~ 1:adopters() + x:adopters(), y ~ 1 + x:adopters(), or y ~ adopters() + x:adopters()? What about 0? Is there any special case when the user passes something like y ~ 0 + g?
  2. How strict should we be about not allowing the treatment variable on the RHS? Should this produce an error or a warning? Right now we do nothing to search for the treatment, so we get stuff like

    des <- design(z ~ ...)
    mod <- lmitt(y ~ z, design = des)

    producing the formula y ~ z:adopters(). If there's no discrepancy in the treatment variable, this is fine (since z == adopters()), but if there is some issue with the treatment variable, weird things can happen.

    Do we want to instead replace "z" with "adopters()"? This would also handle scenarios where the user deliberately uses z, e.g. lmitt(y ~ x:z + g). (See the sketch below for the kind of check this would involve.)
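As a concrete illustration of the kind of check item 2 contemplates (a sketch only, not propertee's implementation; check_for_treatment and its treatment_name argument are hypothetical):

## Hypothetical check: warn if the design's treatment variable appears on
## the RHS of the lmitt() formula (substituting adopters() for it would be
## the stricter alternative).
check_for_treatment <- function(fmla, treatment_name) {
  rhs_vars <- all.vars(fmla[[3]])        # variables appearing on the RHS
  if (treatment_name %in% rhs_vars) {
    warning("treatment variable `", treatment_name,
            "` appears on the RHS; consider adopters() instead")
  }
  invisible(fmla)
}

check_for_treatment(y ~ x:z + g, treatment_name = "z")  # warns here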

markmfredrickson commented 2 years ago

You raise some good issues. Here are some examples of how lm() (and probably model.frame()/model.matrix() under the hood) handles a few of these variations. I will mainly note that 1:adopters() is actually an intercept-only model, not a treatment effect without an intercept.

dd <- data.frame(y = rnorm(100), z = rbinom(100, size = 1, prob = 0.3))
> lm(y ~ z, data = dd)

Call:
lm(formula = y ~ z, data = dd)

Coefficients:
(Intercept)            z  
   -0.01024      0.17423  

> lm(y ~ 1, data = dd)

Call:
lm(formula = y ~ 1, data = dd)

Coefficients:
(Intercept)  
    0.03854  

> lm(y ~ 1:z, data = dd)

Call:
lm(formula = y ~ 1:z, data = dd)

Coefficients:
(Intercept)  
    0.03854  

> lm(y ~ z:1, data = dd)

Call:
lm(formula = y ~ z:1, data = dd)

Coefficients:
(Intercept)  
    0.03854  

> lm(y ~ z + 0, data = dd)

Call:
lm(formula = y ~ z + 0, data = dd)

Coefficients:
    z  
0.164  

> lm(y ~ z - 1, data = dd)

Call:
lm(formula = y ~ z - 1, data = dd)

Coefficients:
    z  
0.164  

> lm(y ~ 0:z, data = dd)

Call:
lm(formula = y ~ 0:z, data = dd)

No coefficients

Would it be less difficult if we also added a main effect for adopters()? I think we shied away from this because it would make y ~ subgroup more difficult, but y ~ subgroup1 + subgroup2 is already tricky (and might require something like the glht() function from the multcomp package to get appropriate tests).

It seems like y ~ z + z:(original RHS) would give the expected result most of the time.

> lm(y ~ z + z:1, data = dd)

Call:
lm(formula = y ~ z + z:1, data = dd)

Coefficients:
(Intercept)            z  
   -0.01024      0.17423  

> lm(y ~ z + z:0, data = dd)

Call:
lm(formula = y ~ z + z:0, data = dd)

Coefficients:
    z  
0.164  

benthestatistician commented 2 years ago

My $0.02 on the implementation questions in Josh's first comment below the top of this thread:

  1. a) It seems OK to me to interpret y ~ 1 just as we would y ~ adopters(). b) The ideal interpretation of y ~ 0 should flow from what we decide about that main case, i.e. y ~ adopters(). Suppose we decide to interpret this as we would an lm() call with a factor treatment variable whose contrasts are set to treatment contrasts (the default). (That is, adding a main effect for treatment, as I take Mark to be suggesting we do.) Then it makes sense to have y ~ 0 be interpreted as y ~ adopters() + 0 would be. (See code example just below.)
  2. Re what to do when user puts intervention variable on RHS, I'm inclined to say do nothing or throw a warning.

Code example of how the +0 gets interpreted when it modifies a factor variable as opposed to a numeric, building on Mark's:

dd <- data.frame(y = rnorm(100), z = rbinom(100, size = 1, prob = 0.3))
> lm(y~as.factor(z)+0, data=dd)

Call:
lm(formula = y ~ as.factor(z) + 0, data = dd)

Coefficients:
as.factor(z)0  as.factor(z)1  
       0.0940        -0.0657  

whereas a numeric z would have given, as Mark showed,

> lm(y~z+0, data=dd)

Call:
lm(formula = y ~ z + 0, data = dd)

Coefficients:
      z  
-0.0657  

Looking at how subgroups affect things, I like the output we get from

a. having the treatment assignment variable go to lm() as a factor;
b. taking off the intercept by default, making +0 a no-op, perhaps with an option to include it using a +1;
c. having a single term sg on the RHS of the lmitt() formula go to lm() as a formula with RHS z:sg;
d. having multiple terms sg1, sg2 on the RHS of the lmitt() formula go to separate calls of lm() (or something functionally equivalent), with RHSs sg1, sg2, later to have their artifacts stitched together into a single object (of type c("lmitt", "lm") if we determine that to be feasible, or something else if not). (A rough sketch of this appears further below.)

dd1 <- data.frame(y = rnorm(100), 
                      zfac = factor(rbinom(100, size = 1, prob = 0.3)),
                      sg=factor(sample(rep_len(letters[1:4], 100)))
                      )
lm(y ~ zfac:sg+0, data=dd1)
##' 
##' Call:
##' lm(formula = y ~ zfac:sg + 0, data = dd1)
##' 
##' Coefficients:
##' zfac0:sga  zfac1:sga  zfac0:sgb  zfac1:sgb  zfac0:sgc  zfac1:sgc  zfac0:sgd  
##'   -0.0927    -0.4591    -0.0784     0.2910    -0.2522    -0.5400     0.4486  
##' zfac1:sgd  
##'    0.2906  

lm(y ~ zfac:sg, data=dd1)
##' 
##' Call:
##' lm(formula = y ~ zfac:sg, data = dd1)
##' 
##' Coefficients:
##' (Intercept)    zfac0:sga    zfac1:sga    zfac0:sgb    zfac1:sgb    zfac0:sgc  
##'    0.290564    -0.383262    -0.749661    -0.368982     0.000453    -0.542767  
##'   zfac1:sgc    zfac0:sgd    zfac1:sgd  
##'   -0.830609     0.157993           NA  

lm(y ~ zfac +0 , data=dd1)
##' 
##' Call:
##' lm(formula = y ~ zfac + 0, data = dd1)
##' 
##' Coefficients:
##'   zfac0    zfac1  
##' -0.0252  -0.0610  
lm(y ~ zfac, data=dd1)
##' 
##' Call:
##' lm(formula = y ~ zfac, data = dd1)
##' 
##' Coefficients:
##' (Intercept)        zfac1  
##'     -0.0252      -0.0358  

What I like about the +0 is that it gets us away from having to report the Intercept. Implicitly it also gets us away from having to report covariances with the intercept, something I expect to be a big plus for reporting of design-based standard errors.
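Item d above isn't covered by the snippets; here is a rough sketch of what "separate lm() calls stitched together" could look like. This is speculative, and sg1/sg2 are hypothetical subgrouping columns tacked onto dd1:

## Fit one lm() per subgrouping variable, then stitch the coefficient
## vectors together into a single result.
dd1$sg1 <- dd1$sg
dd1$sg2 <- factor(sample(rep_len(c("u", "v"), 100)))
fits <- list(lm(y ~ zfac:sg1 + 0, data = dd1),
             lm(y ~ zfac:sg2 + 0, data = dd1))
unlist(lapply(fits, coef))  # named vector: one block of zfac:sg coefficients per fit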

Additional comments:

i. Leaving a number of issues untouched here, partly for time but mostly b/c I want to think them through.

ii. Ultimately I'd prefer that lmitt() not present the user with summaries of the model using the string adopters to label treatment coefficients. If feasible, it would be nice to present coefficients labeled only by the level of the treatment: i.e. if the treatment assignment variable is "assmt" with levels "ctl", "lo", "hi", then y ~ 1 leads to 2 coeff's labeled "lo" and "hi", not "adopters():lo" and "adopters():hi" (nor "assmt:lo" and "assmt:hi"). (If it's necessary that the coefficient name begin with a character rather than a number -- is it? -- then for assignment variables given as numerics perhaps we could prepend a string such as "a" or "A:", where the "a" or "A" stands for Assignment.) (See the small relabeling illustration below.)
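A tiny, hypothetical illustration of the relabeling idea in (ii); the coefficient names and values here are made up, and this is not how lmitt() currently labels anything:

## Strip an "adopters()" prefix from coefficient names so only the
## assignment level remains.
cf <- c("adopters()lo" = 0.21, "adopters()hi" = 0.47)  # made-up values
names(cf) <- sub("^adopters\\(\\)", "", names(cf))
cf
##   lo   hi
## 0.21 0.47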

benthestatistician commented 2 years ago

(Moving an important sub-thread of this discussion over to #59...)

benthestatistician commented 2 years ago

Resolved just now, with @josherrickson:

  1. Add an adopters() synonym, assigned() (more meaningful in cross-sectional analysis contexts).
  2. Add an additional synonym a.(), after checking for trouble w/ potential generic functions named a(). (If a.() doesn't work, we'll consider a_().)
  3. If we're OK with a.(), then we might also allow z.(). (If we wind up with a_() or similar, then let's definitely have a parallel form involving "z".)

With these changes in place, we may decide to have our default recommendation for requesting a main effects-only analysis with lmitt.formula() be y ~ assigned() or y ~ a.(), rather than y ~ 1 or y ~ 0. The assigned()/a.() form makes the role of treatment assignment (as opposed, say, to treatment received) explicit, and makes somewhat more explicit the fact that it's being provided by the design.
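For concreteness, one way such synonyms could be wired up is simply as wrappers that forward to adopters(); this is a sketch only, and the package's actual definitions may differ:

## Sketch: each synonym just passes its arguments along to adopters().
assigned <- function(...) adopters(...)
a.       <- function(...) adopters(...)
z.       <- function(...) adopters(...)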

(JE, I heard your suggestion of a new issue, but thought the matter closely enough tied to this issue that we could group it with this one. If you see it as a new issue after reviewing in this context, feel free to start one up. And while you're at it, feel free to close out this one, if that seems appropriate.)

josherrickson commented 2 years ago

Do we want to ditch adopters() and stick with assigned() as the canonical version?

benthestatistician commented 2 years ago

I think my pref would be to keep adopters(), as it's the natural term for a certain context involving panel data, but potentially promote assigned() to the primary/canonical name for the function.

I'd be interested to hear any of @markmfredrickson's thoughts about these nomenclature issues, if he's in a position to look into the matter. Mark, the idea is that we'd call for main effects using formulas like y ~ assigned() (or perhaps y ~ a.()). Absent further fiddling on our part, the names of the coefficients of interest would then also contain the assigned string (see this comment on #59 for the current proposal on what we'd be returning in cases with a subgrouping variable).

josherrickson commented 2 years ago

I've made assigned() the "default" (mostly just through documentation), with aliases adopters(), a.(), and z.().

josherrickson commented 2 years ago

I think this issue has been mostly resolved; we now form the : interaction between the treatment and any variables in the formula. Some of the subtopics in this issue are now found elsewhere, e.g. #73, #59, #69.