Closed josherrickson closed 2 years ago
How should `1` and `0` be handled in the formula? Does `y ~ 1` get replaced with `y ~ 1:adopters()` (thus forcing an intercept of 0) or with `y ~ adopters()`? What if the user explicitly includes the `1`, as in `y ~ 1 + x`: does this become `y ~ 1:adopters() + x:adopters()`, `y ~ 1 + x:adopters()`, or `y ~ adopters() + x:adopters()`? What about `0`? Is there any special case when the user passes something like `y ~ 0 + g`?

How strict should we be about not allowing the treatment variable in the RHS? Should this produce an error or a warning? Right now we do nothing to search for the treatment, so we get stuff like
des <- design(z ~ ...)
mod <- lmitt(y ~ z, design = des)

producing the formula `y ~ z:adopters()`. If there's no discrepancy in the treatment variable, this is fine (since `z == adopters()`), but if there is some issue with the treatment variable, weird things can happen. Do we want to instead replace `z` with `adopters()`? This would also handle scenarios where the user deliberately uses `z`, e.g. `lmitt(y ~ x:z + g)`.
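One way to do that replacement is formula substitution with base R. This is only a sketch: `replace_treatment()` is a hypothetical helper, not anything in the package.

```r
# Sketch only: swap the design's treatment variable for adopters() in a
# user-supplied formula via the base R do.call(substitute, ...) idiom.
# replace_treatment() is a hypothetical name, not the package's API.
replace_treatment <- function(fmla, treatment_name) {
  subs <- stats::setNames(list(quote(adopters())), treatment_name)
  as.formula(do.call(substitute, list(fmla, subs)), env = environment(fmla))
}

replace_treatment(y ~ x:z + g, "z")
#> y ~ x:adopters() + g
```

Because the substitution works on the formula's parse tree rather than on term labels, it also catches deliberate uses of `z` inside interactions like `x:z`.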
You raise some good issues. Here are some examples of how `lm` (and probably `model.frame`/`model.matrix` under the hood) handles a few of these variations. I will mainly note that `1:adopters()` actually is an intercept-only model, not a treatment effect without an intercept.
dd <- data.frame(y = rnorm(100), z = rbinom(100, size = 1, prob = 0.3))
> lm(y ~ z, data = dd)
Call:
lm(formula = y ~ z, data = dd)
Coefficients:
(Intercept) z
-0.01024 0.17423
> lm(y ~ 1, data = dd)
Call:
lm(formula = y ~ 1, data = dd)
Coefficients:
(Intercept)
0.03854
> lm(y ~ 1:z, data = dd)
Call:
lm(formula = y ~ 1:z, data = dd)
Coefficients:
(Intercept)
0.03854
> lm(y ~ z:1, data = dd)
Call:
lm(formula = y ~ z:1, data = dd)
Coefficients:
(Intercept)
0.03854
> lm(y ~ z + 0, data = dd)
Call:
lm(formula = y ~ z + 0, data = dd)
Coefficients:
z
0.164
> lm(y ~ z - 1, data = dd)
Call:
lm(formula = y ~ z - 1, data = dd)
Coefficients:
z
0.164
> lm(y ~ 0:z, data = dd)
Call:
lm(formula = y ~ 0:z, data = dd)
No coefficients
Would it be less difficult if we also added a main effect for `adopters()`? I think we shied away from this because it would make `y ~ subgroup` more difficult, but `y ~ subgroup1 + subgroup2` is already tricky (and might require something like the `glht` function from the multcomp library to get appropriate tests). It seems like `y ~ z + z:(original RHS)` would produce the expected result most of the time.
> lm(y ~ z + z:1, data = dd)
Call:
lm(formula = y ~ z + z:1, data = dd)
Coefficients:
(Intercept) z
-0.01024 0.17423
> lm(y ~ z + z:0, data = dd)
Call:
lm(formula = y ~ z + z:0, data = dd)
Coefficients:
z
0.164
My $0.02 on the implementation questions in Josh's first comment below the top of this thread:

a) We'd treat `y ~ 1` just as we would `y ~ adopters()`.

b) The ideal interpretation of `y ~ 0` should flow from what we decide about that main case, i.e. `y ~ adopters()`. Suppose that we decide to interpret this as in an `lm()` call with a factor treatment variable that has its contrasts set to treatment contrasts (the default). (Adding a main effect for treatment, as I take Mark to be suggesting we do.) Then it makes sense to have `y ~ 0` be interpreted as `y ~ adopters() + 0` would. (See code example just below.)

Code example of how the `+0` gets interpreted when it modifies a factor variable as opposed to a numeric, building on Mark's:
dd <- data.frame(y = rnorm(100), z = rbinom(100, size = 1, prob = 0.3))
> lm(y~as.factor(z)+0, data=dd)
Call:
lm(formula = y ~ as.factor(z) + 0, data = dd)
Coefficients:
as.factor(z)0 as.factor(z)1
0.0940 -0.0657
whereas a numeric `z` would have given, as Mark showed,
> lm(y~z+0, data=dd)
Call:
lm(formula = y ~ z + 0, data = dd)
Coefficients:
z
-0.0657
Looking at how subgroups affect things, I like the output we get from:

a. having the treatment assignment variable go to `lm()` as a factor;

b. taking off the intercept by default, making `+0` a no-op, perhaps with an option to include it using a `+1`;

c. having a single term `sg` on the RHS of the `lmitt()` formula go to `lm()` as a formula with RHS `z:sg`;

d. having multiple terms `sg1`, `sg2` on the RHS of the `lmitt()` formula go to separate calls of `lm()` (or something functionally equivalent), with RHSs `sg1`, `sg2`, later to have their artifacts stitched together into a single object (of type `c("lmitt", "lm")` if we determine that to be feasible, or something else if not).
dd1 <- data.frame(y = rnorm(100),
zfac = factor(rbinom(100, size = 1, prob = 0.3)),
sg=factor(sample(rep_len(letters[1:4], 100)))
)
lm(y ~ zfac:sg+0, data=dd1)
##'
##' Call:
##' lm(formula = y ~ zfac:sg + 0, data = dd1)
##'
##' Coefficients:
##' zfac0:sga zfac1:sga zfac0:sgb zfac1:sgb zfac0:sgc zfac1:sgc zfac0:sgd
##' -0.0927 -0.4591 -0.0784 0.2910 -0.2522 -0.5400 0.4486
##' zfac1:sgd
##' 0.2906
lm(y ~ zfac:sg, data=dd1)
##'
##' Call:
##' lm(formula = y ~ zfac:sg, data = dd1)
##'
##' Coefficients:
##' (Intercept) zfac0:sga zfac1:sga zfac0:sgb zfac1:sgb zfac0:sgc
##' 0.290564 -0.383262 -0.749661 -0.368982 0.000453 -0.542767
##' zfac1:sgc zfac0:sgd zfac1:sgd
##' -0.830609 0.157993 NA
lm(y ~ zfac +0 , data=dd1)
##'
##' Call:
##' lm(formula = y ~ zfac + 0, data = dd1)
##'
##' Coefficients:
##' zfac0 zfac1
##' -0.0252 -0.0610
lm(y ~ zfac, data=dd1)
##'
##' Call:
##' lm(formula = y ~ zfac, data = dd1)
##'
##' Coefficients:
##' (Intercept) zfac1
##' -0.0252 -0.0358
What I like about the `+0` is that it gets us away from having to report the intercept. Implicitly it also gets us away from having to report covariances with the intercept, something I expect to be a big plus for reporting of design-based standard errors.
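Point (d) could be sketched as separate `lm()` fits collected into a list. Everything here (`fit_per_subgroup`, its arguments) is illustrative only, not the package's actual machinery:

```r
# Hypothetical sketch of (d): one lm() per subgroup term, each with RHS
# z:sg and no intercept, collected in a list for later stitching into a
# single object.
fit_per_subgroup <- function(y_name, z_name, sg_names, data) {
  fits <- lapply(sg_names, function(sg) {
    fmla <- stats::reformulate(sprintf("%s:%s", z_name, sg),
                               response = y_name, intercept = FALSE)
    stats::lm(fmla, data = data)
  })
  stats::setNames(fits, sg_names)
}

dd1 <- data.frame(y = rnorm(100),
                  zfac = factor(rbinom(100, size = 1, prob = 0.3)),
                  sg1 = factor(sample(rep_len(letters[1:2], 100))),
                  sg2 = factor(sample(rep_len(letters[1:3], 100))))
fits <- fit_per_subgroup("y", "zfac", c("sg1", "sg2"), dd1)
names(coef(fits$sg1))
#> [1] "zfac0:sg1a" "zfac1:sg1a" "zfac0:sg1b" "zfac1:sg1b"
```

Fitting per subgroup keeps each coefficient table in the one-effect-per-cell form shown above, at the cost of having to reconcile the separate fits afterward.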
Additional comments:
i. I'm leaving a number of issues untouched here, partly for time but mostly because I want to think them through.

ii. Ultimately I'd prefer that `lmitt()` not present the user with summaries of the model using the string `adopters` to label treatment coefficients. If feasible, it would be nice to present coefficients labeled only by the level of the treatment: i.e., if the treatment assignment variable is "assmt" with levels "ctl", "lo", "hi", then `y ~ 1` leads to two coefficients labeled "lo" and "hi", not "adopters():lo" and "adopters():hi" (nor "assmt:lo" and "assmt:hi"). (If it's necessary that the coefficient name begin with a character rather than a number -- is it? -- then for assignment variables given as numerics perhaps we could prepend a string such as `alev` or `A:`, where the "a" or "A" stands for Assignment.)
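On the side question in (ii): coefficient names in an `lm` fit are plain character strings, so relabeling to bare treatment levels (even digit-leading ones) is mechanically easy. A sketch with made-up data, not package code:

```r
# Illustration only: coefficient names are plain strings and can be
# rewritten after the fact; they need not be syntactic R names.
dd <- data.frame(y = rnorm(99),
                 assmt = factor(rep(c("ctl", "lo", "hi"), length.out = 99),
                                levels = c("ctl", "lo", "hi")))
fit <- lm(y ~ assmt, data = dd)
names(coef(fit))
#> [1] "(Intercept)" "assmtlo"     "assmthi"

# Strip the variable-name prefix, leaving only the levels.
names(fit$coefficients) <- sub("^assmt", "", names(fit$coefficients))
names(coef(fit))
#> [1] "(Intercept)" "lo"          "hi"
```

So the "must begin with a character" worry applies to syntactic names (e.g. for `data.frame` columns), not to coefficient labels themselves.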
(Moving an important sub-thread of this discussion over to #59...)
Resolved just now, with @josherrickson:

- `adopters()` gets the synonym `assigned()` (more meaningful in cross-sectional analysis contexts).
- Also `a.()`, after checking for trouble w/ potential generic functions `a()`. (If `a.()` doesn't work we'll consider `a_()`.)
- If we have `a.()` then we might also allow `z.()`. (If we wind up with `a_()` or similar, then let's definitely have a parallel form involving "z".)

With these changes in place we may decide to have our default recommendation for requesting main-effects-only analysis with `lmitt.formula()` be `y ~ assigned()` or `y ~ a.()`, rather than `y ~ 1` or `y ~ 0`. The `assigned()`/`a.()` makes the role of treatment assignment (as opposed, say, to treatment received) explicit, and makes somewhat more explicit the fact that it's being provided by the design.
(JE, I heard your suggestion of a new issue, but thought the matter closely enough tied to this issue that we could group it with this one. If you see it as a new issue after reviewing in this context, feel free to start one up. And while you're at it, feel free to close out this one, if that seems appropriate.)
Do we want to ditch `adopters()` and stick with `assigned()` as the canonical version?

I think my preference would be to keep `adopters()`, as it's the natural term for a certain context involving panel data, but potentially promote `assigned()` to the primary/canonical name for the function.
I'd be interested to hear any of @markmfredrickson's thoughts about these nomenclature issues, if he's in a position to look into the matter. Mark, the idea is that we'd call for main effects using formulas like `y ~ assigned()` (or perhaps `y ~ a.()`). Absent further fiddling on our part, the names of coefficients of interest would then also contain the `assigned` string (see this comment to #59 for the current proposal on what we'd be returning in cases with a subgrouping variable).

I've made `assigned` the "default" (mostly just through documentation) with aliases for `adopters`, `a.`, and `z.`.
I think this issue has been mostly resolved; we now do the `:` interaction with any variables in the formula. Some of the subtopics in this issue are now found elsewhere, e.g. #73, #59, #69.
Following up on #40, in that thread we settled on the following:

- If the user includes `adopters()` in the RHS of the formula for `lmitt.formula()`, do not amend it.
- If the user does not include `adopters()`, replace all variables `x` on the RHS with `x:adopters()`.

E.g. if the user calls:

then the model estimated will be

This thread is to discuss whether this is the desired approach and any issues surrounding this modification of the formula.
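The settled rule could be sketched roughly as follows. `add_adopters()` is an illustrative name only; the real implementation differs, and this sketch deliberately leaves out the `y ~ 1` / `y ~ 0` special cases debated in this thread.

```r
# Rough sketch of the rewrite rule: leave formulas that already contain
# adopters() alone; otherwise interact every RHS term with adopters().
# Does not handle y ~ 1 / y ~ 0, which need the special-casing above.
add_adopters <- function(fmla) {
  tl <- attr(stats::terms(fmla), "term.labels")
  if (any(grepl("adopters()", tl, fixed = TRUE))) return(fmla)
  stats::reformulate(paste0(tl, ":adopters()"), response = fmla[[2]])
}

add_adopters(y ~ x + g)           # y ~ x:adopters() + g:adopters()
add_adopters(y ~ adopters() + x)  # left unchanged
```

Working from `term.labels` (rather than raw symbols) means interaction terms like `x:g` also get wrapped as a unit, i.e. `x:g:adopters()`.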