Open benthestatistician opened 2 years ago
By default, given a factor RHS variable, `lm()` and friends invoke `contr.treatment`, singling out a baseline level and then contrasting each of the other levels with it. This is good behavior for us. Should we offer mechanisms for identifying the control condition, they might use the `base=` argument of `contr.treatment()`.
The default for ordered factors is different (see `getOption("contrasts")`), and probably less desirable for us, if only because it's difficult to make sense of. Perhaps when `lmitt()` invokes `lm()` it should pass down our own `contrasts=`. For now we can use `contr.treatment` for both ordered and unordered factors, even if ultimately we choose to introduce a different behaviour for ordered factors. If so, we might also give `lmitt.formula()` its own `contrasts=` argument, to let users control this if they so choose.
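As a quick base-R illustration (not propertee code; data and names here are made up), passing a `contrasts=` list down to `lm()` is enough to override the polynomial default for ordered factors:

```r
# Sketch: how lmitt() might force treatment contrasts for an ordered factor
# by passing contrasts= down to lm().
set.seed(1)
df <- data.frame(
  y = rnorm(12),
  g = ordered(rep(c("lo", "mid", "hi"), 4), levels = c("lo", "mid", "hi"))
)
getOption("contrasts")  # typically c("contr.treatment", "contr.poly")

fit_default <- lm(y ~ g, data = df)  # ordered factor -> polynomial contrasts
fit_treat   <- lm(y ~ g, data = df, contrasts = list(g = "contr.treatment"))

names(coef(fit_default))  # "(Intercept)" "g.L" "g.Q" -- hard to interpret
names(coef(fit_treat))    # "(Intercept)" "gmid" "ghi" -- each level vs. base "lo"
```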
If we need to associate a default dichotomy with Designs having factor treatment variables, I think it should be the distinction between the base level and all other levels.
Just a note that presently, after our change to NOT remove the intercept from `lmitt.formula`, if you pass in a treatment as a factor (currently only works for strings, see #126, specifically this comment), you get this:
```r
> simdata$o <- as.character(simdata$o)
> des <- rct_design(o ~ cluster(cid1, cid2), data = simdata)
> lmitt(y ~ 1, design = des, data = simdata)
      o.2       o.3       o.4
0.5216413 0.2173735 0.2239143
> lmitt(y ~ 1, design = des, data = simdata)$coeff
(Intercept)         o.2         o.3         o.4
 -0.2917833   0.5216413   0.2173735   0.2239143
```
We've disabled factor treatment variables in `lmitt` (#126) for now, so my previous comment is no longer relevant.
To investigate: can the `multcomp::mmm()` mechanism be used as a workaround to estimate separate treatment effects for factor treatment variables that represent multiple treatment groups and a single control? Relates to this comment on #79. The "multiple comparisons" example in the following poster indicates what I have in mind: bbh_acic24_tikz.pdf
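A rough sketch of the mechanism in question, assuming `multcomp` is installed (hypothetical data and hand-rolled `z` indicators standing in for our dichotomies; not propertee code): fit one regression per treatment level against the shared control, then stack them with `mmm()` for simultaneous inference via `glht()`.

```r
library(multcomp)  # provides mmm() and mlf()

set.seed(2)
dat <- data.frame(y = rnorm(120), o = factor(sample(1:4, 120, replace = TRUE)))

# One regression per non-control level, each contrasting that level with o == 1:
fits <- lapply(c("2", "3", "4"), function(lev) {
  sub <- droplevels(subset(dat, o %in% c("1", lev)))
  sub$z <- as.integer(sub$o == lev)  # stand-in for a dichotomy
  lm(y ~ z, data = sub)
})
names(fits) <- paste0("o", 2:4)

# mmm() stacks the parallel fits; mlf() recycles "z = 0" across them,
# giving simultaneous tests of the three treatment-vs-control effects:
joint <- glht(do.call(mmm, fits), mlf("z = 0"))
summary(joint)
```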
The current `estfun_DA` spec calls for us in some circumstances to rename coefficients currently labeled "(Intercept)". The `multcomp::mmm()`-based handling of factor treatment variables that @jwasserman2 and I have been discussing/alluding to here would be helped by doing a bit more of that. When `lmitt.f()` has used a `dichotomy`, how about returning coefficients with names specifying the two sides of the dichotomy?
Currently we have

```r
data(simdata)
spec <- rct_spec(o ~ unit_of_assignment(uoa1, uoa2), data = simdata)
lmitt(y ~ 1, data = simdata, spec = spec, weights = "ate", dichotomy = o >= 3 ~ o == 1) |>
  coef()
## (Intercept)          o.
##      -0.291       0.222
```
I'm suggesting this output instead be

```
  o==1  o >= 3
-0.291   0.222
```

or maybe

```
[o==1] [o >= 3]
-0.291    0.222
```
Note that the relevant character strings are retrievable from the dichotomization formula through maneuvers along the following lines:

```r
fmla1 <- as.formula("o >= 2 ~ o == 1")
foo <- 1:3
names(foo) <- as.character(unlist(as.list(fmla1)))
foo[2:3]
## o >= 2 o == 1
##      2      3
```
@josherrickson
How would it handle `dichotomy = o >= 3 ~ .`?
Strictly from a usability perspective, I don't like this. It makes it unnecessarily hard to access entries in the vector. What if we leave the names of the actual coefficient vector intact, but update `show.teeMod` to include this information in the printed output?
Revising my proposal to affect only the intercept, so that the coefficient vector in my example would be something along the lines of

```
## [o==1]     o.
## -0.291  0.222
```

instead of

```
## (Intercept)     o.
##      -0.291  0.222
```
When we use `multcomp::mmm()` to concatenate coefficients from the three parallel regressions with `dichotomy = o==4 ~ o<=3`, `dichotomy = o>=3 ~ o<=2` and `dichotomy = o>=2 ~ o==1` --- the use case bringing me to post about this idea under the banner of this issue --- we can lean on `mmm()`'s naming scheme for marking the distinction between the three regressions' distinct "`o.`" effect coefficients.
To briefly address JE's question, there would have to be a separate solution for examples like `dichotomy = o >= 3 ~ .`. Perhaps something along the lines of

```
## [!(o >= 3)]     o.
##      -0.291  0.222
```
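A base-R sketch of how such a label might be derived when the right-hand side of the dichotomy is `.` (names and logic here are illustrative, not propertee internals):

```r
# Hypothetical: when the control side is ".", label the control group as the
# negation of the treatment side.
dichotomy <- o >= 3 ~ .
rhs <- dichotomy[[3]]
ctrl_label <- if (identical(rhs, quote(.))) {
  paste0("!(", deparse(dichotomy[[2]]), ")")  # negate the LHS: "!(o >= 3)"
} else {
  deparse(rhs)
}
paste0("[", ctrl_label, "]")
## "[!(o >= 3)]"
```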
When we're just looking at the output of a single dichotomy and not trying to concatenate them, this does make the first entry harder to pick out than would "(Intercept)" or anything else that doesn't vary. But it brings the significant benefit of a more informative name there, and the revised proposal avoids imposing the same cost on the effect parameter.
I'm actually interested in doing a bit more than just renaming the intercept coefficient. I'd like to:

- have the first coefficient report the control-group mean of `cov_adj()` predictions, if such predictions were used, or `NA_real_` if no such predictions were used;
- handle `lmitt.formula(absorb=T)` fits as well: in these cases compute the control-group mean of the y's directly, using the same one-way fixed effects weights as would have been used to figure the weighted mean of `cov_adj` predictions.

(Part of the reason for my comment here and just above is that these changes would make "(Intercept)" an inappropriate and misleading coefficient label.)
As a result of this we'd need labels for 2+ new coefficients, not just one. In this context it may be more important that the coefficient labels clearly indicate the distinction between the average of actual y's and the average of predictions of y than that they indicate the composition of the control group. But I like having the names mark that they're averages of controls. Here are two naming patterns that would be consistent with these objectives:

a.

```
[y|!a.] [cov_adj()|!a.]     o.
 -0.291              NA  0.222
```

b.

```
[y|!(o>=3)] [cov_adj()|!(o>=3)]     o.
     -0.291                  NA  0.222
```
@benthestatistician I don't think this most recent comment is appropriate here. There's a clear task for this issue and #79, and we've discussed a serviceable solution: we'll handle factor treatment variables (this issue) by stitching together `teeMod`s fit with different dichotomies using the `mmm()` function (issue #79). Discussing the names of coefficients returned by `lmitt()` and the intercepts that are returned only muddies what the two issues here are about. I propose keeping discussion of intercepts to #187, #188, or a new issue altogether.
There are plusses and minuses, I think, to those other places where musings about coefficient names might have been put, just as this one has its drawbacks. Weighing in favor of this location is the fact that the intercept coefficients form a piece of the solution to the factor-treatment-variable problem that we seem to be converging on in this discussion, and an aspect where I think we have the potential to add value over what, say, plain-vanilla `lm()` would provide. But I agree with you that the intercept issue can stand on its own, so I've opened another issue (#195) as suggested by @jwasserman2.
(Josh W I'm happy to hear you say the task and solution for this issue are clear, because I wasn't sure any of that had converged. In my understanding we'd reached a clear enough vision of something that might work to handle factor treatment variables, with a few extra steps on the part of the user, and you had added it to your list to investigate whether the envisioned solution really was likely to be workable. Am I correct to hear some confidence about feasibility in your comment? If so -- and you needn't have 100% confidence, of course -- then I'll invite you to revise the header of this issue to reflect remaining steps, and also to remove the "Discussion" tag from the issue.)
@benthestatistician Yes, I was able to code up a solution for model-based variance estimation. As I've noted on PR #197, the design-based variance estimation routine isn't quite ready for it. I'll attach this issue and issue #79 to PR #197 and remove the "Discussion" tag.
We'd like to accommodate factor treatment variables. This issue head may be updated with to-do's, but for the moment it's a placeholder for discussion of what to do with these variables and how.