Open benthestatistician opened 2 years ago
By default, given a factor RHS variable, `lm()` and friends invoke `contr.treatment`, singling out a baseline level and then contrasting each of the other levels with it. This is good behavior for us. Should we offer mechanisms for identifying the control condition, they might use the `base=` argument of `contr.treatment()`.
The default for ordered factors is different (see `getOption("contrasts")`), and probably less desirable for us, if only because it's difficult to make sense of. Perhaps when `lmitt()` invokes `lm()` it should pass down our own `contrasts=`. For now we can use `contr.treatment` for both ordered and unordered factors, even if ultimately we choose to introduce a different behaviour for ordered factors. If so, we might also give `lmitt.formula()` its own `contrasts=` argument, to let users control this if they so choose.
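As a quick base-R illustration (not propertee code; data and names here are made up), passing a `contrasts=` list down to `lm()` is enough to override the polynomial default for ordered factors:

```r
# Sketch: how lmitt() might force treatment contrasts for an ordered factor
# by passing contrasts= down to lm().
set.seed(1)
df <- data.frame(
  y = rnorm(12),
  g = ordered(rep(c("lo", "mid", "hi"), 4), levels = c("lo", "mid", "hi"))
)
getOption("contrasts")  # typically c("contr.treatment", "contr.poly")

fit_default <- lm(y ~ g, data = df)  # ordered factor -> polynomial contrasts
fit_treat   <- lm(y ~ g, data = df, contrasts = list(g = "contr.treatment"))

names(coef(fit_default))  # "(Intercept)" "g.L" "g.Q" -- hard to interpret
names(coef(fit_treat))    # "(Intercept)" "gmid" "ghi" -- each level vs. base "lo"
```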
If we need to associate a default dichotomy with Designs having factor treatment variables, I think it should be the distinction between the base level and all other levels.
Just a note that presently, after our change to NOT remove the intercept from `lmitt.formula`, if you pass in a treatment as a factor (currently only works for strings, see #126, specifically this comment), you get this:
```r
> simdata$o <- as.character(simdata$o)
> des <- rct_design(o ~ cluster(cid1, cid2), data = simdata)
> lmitt(y ~ 1, design = des, data = simdata)
      o.2       o.3       o.4
0.5216413 0.2173735 0.2239143
> lmitt(y ~ 1, design = des, data = simdata)$coeff
(Intercept)         o.2         o.3         o.4
 -0.2917833   0.5216413   0.2173735   0.2239143
```
We've disabled factor treatment variables in `lmitt` (#126) for now, so my previous comment is no longer relevant.
To investigate: can the `multcomp::mmm()` mechanism be used as a workaround to estimate separate treatment effects for factor treatment variables that represent multiple treatment groups and a single control? Relates to this comment on #79. The "multiple comparisons" example in the following poster indicates what I have in mind: bbh_acic24_tikz.pdf
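A rough sketch of the mechanism in question, assuming `multcomp` is installed (hypothetical data and hand-rolled `z` indicators standing in for our dichotomies; not propertee code): fit one regression per treatment level against the shared control, then stack them with `mmm()` for simultaneous inference via `glht()`.

```r
library(multcomp)  # provides mmm() and mlf()

set.seed(2)
dat <- data.frame(y = rnorm(120), o = factor(sample(1:4, 120, replace = TRUE)))

# One regression per non-control level, each contrasting that level with o == 1:
fits <- lapply(c("2", "3", "4"), function(lev) {
  sub <- droplevels(subset(dat, o %in% c("1", lev)))
  sub$z <- as.integer(sub$o == lev)  # stand-in for a dichotomy
  lm(y ~ z, data = sub)
})
names(fits) <- paste0("o", 2:4)

# mmm() stacks the parallel fits; mlf() recycles "z = 0" across them,
# giving simultaneous tests of the three treatment-vs-control effects:
joint <- glht(do.call(mmm, fits), mlf("z = 0"))
summary(joint)
```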
The current `estfun_DA` spec calls for us in some circumstances to rename coefficients currently labeled "(Intercept)". The `multcomp::mmm()`-based handling of factor treatment variables that @jwasserman2 and I have been discussing/alluding to here would be helped by doing a bit more of that. When `lmitt.f()` has used a `dichotomy`, how about returning coefficients with names specifying the two sides of the dichotomy?
Currently we have

```r
data(simdata)
spec <- rct_spec(o ~ unit_of_assignment(uoa1, uoa2), data = simdata)
lmitt(y ~ 1, data = simdata, spec = spec, weights = "ate", dichotomy = o >= 3 ~ o == 1) |>
  coef()
## (Intercept)          o.
##      -0.291       0.222
```
I'm suggesting this output instead be

```
  o==1  o >= 3
-0.291   0.222
```

or maybe

```
[o==1] [o >= 3]
-0.291    0.222
```
Note that the relevant character strings are retrievable from the dichotomization formula through maneuvers along the following lines:

```r
fmla1 <- as.formula("o >= 2 ~ o == 1")
foo <- 1:3
names(foo) <- as.character(unlist(as.list(fmla1)))
foo[2:3]
## o >= 2 o == 1
##      2      3
```
@josherrickson
How would it handle `dichotomy = o >= 3 ~ .`?
Strictly from a usability perspective, I don't like this. It makes it unnecessarily hard to access entries in the vector. What if we leave the names of the actual coefficient vector intact, but update `show.teeMod` to include this information in the printed output?
Revising my proposal to affect only the intercept, so that the coefficient vector in my example would be something along the lines of

```
## [o==1]     o.
## -0.291  0.222
```

instead of

```
## (Intercept)     o.
##      -0.291  0.222
```
When we use `multcomp::mmm()` to concatenate coefficients from the three parallel regressions with `dichotomy = o==4 ~ o<=3`, `dichotomy = o>=3 ~ o<=2` and `dichotomy = o>=2 ~ o==1` --- the use case bringing me to post about this idea under the banner of this issue --- we can lean on `mmm()`'s naming scheme for marking the distinction between the three regressions' distinct "`o.`" effect coefficients.
To briefly address JE's question, there would have to be a separate solution for examples like `dichotomy = o >= 3 ~ .`. Perhaps something along the lines of

```
## [!(o >= 3)]     o.
##      -0.291  0.222
```
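A base-R sketch of how such a label might be derived when the right-hand side of the dichotomy is `.` (names and logic here are illustrative, not propertee internals):

```r
# Hypothetical: when the control side is ".", label the control group as the
# negation of the treatment side.
dichotomy <- o >= 3 ~ .
rhs <- dichotomy[[3]]
ctrl_label <- if (identical(rhs, quote(.))) {
  paste0("!(", deparse(dichotomy[[2]]), ")")  # negate the LHS: "!(o >= 3)"
} else {
  deparse(rhs)
}
paste0("[", ctrl_label, "]")
## "[!(o >= 3)]"
```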
When we're just looking at the output of a single dichotomy and not trying to concatenate them, this does make the first entry harder to pick out than would "(Intercept)" or anything else that doesn't vary. But it brings the significant benefit of a more informative name there, and the revised proposal avoids imposing the same cost on the effect parameter.
I'm actually interested in doing a bit more than just renaming the intercept coefficient. I'd like to:

- have the first coefficient report the control-group mean of `cov_adj()` predictions, if such predictions were used, or `NA_real_` if no such predictions were used;
- handle `lmitt.formula(absorb=T)` fits as well: in these cases compute the control-group mean of the y's directly, using the same one-way fixed effects weights as would have been used to figure the weighted mean of `cov_adj` predictions.

(Part of the reason for my comment here and just above is that these changes would make "(Intercept)" an inappropriate and misleading coefficient label.)
As a result of this we'd need labels for 2+ new coefficients, not just one. In this context it may be more important that the coefficient labels clearly indicate the distinction between the average of actual y's and the average of predictions of y than that they indicate the composition of the control group. But I like having the names mark that they're averages of controls. Here are two naming patterns that would be consistent with these objectives:

a.

```
[y|!a.] [cov_adj()|!a.]     o.
 -0.291              NA  0.222
```

b.

```
[y|!(o>=3)] [cov_adj()|!(o>=3)]     o.
     -0.291                  NA  0.222
```
@benthestatistician I don't think this most recent comment is appropriate here. There's a clear task for this issue and #79, and we've discussed a serviceable solution: we'll handle factor treatment variables (this issue) by stitching together `teeMod`s fit with different dichotomies using the `mmm()` function (issue #79). Discussing the names of coefficients returned by `lmitt()` and the intercepts that are returned only muddies what the two issues here are about. I propose keeping discussion of intercepts to #187, #188, or a new issue altogether.
There are plusses and minuses, I think, to those other places where musings about coefficient names might have been put, just as this one has its drawbacks. Weighing in favor of this location is the fact that the intercept coefficients form a piece of the solution to the factor-treatment-variable problem that we seem to be converging on in this discussion, and an aspect where I think we have the potential to add value over what, say, plain-vanilla `lm()` would provide. But I agree with you that the intercept issue can stand on its own, so I've opened another issue (#195) as suggested by @jwasserman2.
(Josh W I'm happy to hear you say the task and solution for this issue are clear, because I wasn't sure any of that had converged. In my understanding we'd reached a clear enough vision of something that might work to handle factor treatment variables, with a few extra steps on the part of the user, and you had added it to your list to investigate whether the envisioned solution really was likely to be workable. Am I correct to hear some confidence about feasibility in your comment? If so -- and you needn't have 100% confidence, of course -- then I'll invite you to revise the header of this issue to reflect remaining steps, and also to remove the "Discussion" tag from the issue.)
@benthestatistician Yes, I was able to code up a solution for model-based variance estimation. As I've noted on PR #197, the design-based variance estimation routine isn't quite ready for it. I'll attach this issue and issue #79 to PR #197 and remove the "Discussion" tag.
We'd like to accommodate factor treatment variables. This issue head may be updated with to-do's, but for the moment it's a placeholder for discussion of what to do with these variables and how.