benbhansen-stats / propertee

Prognostic Regression Offsets with Propagation of ERrors, for Treatment Effect Estimation (IES R305D210029).
https://benbhansen-stats.github.io/propertee/
Other
2 stars 0 forks source link

`subset=` in `lmitt` #56

Closed josherrickson closed 2 years ago

josherrickson commented 2 years ago

Currently code for handling subset= argument in lmitt.formula is commented out, due to failing tests.

✖ | 1      17 | test.lmitt [0.1s]
Error (???): Design argument
Error in `model.frame.default(formula = y ~ z:adopters(), data = simdata,
    subset = simdata$dose < 300, drop.unused.levels = TRUE)`: variable lengths differ (found for 'adopters()')
Backtrace:
 1. stats::model.frame(...)
 2. stats::model.frame.default(...)

Looks like a sample size issue.

(Note: Haven't pushed these changes up quite yet.)

josherrickson commented 2 years ago

Nevermind - got myself mixed up.

For future reference - when calling lmitt.formula(), all arguments to lm() are accepted, along with design and dichotomy. While the *_design() functions take a subset argument, this is not the same as the subset argument to lm/lmitt, so users cannot use the shorthand of

lmitt(y ~ x, design = z ~ ..., subset = ...)

in place of

des <- rct_design(z ~ ..., subset = ...)
lmitt(y ~ x, design = des)

(Though, since dichotomy= is accepted in lmitt.formula() and passed to the Design creation, you can use that for subsetting if appropriate, e.g. RD scenarios.)

benthestatistician commented 2 years ago

Could you clarify the last comment, in light of test.lmitt.R including some lines that appear to be testing the use off subset= in lmitt.formula()?

I imagine that in those lines the subset= argument is being accepted (not ignored), but that it only modifies the lm object and not the Design that's created en route to creation of the lm. But perhaps it's something else.

josherrickson commented 2 years ago

Correct. For clarity sake, let's rename the argument in *_design to subset_design=. Then we have

des <- rct_design(z ~ ..., data = ..., subset_design = sub_a)
lmitt(y ~ x, design = des, data = ..., subset = sub_b)

Both subset_design and subset are accepted arguments; subset_design= affects only the creation of the design, whereas subset= is affecting on the lm in the typical way. My comment was that this won't work:

lmitt(y ~ x, design = des, data = ..., subset = sub_b,
      design = z ~ ..., subset_design = sub_a)

because of course both arguments really have the same name. We could discuss renaming the subset= argument to *_design but I don't think we want to do that.

(In case your request for clarity on the "last comment" refers to my statement about dichotomy, in some issue I made a comment about this but can't find it. But essentially there was some discussion about using subset= argument in rdd_design to restrict the design to around the discontinuity, I was pointing out that lmitt accepts and applies a dichotomy= argument so one could do:

rd_design(time ~ cluster(c) + forcing(f), data = mydata,
      dichotomy = time > 10 & time <= 12 ~ time > 12 & time < 14

which would be the same as

rd_design(time ~ cluster(c) + forcing(f), data = mydata,
      dichotomy = time <= 12 ~ time > 12,
      subset = time > 10 & time < 14

from the perspective of treatment assignment. Of course, in the former, clusters outside of the (10, 14) window would be included with NA binary treatment, whereas in the latter, only clusters within (10, 14) would appear. However in both scenarios, when merging treatment information to the analysis level, it would have the same impact.

Then, since dichotomy= is a valid argument to lmitt(), you could use the dichotomy to subset the design creation step, if the subsetting is defined based on treatment levels, as in RD.)

benthestatistician commented 2 years ago

Thanks! And sorry about the ambiguity re "last comment"; I had meant only to seek the first of the clarifications you gave. (I did find the second helpful as well!)

Let me open the door to another followup from @josherrickson, but without expecting it (in case you don't have anything to add). Suppose a user who's looking to declare their design and estimate effects within a single call to lmitt.formula(), as opposed to declaring in a *_design() call and then piping it into lmitt() or lm(). Suppose also that this user intends to do some subsetting within the same step. What situations can you envision where subset_design=time > 10 & time < 14, were we to support that argument, would give a different result than subset=time > 10 & time < 14 being passed to the lm() part of the call?

The only one occurring to me is when we're using ate() or itt(). If the subsetting call removes entire units of assignment, then applying it at the design creation stage changes the weights, whereas if you apply it at the lm estimation stage only then the weights are based off of the entire non-subsetted design, even though some of the units of assignment that contributed to it don't appear. (In this scenario I'd think the user would be well advised to create their design on a separate line of code, for clarity. So if it and others like it are the only case whether there's a difference, I don't see any cause for giving lmitt.formula() a subset.design= argument.)

josherrickson commented 2 years ago

In general, a user could potentially create the Design inside the lmitt call, then extract it for further use down the road, in which case they may want subset and subset_design to be different.

But overall, I agree with your last* pair of sentences, in most scenarios users are better served by creating the Design as a separate step. And I think we should market it that way.

Really design=formula version is only for simple cases I think. I was not advocating allowing both subset and subset_design in lmitt, I opened this issue originally because I was conflating the two and thought things weren't working as expected, but they were hence my quick closure of it.

* Your real last pair of sentences.

benthestatistician commented 2 years ago

I'm not sure I'm ready to go quite as far as "most scenarios" in your statement that

in most scenarios users are better served by creating the Design as a separate step

as there may be users and usage patterns for whom skipping that step works fine most or all of the time. But I'd certainly be on board with "many scenarios".