Open benthestatistician opened 1 month ago
If there are no concerns about this, I'll ask that @josherrickson address this (and #184). First, though, let's hear if there's any discussion.
I'm in favor of this update. Since we have this logic here:
https://github.com/benbhansen-stats/propertee/blob/7d8fb2f9dbabdbe397ac53c17a98374dd6f0231a/R/lmitt.R#L297-L299
I think we may be able to just change the logic at the lines Ben mentioned to make `blocks` a vector of 1's (with length given by the number of rows of `data`) when there are no blocks in the design.
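As a rough sketch, the fallback described above might look like this (the function and variable names are illustrative, not the actual `lmitt.R` code):

```r
# Hypothetical fallback (names illustrative, not propertee internals):
# if the design supplies no blocks, use a single block spanning `data`
make_blocks <- function(blks, data) {
  if (is.null(blks) || length(blks) == 0) {
    blks <- rep(1L, nrow(data))  # constant block of 1's
  }
  blks
}

make_blocks(NULL, data.frame(y = 1:4))  # 1 1 1 1
```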
No objections, but is there any reason not to separate this functionality? Rather than have it be a side effect of calling `blocks()` in a somewhat non-intuitive way?
Yes, I think so. For users who understand "absorb" to mean absorbing one or more intercept terms into other independent variables, one should be able to absorb whether or not one has defined any blocks. These users would be confused by a stop with an error message about their design not having any blocks.
@josherrickson does it work for you to take this on, perhaps over the next week or two? (...if I was premature in removing the Discussion label, do continue to discuss.)
Yes, I can do this.
Another thought: instead of updating `lmitt`, should we update `Design` objects to always have blocks? If the user doesn't input one, we automatically add a column of 1's. We could even suppress reporting it when it's constant.
Sounds like the right approach to me!
I like the idea of focusing more on the `Design` object. Instead of adding blocks by default, what are your thoughts on just making the `blocks()` accessor return a column of 1's when there are no blocks, i.e., changing the code here:
https://github.com/benbhansen-stats/propertee/blob/17c85dad68a6387a42dc8c157bba0c515a2a8111/R/DesignAccessors.R#L391-L397
I think we could (and should) be using this in `lmitt.formula()` anyway.
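A standalone illustration of that accessor fallback (the real code lives in `R/DesignAccessors.R` and works off the `Design` object's slots, so treat this as a sketch only):

```r
# Sketch only: return the design's block column(s), or a constant
# column of 1's when none were specified
blocks_or_ones <- function(block_cols, n_units) {
  if (is.null(block_cols) || ncol(block_cols) == 0) {
    return(data.frame(blocks = rep(1L, n_units)))  # implicit single block
  }
  block_cols
}

blocks_or_ones(NULL, 3)  # data.frame with one constant column of 1's
```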
We need to be careful about how this affects weight creation given the logic we use (EDIT: actually, unless there are some unforeseen bugs, the weight calculations should still return the correct weights):
https://github.com/benbhansen-stats/propertee/blob/17c85dad68a6387a42dc8c157bba0c515a2a8111/R/weights.R#L155-L165
as well as any of @xinhew0708's variance estimation code.
I like this idea. It's a minimal touch and, assuming all tests pass, unlikely to cause issues. It still allows us to check whether there's an otherwise-defined block variable in the model.
I'll work on getting this implemented tomorrow or early next week.
How do we want this to interact with `dtable()`?

a. Ignore the column of 1's and continue to warn or error. This is the current output when asking for "blocks" in a block-less design:
```r
> des <- rct_design(z ~ cluster(uoa1, uoa2), data = simdata)
> dtable(des, "block")
< table of extent 0 >
Warning message:
In .get_xydat(x, design) : blocks not found in design
> dtable(des, "treatment", "block")
treatment
 0  1 
 6  4 
Warning message:
In .get_xydat(y, design) : blocks not found in design
```
b. Treat the blocks as 1s, as they come into the function.
If we choose b., I think we need to update everywhere (especially the documentation) to note that `blocks` aren't optional, in the sense that we insert a constant block if the design is missing one.
On the other hand, a. is easier and doesn't change the overall flow of things. It also appears, going by the test suite alone, to be the only place where special-casing is required with this change. (I haven't tested whether the functionality works as expected everywhere, but this is the only place where the change causes tests to fail.)
A third option: do we want to try and "hide" this change to blocks from the user? Perhaps add an argument to `blocks()` that returns the column of 1's if blocks aren't specified, and otherwise returns what it currently does? Then we can use that argument wherever needed internally, but users aren't affected unless they explicitly look for it.
I think I like this option the best.
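The third option could be sketched along these lines (the argument name `implicitBlocks` is hypothetical; the user-facing default stays as-is):

```r
# Hypothetical sketch of option three: internal callers opt in to the
# constant fallback via a new argument; default behavior is unchanged
blocks_sketch <- function(block_cols, n_units, implicitBlocks = FALSE) {
  if (is.null(block_cols) || ncol(block_cols) == 0) {
    if (implicitBlocks) {
      return(data.frame(blocks = rep(1L, n_units)))  # internal fallback
    }
    warning("blocks not found in design")  # mirrors current behavior
    return(NULL)
  }
  block_cols
}
```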
I went forward with the third option. There's a calculation error to be ironed out:
```r
> des <- rct_design(z ~ uoa(uoa1, uoa2), data = simdata)
> lmitt(y ~ x, data = simdata, design = des, absorb = TRUE)$coeff
(Intercept)          z.           x        z._x 
 0.03920000  0.09294351 -0.22526910  0.24941721 
> lmitt(y ~ x, data = simdata, design = des, absorb = FALSE)$coeff
(Intercept)          z.           x        z._x 
 0.02674270  0.09294351 -0.22526910  0.24941721 
```
but making progress. Out of time to continue right now; I'll return to this on Wednesday or Friday with the current plan, unless I hear otherwise.
I'm happy with the third option. The remaining calculation error could be folded into #188, which proposes adding a separate step of re-calculating intercepts along the `absorb=TRUE` path generally. (The example helped me catch an error, which I think I've now corrected, in my last comment over there about how to do that re-calculation. Thanks!)
I'm actually not sure whether the result I showed in the last comment is an error or not - it's been long enough since I've thought about the residualization of the intercept that I keep going back and forth in my mind. In the case with subgroups, we expect the intercept to be identically 0. Do we expect that in this case?
Alternatively, do you have a worked example of what the result should look like for comparison?
With no moderator[^1], we expect that if we absorb on the right while also fitting an intercept as well as a treatment variable, then that intercept should be identically 0. However, with a continuous moderator this is not the case, unless that moderator happens to be mean-centered already.
That said, to be clear: once #188 is implemented, we'll no longer be returning that numerical zero in the `(Intercept)` coefficient position. Instead we'll be doing another calculation, an averaging of outcomes under the control condition, and returning it there instead. (With a corresponding replacement of that estimating function column.) This alternate intercept calculation will better align with what's returned in the `(Intercept)` position after fitting with IPW weights rather than absorption, and facilitate interpretation of the results.
[^1]: The situation with subgroups has some subtleties I can't stop to review, so commenting on main effects instead.
Ok, so then I still have a numeric issue somewhere:
```r
> des <- rct_design(z ~ uoa(uoa1, uoa2), data = simdata)
> lmitt(y ~ 1, data = simdata, design = des, absorb = FALSE)$coeff
(Intercept)          z. 
    -0.0090      0.1205 
> lmitt(y ~ 1, data = simdata, design = des, absorb = TRUE)$coeff
(Intercept)          z. 
     0.0392      0.1205 
```
Weird. Maybe I spoke too fast when I said the intercept should be identically 0. Perhaps that's only true with both-sides absorption.
(Again, if this issue is only affecting the intercept then it won't matter after #188.)
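A toy check of that conjecture with simulated data (not propertee): demeaning only the treatment indicator leaves the intercept at the outcome mean, while demeaning both sides drives it to zero.

```r
# One-sided vs. both-sides absorption of the constant block,
# on simulated data
set.seed(1)
z <- rep(0:1, each = 10)
y <- 0.5 + 0.3 * z + rnorm(20)
zd <- z - mean(z)                               # absorb z only
coef(lm(y ~ zd))[["(Intercept)"]]               # equals mean(y), not 0
coef(lm(I(y - mean(y)) ~ zd))[["(Intercept)"]]  # numerically zero
```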
Just noting a connection to #69. Point 2 of this comment on that thread is addressed by this issue.
My thought is that it's because we're still regressing on an intercept column of 1's and not centering the outcomes. We drop the intercept column that should be all zeros here:
https://github.com/benbhansen-stats/propertee/blob/5976f8d8dadc0d4afeafb6f668a259d97fadc41b/R/lmitt.R#L304
Then we regress with an intercept based on the formula here:
https://github.com/benbhansen-stats/propertee/blob/5976f8d8dadc0d4afeafb6f668a259d97fadc41b/R/lmitt.R#L321-L325
We shouldn't really be looking at the estimated `(Intercept)` anyway, right? If we consider the regression the way Angrist and Pischke do on p. 223, there technically isn't any intercept term to regress on, since it's been differenced out.
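That differencing-out view can be checked on simulated blocked data via Frisch-Waugh-Lovell: within-block demeaning reproduces the treatment coefficient from a block fixed-effects fit, and the demeaned regression has no intercept of its own to interpret.

```r
# FWL illustration (simulated data, not propertee): block-demeaned
# regression matches the fixed-effects treatment coefficient
set.seed(2)
blk <- rep(1:4, each = 5)
z <- rbinom(20, 1, 0.5)
y <- 0.2 * z + blk / 2 + rnorm(20)
dm <- function(v) v - ave(v, blk)            # demean within block
b_absorb <- coef(lm(dm(y) ~ dm(z) - 1))[[1]]
b_fe     <- coef(lm(y ~ z + factor(blk)))[["z"]]
all.equal(b_absorb, b_fe)  # TRUE
```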
I've merged the branch into main. Regardless of any outstanding calculation concern, it passes tests/checks. @benthestatistician up to you to keep open or close.
Can you add something to the NEWS file?
If absorption is requested when there are no blocks, `lmitt.formula()` is currently set to stop (lmitt.R lines 245-53). I see an advantage in instead taking the entire comparison sample to be a single block, then proceeding with the rest of the absorption logic, i.e., partialling out the intercept from the independent variables.