`cov.adj` when `Design` and `lmitt` disagree on clusters

josherrickson commented 1 year ago

If the Design object has fewer clusters than the data being passed to the model, cov_adj() still returns values for the clusters "missing" from the Design.

> nrow(unique(simdata[, c("cid1", "cid2")]))
[1] 10
> des <- obs_design(z ~ uoa(cid1, cid2) + block(bid), data = simdata[1:44, ])
> nrow(units_of_assignment(des))
[1] 9

So now simdata contains one additional cluster which is not found in des.

> ate(des, data = simdata)
 [1] 1.333333 1.333333 1.333333 1.333333 1.333333 1.333333 1.333333 1.333333
 [9] 1.333333 1.333333 1.333333 1.333333 1.333333 1.333333 4.000000 4.000000
[17] 4.000000 4.000000 4.000000 4.000000 1.500000 1.500000 1.500000 1.500000
[25] 1.500000 1.500000 1.500000 1.500000 1.500000 1.500000 3.000000 3.000000
[33] 3.000000 3.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
[41] 2.000000 2.000000 2.000000 2.000000       NA       NA       NA       NA
[49]       NA       NA

The weight functions properly flag the observations with "new" clusters as NA. (assigned() also operates this way.)

> cov_adj(lm(y ~ x, data = simdata), newdata = simdata, design = des)
 [1] -0.267310755  0.205238716 -0.195341394 -0.247757156  0.032220858
 [6]  0.384993830 -0.035601136  0.118595399 -0.080830105 -0.008661778
[11] -0.073181323 -0.012872646 -0.121606785  0.073734839  0.144460133
[16]  0.118466494  0.280779823  0.162540637  0.113210954  0.068329012
[21]  0.088012727  0.314256235  0.153798380 -0.240418620 -0.056459921
[26] -0.252965338  0.076797243  0.045975111  0.059375420  0.205072420
[31]  0.153307569 -0.263790003  0.113668339 -0.150262240  0.183360868
[36]  0.315229132  0.079308421 -0.101459508 -0.130662955 -0.207113765
[41]  0.289720679 -0.258755571  0.133901123  0.010217533 -0.039765163
[46]  0.150680841  0.319963429  0.101756116  0.020841725  0.161521386

This is an issue in the absorb branch. Because we're doing so much manipulation of the objects and formulas, by the time we get to using the offset, the other pieces of the data going into the final lm() call have only, e.g., 44 rows, while the offset still has 50. So the call errors due to a length mismatch.

Should cov_adj() work this way?
If so, any ideas for workarounds?
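For readers without the absorb branch at hand, the length mismatch can be reproduced in base R alone. This is only a sketch: `toy` is made-up data (not the package's simdata), and `predict()` on a covariate-only `lm()` stands in for cov_adj().

```r
# Toy data (hypothetical): 50 rows, of which only the first 44 reach the
# final model fit.
set.seed(1)
toy <- data.frame(y = rnorm(50), x = rnorm(50), z = rep(0:1, 25))

# Stand-in for cov_adj(): covariate-model predictions for all 50 rows.
adj <- predict(lm(y ~ x, data = toy), newdata = toy)

# The final fit sees 44 rows of data but a length-50 offset, so building
# the model frame fails. tryCatch() captures the error instead of stopping.
res <- tryCatch(
  lm(y ~ z, data = toy[1:44, ], offset = adj),
  error = function(e) e
)

inherits(res, "error")
```

The failure comes from the model frame construction inside lm(), which requires every variable, including the offset, to have the same number of rows.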

benthestatistician commented 1 year ago

Interesting. Followup q's:

a. Is lmitt.R#L160-L166 where you're seeing this problem? Relatedly, b. I wonder why we're only seeing the problem now -- is it really something intrinsic to the absorption, or just that accommodating absorption is prodding us to test scenarios we didn't check previously, which could also arise without absorption?

If the latter, then maybe we need to expect similar problems on the lm(<...>, offset=cov_adj(<...>)) path, not only on the lmitt.formula(<...>, offset=cov_adj(<...>)) path; in any case we may have to rethink cov_adj(). If we do have these broader problems with cov_adj(), I see no reason to preserve its current behavior with respect to clusters not anticipated in the Design; rather it's whatever makes things work. (Perhaps the next task here is to map out the various contingencies and mock up tests of them, en route to settling on a behavior that we hope will fix this issue.)

If on the other hand the issue is intrinsic to the absorption scenario, then perhaps it could be fixed by separating lmitt.formula()'s reconstruction of offsets from its reconstruction of other model frame elements. Looking into the lmitt.formula() sources, if we stripped the offset argument off of the mf.call then we ought to be able to create mf without length mismatch issues. Then we might separately recreate the cov_adj offset, with its mismatching length, creating perhaps offset_mf. Finally we root around in the na.action's of mf and offset_mf in order to sort out what to use from the recreated offset. (In suggesting this approach I don't mean to suggest loyalty to having cov_adj() return objects of the length it's currently returning, only to suggest a preference for the more localized change.)
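A rough base-R sketch of that separation, not the actual lmitt.formula() code: the data frame `toy`, its weight column `w`, and the use of `predict()` as a stand-in for cov_adj() are all assumptions made for illustration. The names `mf` and `offset_mf` follow the comment above.

```r
# Toy data (hypothetical): the last 6 rows have NA weights, mimicking rows
# whose clusters are absent from the Design.
set.seed(1)
toy <- data.frame(y = rnorm(50), x = rnorm(50), z = rep(0:1, 25),
                  w = c(rep(1, 44), rep(NA, 6)))

# 1. Build the model frame WITHOUT the offset; NA weights drop 6 rows.
mf <- model.frame(y ~ z, data = toy, weights = w, na.action = na.omit)

# 2. Recreate the offset separately, at its full (mismatching) length.
offset_full <- predict(lm(y ~ x, data = toy), newdata = toy)

# 3. Use mf's na.action to decide which offset entries to keep.
dropped <- attr(mf, "na.action")
offset_mf <- if (is.null(dropped)) {
  offset_full
} else {
  offset_full[-as.integer(dropped)]
}

# The offset now lines up with the model frame and can be used in the fit.
fit <- lm(y ~ z, data = mf, offset = offset_mf)
```

Only the na.action of `mf` is consulted here; if the recreated offset could itself contain NAs, those positions would also need to be removed from `mf` before fitting.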

jwasserman2 commented 1 year ago

What's the issue it's causing? If the weights are NA for those observations, they should be dropped from the model fit anyway, so that shouldn't cause problems, right? I also think it makes sense as currently constructed that cov_adj() makes valid predictions for those observations, because it doesn't leverage any design information, only covariate information. I feel like it would make sense to keep the cov_adj() behavior as is and instead construct a vector indicating which observations to NA out or keep as is, based on whether they are part of clusters that match the provided Design object.
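One way such an indicator vector might be built, sketched in base R only. Everything here is hypothetical: `toy` is made-up data with cluster columns cid1 and cid2, `design_units` stands in for the table of units of assignment held by the Design, and `predict()` stands in for cov_adj().

```r
# Toy data (hypothetical): 10 clusters over 50 rows; the 10th cluster
# occupies the last 6 rows.
set.seed(1)
sizes <- c(4, 4, 6, 6, 6, 4, 4, 4, 6, 6)
cl <- rep(1:10, times = sizes)
toy <- data.frame(cid1 = (cl - 1) %/% 2 + 1,
                  cid2 = (cl - 1) %% 2 + 1,
                  y = rnorm(50), x = rnorm(50))

# Stand-in for the Design's units of assignment: built from the first
# 44 rows only, so it contains 9 of the 10 clusters.
design_units <- unique(toy[1:44, c("cid1", "cid2")])

# Collapse the two cluster columns into a single key for matching.
cluster_key <- function(df) paste(df$cid1, df$cid2, sep = "_")

# TRUE where an observation's cluster is present in the design.
in_design <- cluster_key(toy) %in% cluster_key(design_units)

# Keep full-length predictions, then NA out rows from unmatched clusters.
adj <- predict(lm(y ~ x, data = toy), newdata = toy)
adj[!in_design] <- NA
```

This keeps the covariate-only predictions untouched and applies the design information as a separate mask, so the result carries NAs in the same rows where the weight functions return them.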

josherrickson commented 1 year ago

Due to refactoring in the absorb branch, this is no longer an issue.

benbhansen-stats / propertee

`cov.adj` when `Design` and `lmitt` disagree on clusters #103