Closed jwasserman2 closed 1 year ago
For my own memory of what I've proposed as a solution:
The "Don't join" approach described above gives the same HC0 vcov estimates that you get if you join the samples more precisely, but at the expense of distorting the sample size. In turn this spoils degrees-of-freedom adjustments built into other HC* methods.
(My understanding is that at present we do join the tables in simple cases, including those where units of assignment coincide with units of observation in both C and Q samples, reverting to "Don't join" only in more complex circumstances.)
> The "Don't join" approach described above gives the same HC0 vcov estimates that you get if you join the samples more precisely, but at the expense of distorting the sample size.
I take you to mean here that by not exactly joining we're artificially increasing the sample size?
> (My understanding is that at present we do join the tables in simple cases, including those where units of assignment coincide with units of observation in both C and Q samples, reverting to "Don't join" only in more complex circumstances.)
Do you see this in the code somewhere? I don't believe there's any merging between the two samples anywhere. There's merging between C and the `Design` object's `structure` dataframe that happens when creating a `SandwichLayer`'s `keys` slot, but there isn't any merging between C and Q at the moment.
> I take you to mean here that by not exactly joining we're artificially increasing the sample size?
-Yep, that's what I had in mind.
> Do you see this in the code somewhere? I don't believe there's any merging between the two samples anywhere. There's merging between C and the Design object's structure dataframe that happens when creating a SandwichLayer's keys slot, but there isn't any merging between C and Q at the moment.
--No, that was based on a recollection of our conversations rather than code review. In the specific scenario I described above, isn't it the case that we get alignment of C and Q samples indirectly, in virtue of both of them lining up in a 1-1 fashion with the structure table in the Design object? That's all I was trying to say, not that there's an explicit join happening between C and Q.
We are going to change the `keys` slot of `SandwichLayer` so that, for unmatched observations in the covariance adjustment sample, the values in the columns specified in the `by` argument of `cov_adj()` are not `NA`s but rather the original values in the covariance adjustment dataset. This will make it easier to align rows in `estfun.DirectAdjusted()` and, in some cases, `.make_uoa_ids()`. To retain the information lost by not translating these values to `NA`s, we will add a Boolean column indicating whether the observation in the covariance adjustment sample matched to the quasiexperimental sample.
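As a rough illustration of the proposed `keys` layout (not the package's actual implementation), the sketch below uses plain base R data frames. The data, the column name `in_Q`, and the helper `collapse_keys()` are all hypothetical; `design_structure` merely stands in for the table that C is matched against.

```r
# Hypothetical data: `cov_adj_data` is the covariance adjustment sample (C);
# `design_structure` stands in for the table C is matched against.
cov_adj_data <- data.frame(
  uoa_id = c("a", "b", "c", "d"),
  x = c(1.2, 0.7, 3.1, 2.4)
)
design_structure <- data.frame(
  uoa_id = c("a", "b", "e"),
  z = c(1, 0, 1)
)

by_cols <- "uoa_id"  # columns named in the by argument (assumed for this sketch)

# Keep the original key values for every row of C, matched or not
keys <- cov_adj_data[, by_cols, drop = FALSE]

# Hypothetical helper: paste key columns together so that multi-column keys
# can be compared with %in%
collapse_keys <- function(df, cols) do.call(paste, c(df[cols], sep = "\r"))

# Boolean column flagging whether each row of C found a match
keys$in_Q <- collapse_keys(cov_adj_data, by_cols) %in%
  collapse_keys(design_structure, by_cols)

# The earlier NA-based representation can still be recovered when needed
old_style <- keys[, by_cols, drop = FALSE]
old_style[!keys$in_Q, by_cols] <- NA
```

With the original key values retained, rows of C can be aligned on their own identifiers, while the Boolean column carries the information that the `NA`s used to encode.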
I will update the `SandwichLayer` spec in the `vignettes` folder to reflect this update.
Per my and @jwasserman2's offline discussion, there are some broader issues lurking behind this one. Without explaining here, they lead me to ask about the feasibility of adding a `by=` argument to `cov_adj()` that would be used to collect from the user the names of key variables for aligning the covariance adjustment samples with the other tables we need. I'll state a preview of the issues here in hopes of discussing at an upcoming ftf meeting; @josherrickson @jwasserman2
The primary purpose of this `by=` argument would be to give users a means of specifying variables with which to join the covariance and effect estimation samples. Given an empty `by=` argument, we might join on variables with common names. (Or perhaps a slight variant of this that respects any translation between key variables of the Design and effect estimation tables that was already communicated in setting up a WeightedDesign, i.e. passed to `ett()` or `eatt()` via a `by=` argument.) This would help us keep better track of the sample size (in observation, not assignment, units) of the combined covariate adjustment and quasiexperimental samples. Whether taken from a user-provided `by=` argument to `cov_adj()` or just inferred by comparing column names in the provided tables, the covariance adjustment data's key columns for joining to the effect estimation data could be stored as part of the `keys` slot of the `SandwichLayer` object that is returned by `cov_adj()`.
(Secondarily, we presently align the cov_adj sample with the `Design@Structure` table by looking in the cov_adj table for variables with the same names as key variables listed in the `Design@Structure` table. This `by=` argument I'm envisioning might also be used to offer a means of proceeding when the key variables have different names in these two tables. But perhaps this overcomplicates matters.)
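To make the two uses of `by=` concrete, here is a minimal base R sketch. `resolve_by()` is a hypothetical helper, not an existing function in the package. The convention that names of `by` refer to the other table and values to the covariance adjustment data is also an assumption made only for this example.

```r
# Hypothetical helper: work out which columns to join on between the
# covariance adjustment data and another table.
resolve_by <- function(cov_adj_data, other, by = NULL) {
  if (is.null(by)) {
    # Empty by: fall back to columns with common names
    by <- intersect(names(cov_adj_data), names(other))
    if (length(by) == 0) stop("no common column names to join on")
    return(list(x = by, y = by))
  }
  # Named by, e.g. c(school = "schoolid"): names are columns of `other`,
  # values are columns of `cov_adj_data`; unnamed entries are shared names
  y_names <- if (is.null(names(by))) {
    by
  } else {
    ifelse(names(by) == "", by, names(by))
  }
  list(x = unname(by), y = unname(y_names))
}

# Illustrative data only
cov_adj_data <- data.frame(schoolid = c(1, 2, 3), pretest = c(10, 12, 9))
q_data <- data.frame(schoolid = c(1, 2), y = c(5, 7))
structure_tbl <- data.frame(school = c(1, 2), z = c(1, 0))

# Primary use: join C and Q on commonly named columns, keeping all rows so
# the combined sample size is counted in observation units
cols <- resolve_by(cov_adj_data, q_data)
joined <- merge(cov_adj_data, q_data, by.x = cols$x, by.y = cols$y, all = TRUE)

# Secondary use: key variables named differently in the two tables
cols2 <- resolve_by(cov_adj_data, structure_tbl, by = c(school = "schoolid"))
aligned <- merge(cov_adj_data, structure_tbl, by.x = cols2$x, by.y = cols2$y)
```

Here `joined` has one row per distinct observation across the two samples rather than one row per stacked record, which is the sample-size bookkeeping the `by=` proposal is meant to support.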