Intelligent merging of C and Q samples for estfun.DA() & related functions

benthestatistician commented 1 year ago

Following an offline discussion with @jwasserman2, there are some broader issues lurking behind this one. Without explaining them in full here, they lead me to ask about the feasibility of adding a by= argument to cov_adj(), which would collect from the user the names of the key variables for aligning the covariance adjustment samples with the other tables we need. I'll state a preview of the issues here in hopes of discussing at an upcoming ftf meeting; @josherrickson @jwasserman2

The primary purpose of this by= argument would be to give users a means of specifying the variables with which to join the covariance adjustment and effect estimation samples. Given an empty by= argument, we might join on variables with common names. (Or perhaps a slight variant of this that respects any translation between key variables of the Design and effect estimation tables that was already communicated in setting up a WeightedDesign, i.e. passed to ett() or att() via a by= argument.) This would help us keep better track of the sample size (in observation units, not assignment units) of the combined covariance adjustment and quasiexperimental samples. Whether taken from a user-provided by= argument to cov_adj() or just inferred by comparing column names in the provided tables, the covariance adjustment data's key columns for joining to the effect estimation data could be stored as part of the keys slot of the SandwichLayer object that is returned by cov_adj().
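
As a rough illustration of the key-resolution rule described above, here is a minimal language-agnostic sketch in Python rather than R. The function name `resolve_join_keys`, the column names, and the dict form of `by` are all hypothetical; nothing here is taken from the propertee code base.

```python
def resolve_join_keys(cov_adj_cols, effect_cols, by=None):
    """Return (cov_adj_name, effect_name) pairs to join on.

    `by` is a hypothetical argument: None, a list of shared column names,
    or a dict mapping covariance adjustment names to effect estimation names.
    """
    if by is None:
        # Empty by=: fall back to columns whose names appear in both tables.
        return [(c, c) for c in cov_adj_cols if c in effect_cols]
    if isinstance(by, dict):
        pairs = list(by.items())
    else:
        pairs = [(name, name) for name in by]
    for left, right in pairs:
        if left not in cov_adj_cols or right not in effect_cols:
            raise KeyError(f"join key {left!r} -> {right!r} not found in both tables")
    return pairs


# Made-up column names for the two samples.
cov_adj_cols = ["schoolid", "studentid", "pretest"]
effect_cols = ["schoolid", "student", "treat", "posttest"]

# No by=: only the commonly named column is used.
inferred = resolve_join_keys(cov_adj_cols, effect_cols)

# Explicit by= with a translation between differently named key columns.
explicit = resolve_join_keys(
    cov_adj_cols, effect_cols,
    by={"schoolid": "schoolid", "studentid": "student"},
)
```

In this toy case the inferred join would use only the school ID, while the explicit mapping also links the student-level keys, which is what tracking sample size in observation units would need.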

(Secondarily, we presently align the cov_adj sample with the Design@Structure table by looking in the cov_adj table for variables with the same names as key variables listed in the Design@Structure table. This by= argument I'm envisioning might also be used to offer a means of proceeding when the key variables have different names in these two tables. But perhaps this overcomplicates matters.)

jwasserman2 commented 1 year ago

For my own memory of what I've proposed as a solution:

Don't join--keep the same number of rows from each table and don't condense
Sort the table according to the overlapping ID columns, which would result in the kind of structure in Ben's diagram but for each student (so the overall table would alternate between 0's in the upper RHS and 0's in the lower LHS for each student)
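
The two steps above can be sketched as follows, in Python for illustration only. The numbers are invented, and placing the C columns on the left and the Q columns on the right is an assumption about the layout in Ben's diagram.

```python
# Hypothetical per-row estimating-function contributions, keyed by student ID.
c_rows = [("s2", [0.4]), ("s1", [-0.1]), ("s3", [0.7])]  # covariance adjustment sample (C)
q_rows = [("s1", [0.5, -0.2]), ("s2", [0.3, 0.1])]       # quasiexperimental sample (Q)

n_c_cols = 1  # columns contributed by the C model (assumed to sit on the left)
n_q_cols = 2  # columns contributed by the Q model (assumed to sit on the right)

# "Don't join": keep every row from each table, padding the other
# sample's columns with zeros instead of condensing rows.
stacked = (
    [(sid, 0, psi + [0.0] * n_q_cols) for sid, psi in c_rows]
    + [(sid, 1, [0.0] * n_c_cols + psi) for sid, psi in q_rows]
)

# Sort on the overlapping ID column (C row first within each student), so
# zeros alternate between the right-hand and left-hand blocks per student.
stacked.sort(key=lambda row: (row[0], row[1]))

for sid, source, psi in stacked:
    print(sid, "C" if source == 0 else "Q", psi)
```

The stacked table has as many rows as C and Q combined (five here), even though only three distinct students appear.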

benthestatistician commented 1 year ago

The "Don't join" approach described above gives the same HC0 vcov estimates that you get if you join the samples more precisely, but at the expense of distorting the sample size. In turn this spoils degrees-of-freedom adjustments built into other HC* methods.
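
To make the sample-size point concrete, here is a small arithmetic sketch in Python. It uses the standard HC1 scaling n / (n - k) as the example of a degrees-of-freedom adjustment; the ID sets and the value of k are invented, and each student is assumed to contribute one row per sample.

```python
# Hypothetical student IDs: every Q student also appears in C.
c_ids = {"s1", "s2", "s3", "s4", "s5", "s6"}
q_ids = {"s1", "s2", "s3", "s4"}
k = 2  # number of estimated parameters (illustrative value)

# "Don't join": every row of both tables counts as a separate observation.
n_stacked = len(c_ids) + len(q_ids)

# Precise join: each distinct student counts once.
n_joined = len(c_ids | q_ids)


def hc1_factor(n, k):
    """Standard HC1 scaling applied to the HC0 estimate."""
    return n / (n - k)


print(n_stacked, hc1_factor(n_stacked, k))
print(n_joined, hc1_factor(n_joined, k))
```

With these made-up numbers the stacked table reports 10 observations against 6 after joining, so the HC1 multiplier is smaller under "Don't join" even though the HC0 estimate it scales is unchanged.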

(My understanding is that at present we do join the tables in simple cases, including those where units of assignment coincide with units of observation in both C and Q samples, reverting to "Don't join" only in more complex circumstances.)

jwasserman2 commented 1 year ago

> The "Don't join" approach described above gives the same HC0 vcov estimates that you get if you join the samples more precisely, but at the expense of distorting the sample size.

I take you to mean here that by not exactly joining we're artificially increasing the sample size?

> (My understanding is that at present we do join the tables in simple cases, including those where units of assignment coincide with units of observation in both C and Q samples, reverting to "Don't join" only in more complex circumstances.)

Do you see this in the code somewhere? I don't believe there's any merging between the two samples anywhere. There's merging between C and the Design object's structure dataframe that happens when creating a SandwichLayer's keys slot, but there isn't any merging between C and Q at the moment.

benthestatistician commented 1 year ago

> I take you to mean here that by not exactly joining we're artificially increasing the sample size?

-Yep, that's what I had in mind.

> Do you see this in the code somewhere? I don't believe there's any merging between the two samples anywhere. There's merging between C and the Design object's structure dataframe that happens when creating a SandwichLayer's keys slot, but there isn't any merging between C and Q at the moment.

--No, that was based on a recollection of our conversations rather than code review. In the specific scenario I described above, isn't it the case that we get alignment of C and Q samples indirectly, in virtue of both of them lining up in a 1-1 fashion with the structure table in the Design object? That's all I was trying to say, not that there's an explicit join happening between C and Q.
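
A toy sketch of that indirect alignment, in Python for illustration: when each row of C and each row of Q corresponds to exactly one row of the structure table, the two samples can be lined up through it without joining them to each other. The unit labels and variable names are hypothetical.

```python
# Hypothetical units of assignment, in the order of the Design's structure table.
design_units = ["u1", "u2", "u3", "u4"]
row_of = {unit: i for i, unit in enumerate(design_units)}

# C and Q each hold one row per unit, in arbitrary order.
c_units = ["u3", "u1", "u4", "u2"]
q_units = ["u2", "u4", "u1", "u3"]

# For each structure-table row, record where that unit sits in C and in Q.
c_at = {row_of[u]: i for i, u in enumerate(c_units)}
q_at = {row_of[u]: i for i, u in enumerate(q_units)}

# Pairs of (C row index, Q row index), aligned without a direct C-Q join.
aligned = [(c_at[r], q_at[r]) for r in range(len(design_units))]
print(aligned)
```

This only works because the mapping is 1-1 on both sides; once a unit of assignment contains several observations, the structure table alone no longer pins down which C row goes with which Q row.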

jwasserman2 commented 1 year ago

We are going to change the keys slot of SandwichLayer so that, for unmatched observations in the covariance adjustment sample, the values in the columns specified in the by argument of cov_adj() are not NA's but rather the original values in the covariance adjustment dataset. This will make it easier to align rows in estfun.DirectAdjusted() and, in some cases, .make_uoa_ids(). To retain the information lost by no longer translating these values to NA's, we will add a Boolean column indicating whether the observation in the covariance adjustment sample matched to the quasiexperimental sample.
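
A before-and-after sketch of that change to the keys table, in Python for illustration. The column names, the `in_Q` flag name, and both helper functions are hypothetical, and `None` stands in for R's NA.

```python
# Hypothetical covariance adjustment rows and by= columns.
cov_adj_rows = [
    {"schoolid": "A", "studentid": 1},
    {"schoolid": "A", "studentid": 2},
    {"schoolid": "C", "studentid": 9},  # has no match in Q
]
by = ["schoolid", "studentid"]

# Key combinations present in the quasiexperimental sample (made up).
q_keys = {("A", 1), ("A", 2), ("B", 3)}


def keys_old(rows):
    """Earlier behaviour as described: unmatched rows get missing keys."""
    out = []
    for row in rows:
        matched = tuple(row[c] for c in by) in q_keys
        out.append({c: (row[c] if matched else None) for c in by})
    return out


def keys_new(rows):
    """Proposed behaviour: keep original key values, add a Boolean flag."""
    out = []
    for row in rows:
        entry = {c: row[c] for c in by}
        entry["in_Q"] = tuple(row[c] for c in by) in q_keys
        out.append(entry)
    return out


print(keys_old(cov_adj_rows)[2])
print(keys_new(cov_adj_rows)[2])
```

Under the new layout the unmatched row keeps its school and student IDs, so it can still be sorted and aligned, while the flag carries what the NA's used to encode.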

I will update the SandwichLayer spec in the vignettes folder to reflect this update.

benbhansen-stats / propertee

Intelligent merging of C and Q samples for estfun.DA() & related functions #99