Designs with time-varying treatment assignment (some units appear both as treated and control units in the data)

jwasserman2 commented 6 months ago

Suppose a user has repeated measures data where units are assigned to treatment at different times:

time_dat <- data.frame(
  id = rep(letters[1:6], each = 2),
  year = rep(c(2020, 2021), 6),
  year_trt = rep(c(2020, 2021, NA_integer_), each = 4),
  trt = c(rep(1, 4), rep(c(0, 1), 2), rep(0, 4)),
  y = rnorm(12)
)
> time_dat
   id year year_trt trt           y
1   a 2020     2020   1 -0.27735593
2   a 2021     2020   1  1.08304734
3   b 2020     2020   1  0.26373897
4   b 2021     2020   1 -0.64004253
5   c 2020     2021   0  0.16796533
6   c 2021     2021   1 -0.93694611
7   d 2020     2021   0  2.12654221
8   d 2021     2021   1  1.16774163
9   e 2020       NA   0  1.14521015
10  e 2021       NA   0 -0.45782338
11  f 2020       NA   0 -0.30100410
12  f 2021       NA   0  0.06839782

Units appear multiple times in the dataset, sometimes with differing assignment values, making assignment status specific to id and year. The user should be able to create a Design object with something like:

*_design(trt ~ unitid(id), data = time_dat, dichotomy = year >= year_trt ~ .)

The unitid\cluster\unit_of_assignment should be id, but each unit should be allowed to have rows in the dataframe with different assignment statuses as long as the assignment statuses match the dichotomy. We can see this is the case for the user's data since trt == 1 only when year >= year_trt.

This requires a few updates:

making the accommodation above, here: https://github.com/benbhansen-stats/propertee/blob/f5a26bf8d90f794d92d63761184d9b1ac9c0f9d7/R/Design.R#L199-L213

allowing non-treatment variables in the dichotomy. Previously, this affected assigned()+ its aliases, ate() and ett(), but after #162, this affects the creation of Design objects:

> time_dat <- data.frame(
+   id = letters[1:6],
+   trt_year = rep(c(2020, 2021, NA_integer_), each = 2),
+   trt_ever = c(rep(1, 4), rep(0, 2))
+ )
> des <- obs_design(trt ~ unitid(id), time_dat, dichotomy = year ~ is.na(year))
Error in eval(predvars, data, env) : object 'year' not found

This error arises in treatment(). treatment() calls.bin_txt(), shown below, and calls .apply_dichotomy(treatment(des, binary = FALSE), des@dichotomy). But treatment(des, binary = FALSE) only returns the original treatment column in the design data when in this case, .apply_dichotomy() also needs the columns specified in the dichotomy. This leads to the error above. https://github.com/benbhansen-stats/propertee/blob/f5a26bf8d90f794d92d63761184d9b1ac9c0f9d7/R/DesignAccessors.R#L171-L181

My proposal for addressing this would be to add columns specified in the dichotomy to the structure slot of a Design object. We could then use logic like .get_col_from_new_data(x, newdata, type = "d", by), where "d" stands for "dichotomy", to access them when needed.

updating ate() and ett(), if need be, to give appropriate weights to units that are sometimes control and sometimes treatment
making sure blocks can handle units that have varying assignment statuses

These potential changes touch a lot of our codebase, so I think discussing the best way to approach it here before diving in on it would be a good idea.

benthestatistician commented 6 months ago

The approach we aimed to support and implement is a bit narrower than what's proposed here, serving a subset of the use cases that this would serve. (With apologies for not remembering to note this when we spoke earlier today!) In this other approach, units of assignment have only one treatment status, but that status is permitted to be ordinal. In the toy data at the top of this issue, units a and b would have trt=='2020', units c and d would have trt=="2021", and units e and f would have trt=".", say, with trt being an ordered factor with levels ordering "." < "2021" < "2020" .

Let's suppose that in the toy example above the blocks are { a, e} and {b , c , d, f}, and suppose we want to do weighting by the odds, separately in years 2020 and 2021. The pair { a, e} is easy: in both years it's got 1 treatment and one control, and both receive weights of 1. For the 4-tuple {b , c , d, f}, the right answer depends on the year: in year 2020 there's just one of the 4 under treatment, b, so its weight is 1 and the others' weights should be 1/3 each; in year 2021 b , c , and d are each under treatment, so their weights are 1 and f's weight is 3. I think that's what ett() will give us for an appropriately encoded design and an appropriate invocation of the function.

For this I think the design should be encoded without a dichtomy , with the above communicated through the formula passed to ett() via its dichotomy= argument. It's getting late on me know, so I figure I'll check on this in the a.m.

jwasserman2 commented 6 months ago

assigned() doesn't currently accept a dichotomy argument, so I think the Design needs a dichotomy for assigned() to create the trt column of my example dataset. Could you explain how you see the Design obtaining information of how to weight separately in different years? In my mind, you would need a dichotomy that incorporates a column denoting assignment status in a given year, like the one I proposed above. This still necessitates modification b from my first comment. Maybe you have something else in mind, though.

josherrickson commented 6 months ago

I'm trying to wrap my head around this issue. Is this similar to the discussion at https://github.com/benbhansen-stats/propertee/issues/30#issuecomment-1064480711?

jwasserman2 commented 6 months ago

I believe my current thoughts are exactly what you've written there. How did you address that using dichotomy?

benthestatistician commented 6 months ago

OK, so I've done a bit of the experimenting I had resolved to do last night. What I was envisioning does seem to be doable, if not in the smoothest way. The toy example Josh W provided could help us improve.

Josh W's correct that assigned() won't currently do what I was hoping for. Neither do ett() and ate(), although they're closer. 73eee5c puts one example of their use into a new test, in test.CombinedWeightedDesign.R, on new branch i30_dichotomy. I'm going to reopen #30 and record some other comments over there.

josherrickson commented 6 months ago

Well, we ended up with dichotomization + CombinedWeightedDesigns. What is stopping a similar process from being used here? Generate a single-dimensional treatment variable (paste(year, trt)), use dichomization to generate weights per year, then combine them?

benthestatistician commented 6 months ago

I somewhat misunderstood @jwasserman2's proposal at the top of this issue when I first replied to it. (In the passage below I was taking "the dataframe" to refer to a table stored within the Design object, but I see now that it was mean to refer to the analysis dataframe (time_dat) in Josh's example:

each unit should be allowed to have rows in the dataframe with different assignment statuses as long as the assignment statuses match the dichotomy. We can see this is the case for the user's data since trt == 1 only when year >= year_trt.

Also "assignment status" refers to a binary treatment variable as would be created by assigned(), perhaps distinct from the t column of the Design@structure data frame.) Now that I better understand the gist of the proposal, I'm a fan.

I'd like to suggest a modification to it making it a smaller change to the Design class spec. Specifically, I'd omit the part of hte proposal about adding columns to the @structure tables of design-class objects. I think the proposal should leave that as-is and instead change the meaning of the Design@dichotomy slot. The proposal should require that its expressions always evaluate to the correct binary treatment specification within a data frame merging the analysis data with the Design's @structure data frame.

Turning now to @josherrickson's comment above (suggesting per-year dichotomization followed by combining the years), a downside of the amended proposal is that WeightedDesigns couldn't be freely c()'ed together as they are now while also preserving the new semantics of their dichotomy slots. As a result this would interfere with the compute-separately-by-chunk approach that Josh E's latest comment references. But that isn't such an appealing workflow, and I think we should be open to retiring it if we can get its benefits in other ways.

Additional notes:

@josherrickson: I think our earlier discussions that led to dichotomization + CombinedDesignWeights were hampered by not having a clean simple example. Now that Josh W has given us one of those, I'm hoping we can make some more progress? See my comment reopening #30. Addressing your question just above, the main outstanding challenge is to generate a treatment variable for each year (that propertee functions will recognize as such) alongside of the weights we create for each year.

@jwasserman2: I think the test I added on the i30_dichotomy branch addresses your request that I "explain how [I] see the Design obtaining information of how to weight separately in different years," if not necessarily with elegance. When in the same followup comment you referenced "modification b from my first comment", I think you were pointing to the proposal rendering as proposal 2 in your initial comment, "allowing non-treatment variables in the dichotomy"? Please correct me if I'm mistaken.

jwasserman2 commented 6 months ago

@benthestatistician thanks for pointing me to your branch. That setup makes a lot of sense to me. Yes, modification 2 referred to allowing non-treatment variables to be reference in the dichotomy. The way you set it up in your branch avoids the need for it, though.

benthestatistician commented 6 months ago

Let's continue this discussion over in #30, where in the comments I have a proposal incorporating important elements of the proposal at the top of this thread. My proposal differs from the above in ways indicated by my comments above, and some others. To avoid confusion moving forward I'm going to close this issue out.

benbhansen-stats / propertee

Designs with time-varying treatment assignment (some units appear both as treated and control units in the data) #170

Additional notes: