Clustering for small blocks

jwasserman2 commented 8 months ago

Based on discussions with @xinhew0708 and @benthestatistician, there are a couple of pieces of functionality we want for small-block clustering:

[x] A function identify_small_blocks() that counts the treated and control units of assignment in each block and indicates whether the block has only one treated or only one control unit.
[x] Model-based SE calculations should use the results of identify_small_blocks() to change the clustering level for units of assignment in small blocks to the block level.

We discussed pulling clustering information from a stored dataframe that has a unit of assignment column and a cluster column. When creating a Design object, we could store a base version of this dataframe as a slot based on the unitid()/uoa()/cluster() part of the formula. All vcovDA() calls would pull the cluster column, but for model-based calls, the cluster IDs given by unitid()/uoa()/cluster() would be replaced by block IDs for small blocks (based on the results of identify_small_blocks()). Additionally, if a user specifies a different clustering level using the cluster argument of vcovDA(), the values of the cluster column would be updated to reflect the specified clustering level.
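
A rough base-R sketch of the replacement step described above. The dataframe layout, the column names (uoa, block, cluster), the "block_" prefix, and the small_blocks vector are all assumptions for illustration, not the package's stored slot or the actual output of identify_small_blocks():

```r
# Hypothetical lookup table: one row per unit of assignment.
# Column names are illustrative only.
cluster_df <- data.frame(
  uoa     = c("a", "b", "c", "d", "e", "f", "g"),
  block   = c(1, 1, 1, 2, 2, 2, 2),
  cluster = c("a", "b", "c", "d", "e", "f", "g"),
  stringsAsFactors = FALSE
)

# Assumed shape of the small-block indicator: a named logical vector
# with one entry per block (TRUE = small block).
small_blocks <- c("1" = TRUE, "2" = FALSE)

# Look up each unit's block, then overwrite the cluster ID with a
# block-level ID wherever that block is flagged as small.
in_small <- unname(small_blocks[as.character(cluster_df$block)])
cluster_df$cluster[in_small] <- paste0("block_", cluster_df$block[in_small])
```

Prefixing the block ID is one way to keep block-level cluster IDs from colliding with existing unit-level IDs; any scheme that keeps the two sets of IDs distinct would do.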

Let me know if anyone sees an issue with this approach, or if it differs from what they had in mind.

josherrickson commented 8 months ago

Can we make identify_small_blocks more generic? Something like block_sizes, and then call something like

bs <- block_sizes(...)
is_small <- bs[bs$n_control == 1 | bs$n_treatment == 1, ]

Seems like it may be more generally useful.

jwasserman2 commented 8 months ago

It seems like design_table(design_object, "t", "b") already gets us the block sizes:

data(simdata)
des <- rct_design(z ~ cluster(cid1, cid2) + block(bid), simdata)
design_table(des, "t", "b")
      treatment
blocks 0 1
     1 3 1
     2 2 1
     3 1 2

identify_small_blocks() can wrap around that and convert the output into a logical vector of the same length as the row dimension, with names given by the row names.
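
A minimal sketch of that wrapper in base R. The table is rebuilt by hand from the output above as a plain matrix, since the actual return class of design_table() may differ, and the function name identify_small_blocks_sketch is hypothetical:

```r
# Stand-in for the design_table(des, "t", "b") output shown above:
# rows are blocks, columns are treatment levels (0 = control, 1 = treated).
tab <- matrix(
  c(3, 1,
    2, 1,
    1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(blocks = c("1", "2", "3"), treatment = c("0", "1"))
)

# Hypothetical wrapper: flag a block as small when it has exactly one
# control or exactly one treated unit of assignment.
identify_small_blocks_sketch <- function(tab) {
  apply(tab == 1, 1, any)
}

# Named logical vector, one entry per row (block) of the table.
small <- identify_small_blocks_sketch(tab)
```

In this example every block has either a single treated or a single control unit, so all three blocks are flagged.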

jwasserman2 commented 8 months ago

Also, unitids()/units_of_assignment()/clusters() (whichever one matches the function used in the Design formula) returns the unit of assignment ID columns. So to create the base dataframe I described above, we would only need to add a cluster ID column to that output:

clusters(des)
   cid1 cid2
1     1    1
2     1    2
3     2    1
4     2    2
5     3    1
6     3    2
7     4    1
8     4    2
9     5    1
10    5    2
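
As a rough illustration of adding that column, here is a base-R sketch that rebuilds the output above by hand. The column name cluster and the "_" separator are assumptions, and it assumes that by default each unit of assignment is its own cluster:

```r
# Stand-in for the clusters(des) output shown above.
uoa_ids <- data.frame(
  cid1 = rep(1:5, each = 2),
  cid2 = rep(1:2, times = 5)
)

# Add a cluster ID column built from the unit of assignment ID columns,
# so every unit of assignment starts out as its own cluster.
uoa_ids$cluster <- paste(uoa_ids$cid1, uoa_ids$cid2, sep = "_")
```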

josherrickson commented 8 months ago

Ah, apparently my idea was so good that I'd already implemented it and forgotten about it.

jwasserman2 commented 8 months ago

See #161 for implementation

benbhansen-stats / propertee

Clustering for small blocks #154