dpuelz / BicliqueRT

Randomization Tests of Causal Effects under General Interference
https://arxiv.org/pdf/1910.10862.pdf
6 stars 4 forks source link

Regarding the power of the testing #21

Closed samzl1 closed 1 year ago

samzl1 commented 1 year ago

Hi David,

Thank you for solving the size issue! Sorry for the back and forth. I really want to implement your package in my paper; that's why I dig so deep into the details...

This time is regarding the power of the test. I'm trying to replicate the results from section 6.2 based on the cluster interference code you have. I used N=300 and K=20, which is the same as the left-hand side plot in the paper. I generated 5,000 different assignments and construct the null exposure graph for the biclique method with mina = 30, as in the example code. For each \tau, I generated 2,000 data sets from the DGP with out_bassefeller function and tau_equal = T; \tau is varied among 50 equally spaced values from 0 to 1. The code I used to produce the power plot of the test is attached.

As you can see, the power increased slowly and was capped at 0.8, which looks different from the graph in the paper. Is there any setting I should be aware of for a better power?

github_try

rm(list=ls()) library(BicliqueRT) library(doParallel) library(doRNG)

set.seed(1)

N = 300 # total number of units K = 20 # 20, 30, 75 total number of households, i.e., number of clusters

num_randomizations = 5e3 sim_randomizations = 2e3 tau_length = 50 alpha=0.05

housestruct = out_house_structure(N, K, T) treateffect = seq(0, 1, length.out = tau_length)

The design function:

design_fn = function(){ treatment_list = out_treat_household(housestruct, K1 = K/2) # one unit from half of the households would be treated. treatment = out_Z_household(N, K, treatment_list, housestruct) return(treatment[,'treat']) }

The exposure function: exposure for each unit i is zi + \sum{j \in [i]} z_j where [i] represents the cluster i is in.

exposure_i = function(z, i) {

find the household that i is in

house_ind = cumsum(housestruct) which_house_i = which.min(house_ind < i)

find lower and upper index of [i] in z

if (which_house_i == 1) {lower_ind = 1} else {lower_ind = house_ind[which_house_i-1] + 1} upper_ind = house_ind[which_house_i]

calculate exposure

exposure_z = z[i] + sum(z[lower_ind:upper_ind]) exposure_z }

The null

null_equiv = function(exposure_z1, exposure_z2) { ((exposure_z1 == 1) | (exposure_z1 == 0)) & ((exposure_z2 == 1) | (exposure_z2 == 0)) }

Generate a treatment realization and outcome

Z = design_fn() # randomly get one realization

Generate exposure under the realized Z

Z_exposure = rep(0, N); for (i in 1:N) { Z_exposure[i] = exposure_i(Z, i) }

Do biclique decomposition on the null exposure graph

H0 = list(design_fn=design_fn, exposure_i=exposure_i, null_equiv=null_equiv) bd = biclique.decompose(Z, H0, controls= list(method="greedy", mina=30, num_randomizations = 5e3))

Define a test statistic, here we use the contrast of mean between units with exposure 1 (untreated in treated cluster) and exposure 0 (untreated in untreated cluster)

Tstat = function(y, z, is_focal) { exposures = rep(0, N) for (unit in 1:N) { exposures[unit] = exposure_i(z, unit)[1] } stopifnot("all focals have same exposures" = (length(unique(exposures[is_focal]))>1) ) mean(y[is_focal & (exposures == 1)]) - mean(y[is_focal & (exposures == 0)]) }

powerbi<-function(Y){

tau=0 is the tau in the null: Y_i(b) = Y_i(a) + tau for all i.

Then run the test

testout = clique_test(Y, Z, Tstat, bd)

pval=testout$p.value # p-value of the test return(pval)

}

test_bi<-function(tt){

simulating an outcome vector

sim=lapply(tt, function(s) out_bassefeller(N, K, Z_exposure, tau_main = s, housestruct = housestruct, tau_equal = T)$Yobs) #should be the right way to construct Y?

Pvalbi=lapply(sim, powerbi)

return(unlist(Pvalbi)) }

nCores <- 6 # number of CPUs for parallelization cl<-makeCluster(nCores) registerDoParallel(cl) results.par.bi <- foreach(i=1:sim_randomizations,.packages="BicliqueRT", .combine=rbind) %dorng% {

results.par.bi <- test_bi(treateffect)

}

pp_bi = colMeans(results.par.bi<alpha)

stopCluster(cl)

visualization

pp=data.frame(Power=pp_bi, effect_size=treateffect)

library(ggplot2) graph<- ggplot(pp, aes(y=Power, x = effect_size) ) + geom_point()+ geom_line()+ labs(x = "Magnitude of the effect", y="Power")+ theme_minimal() + ggtitle(bquote(paste("Power analysis when number of households is ", .(K), " and number of people is ", .(N))))

print(graph)

dpuelz commented 1 year ago

Hey there, we're happy to hear you are using our package! I am having my colleague look into this. Here's the general pseudo-code for power calculations (for a fixed additive treatment effect tau):

  1. Decompose the NE graph into bicliques at the outset, and keep this fixed. Let's call it BDecom.
  2. Sample Z' from your set of randomizations, and conditional on Z' generate your Yobs based on tau.
  3. Run the test conditional on the biclique in BDecom that Z' lives in.
  4. Replicate steps 2-3 many times, and average the # of rejections to get the power for that fixed tau.

We'll let you know shortly on an answer, and in the meantime please confirm that you are doing this.

samzl1 commented 1 year ago

Thanks for the quick response. I feel the clique_test() is using the outset fixed biclique (which is bd in my code), rather than computing a new biclique each time. The procedure you described is what I'm doing right now.

dpuelz commented 1 year ago

Just realized that after I wrote the response.  Actually, I recall we updated the clique_test function like this precisely for these calculations and to separate out the decomposition from the test.  We’re looking into your specific question and will get back to you.

samzl1 commented 1 year ago

Hi David,

Any update on the power issue?

szhuang1 commented 1 year ago

Hi samzl1,

Thanks for your detailed investigation on the package. This issue, unfortunately, is again due to how we calculate the p-value (or we can say it's due to the sample size is not large enough). Currently the two-sided p-value is calculated by firstly calculating the one-sided version as

 M = sum(abs(tvals - tobs) < 1e-14)
 p_val = (sum(tvals > tobs) + runif(1) * (1 + M))/(length(tvals) + 1)

and return 2*min(p_val, 1-p_val) where tobs is the observed test statistic and tvals is a list of randomized test stat.

In the example when the magnitude of the effect is quite large, the observed tobs is quite large and larger than all tvals, which should be a clear signal to reject the null. But now the p_val becomes U/(n+1) where U~Unif(0,1) and n is the length of tvals, which in the example is about 30. Let's take it to be U/n. The decision is based on whether 2*min{U/n, 1-U/n} < alpha=0.05.

But here comes the problem. Taking n=30, Prob{2 min{U/n, 1-U/n} < alpha=0.05} = Prob{U < 0.025 n} = 0.75. That is, on average we can only reject 75% of time, which is why the power in your plot is capped at around 0.8. The issue here is that n is not large enough so that the randomness in the p-value makes Prob{U < 0.025*n} < 1.

This is in general a drawback for such randomized p-values. But in this example a simple remedy is to take the test stat to be

Tstat = function(y, z, is_focal) {
    exposures = rep(0, N)
    for (unit in 1:N) {
    exposures[unit] = exposure_i(z, unit)[1]
    }
    stopifnot("all focals have same exposures" = (length(unique(exposures[is_focal]))>1) )
    abs(mean(y[is_focal & (exposures == 1)]) - mean(y[is_focal & (exposures == 0)]))
}

that is, we look at the absolute differences in means. In this case a large Tstat gives evidence of rejection as in many papers, and now we can make decision based on the one-sided p-value p_val = (sum(tvals > tobs) + runif(1) * (1 + M))/(length(tvals) + 1). In the code it simply suffices to set testout = clique_test(Y, Z, Tstat, bd, two_sided=F). And this solves the problem: image

There's another way to solve it by using a non-random p-value. such as pval=2*min(p, 1-p) where p=mean(tvals > tobs). This amounts to modify the clique_test function as

clique_test1 = function (Y, Z, teststat, biclique_decom, alpha = 0.05, two_sided = T) 
{
  stopifnot(`teststat should be specified as a function` = is.function(teststat))
  method = biclique_decom$method
  MNE = biclique_decom$MNE
  Zobs_id = 1
  Zobs_where = lapply(MNE, function(b) {
    sum(colnames(b$assignments) == Zobs_id) > 0
  })
  Zobs_where = which(Zobs_where == TRUE)
  focal_clique = MNE[[Zobs_where]]
  focal_unit = focal_clique$units
  focal_Zm = focal_clique$assignments
  num_units = dim(focal_Zm)[1]
  focal_unit_indicator = 1:num_units %in% focal_unit
  test_stats = rep(0, dim(focal_Zm)[2])
  for (zid in 1:length(test_stats)) {
    Z_focal = focal_Zm[, zid]
    test_stats[zid] = teststat(Y, Z_focal, focal_unit_indicator)
  }
  tobs = test_stats[which(colnames(focal_Zm) == Zobs_id)]
  tvals = test_stats[which(colnames(focal_Zm) != Zobs_id)]
  if (anyNA(test_stats)) {
    warning("The test statistic or its randomization distribution contains NA. It may be due to poor\n            selection of focals. Consider rerun the biclique decomposition.")
    pval = 3
    retlist = list(p.value = pval, statistic = tobs, statistic.dist = tvals, 
                   method = method, conditional.clique = focal_clique)
    return(retlist)
  }
  if (sum(abs(tvals - tobs))/length(test_stats) < 1e-14) {
    warning("Randomization distribution is degenerate. p.value is not available. Consider changing the test statistic.\n            See ... for further discussions.")
    pval = 2
    retlist = list(p.value = pval, statistic = tobs, statistic.dist = tvals, 
                   method = method, conditional.clique = focal_clique)
    return(retlist)
  }
  stopifnot(`two_sided should be either TRUE or FALSE` = is.logical(two_sided))

  avg_great = mean(tvals > tobs)

  if (two_sided) {
    pval = 2*min(avg_great, 1-avg_great)
  }
  else {
    pval = avg_great
  }
  retlist = list(p.value = pval, statistic = tobs, statistic.dist = tvals, 
                 method = method, conditional.clique = focal_clique)
  return(retlist)
}

Now you can keep the original test stat (i.e., no abs on it) and use testout = clique_test1(Y, Z, Tstat, bd), which can still give the correct power as in the below. But as you see earlier, such a non-randomized p-value performs poorly in the size simulation. image

We suggest making Tstat such that a large Tstat indicates rejection, and using the one-sided p-value as a default choice.

samzl1 commented 1 year ago

Hi szhuang1,

This is very helpful; thank you for the detailed answer!