Valentyn1997 / CausalTransformer

Code for the paper "Causal Transformer for Estimating Counterfactual Outcomes"

Questions about how to stably reproduce the reported results of CT(alpha=0) and CT on the tumor cancer simulation data #5

Closed (mengcz13 closed this 1 year ago)

mengcz13 commented 1 year ago

I find this paper an exciting piece of work on treatment effect estimation over time! However, I have encountered some difficulties while attempting to reproduce your reported results for CT(alpha=0) and CT in Figure 2 and Tables 9-11. I was wondering if you could kindly provide some guidance or clarification on the methodology or data used in the paper. I am interested in reproducing these results because they are critical to deciding whether the balanced representation is really necessary for CausalTransformer.

As the released code (https://github.com/Valentyn1997/CausalTransformer) did not provide instructions for running experiments with CT(alpha=0), I ran experiments using my own fork, which can be found at https://github.com/mengcz13/CausalTransformer. (You can confirm that my only commit was to add configuration files and scripts for running experiments with CT(alpha=0) and random trajectories.) However, I encountered some difficulties in reproducing the normalized RMSEs reported for CT(alpha=0) and CT with the tumor cancer simulation data (gamma=4) in a stable manner.

I obtained my results on two different servers, one equipped with a GTX 1080 Ti and the other with an RTX 2080 Ti, but with identical software setups on both (Python 3.9, PyTorch 1.12.1, and CUDA 11.3).
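To reduce run-to-run noise, I tried to pin down as many sources of randomness as I could. A minimal sketch of the kind of seeding and determinism settings I mean (these are standard PyTorch knobs, not code taken from this repository):

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed all common RNGs and force deterministic GPU kernels."""
    # Required for deterministic cuBLAS; must be set before CUDA is initialized
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN: disable autotuning and pick deterministic algorithms
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Raise an error on any remaining non-deterministic op
    torch.use_deterministic_algorithms(True)


seed_everything(10)
```

Even with all of this, identical results across different GPU models are not guaranteed, since floating-point kernels differ between architectures, which matches what I observed below.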

Single Sliding

The reported normalized RMSEs of CT(alpha=0) and CT with gamma=4 on the tumor cancer simulation in the single sliding setting are as follows:

| Step | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| CT(alpha=0) | 1.300 (0.220) | 1.00 (0.21) | 1.13 (0.28) | 1.21 (0.32) | 1.28 (0.34) | 1.32 (0.34) |
| CT | 1.316 (0.229) | 1.01 (0.23) | 1.12 (0.27) | 1.21 (0.30) | 1.26 (0.31) | 1.29 (0.29) |

I attempted to reproduce these values using the following commands:

```bash
# CT(alpha=0)
PYTHONPATH=. python3 runnables/train_multi.py -m +dataset=cancer_sim +backbone=ct '+backbone/ct_hparams/cancer_sim_alpha0="4"' exp.seed=10,101,1010,10101,101010
# CT
PYTHONPATH=. python3 runnables/train_multi.py -m +dataset=cancer_sim +backbone=ct '+backbone/ct_hparams/cancer_sim_domain_conf="4"' exp.seed=10,101,1010,10101,101010
```
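For context, my understanding of what alpha controls, paraphrased from the paper rather than taken from the repository's code: the encoder is trained on the factual outcome loss plus alpha times a domain confusion term that pushes the treatment classifier's predictions towards the uniform distribution, so alpha=0 simply switches the balancing off. A sketch (all names are made up for illustration):

```python
import torch.nn.functional as F


def encoder_loss(outcome_pred, outcome_true, treat_logits, alpha):
    """Sketch of the encoder objective as I understand it.

    outcome_loss: factual outcome regression (MSE).
    confusion:    cross-entropy of the treatment classifier's predictions
                  against the uniform distribution over treatments; with
                  alpha = 0 this term vanishes and no balancing is applied.
    """
    outcome_loss = F.mse_loss(outcome_pred, outcome_true)
    log_probs = F.log_softmax(treat_logits, dim=-1)
    confusion = -log_probs.mean()  # CE against uniform, up to a constant
    return outcome_loss + alpha * confusion
```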
The results obtained from my reproduction attempts are listed in the following table:

| Step | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| CT(alpha=0) | 1.351 (0.222) | 0.99 (0.15) | 1.10 (0.18) | 1.19 (0.20) | 1.27 (0.21) | 1.30 (0.20) |
| CT | 1.353 (0.244) | 1.02 (0.26) | 1.14 (0.32) | 1.23 (0.34) | 1.30 (0.36) | 1.34 (0.35) |
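Each entry here (and, presumably, in the paper's tables) is the mean of the normalized RMSE over the five seeds, with the standard deviation in parentheses. For illustration, with made-up per-seed values:

```python
import numpy as np

# Hypothetical per-seed normalized RMSEs for one prediction step
# (seeds 10, 101, 1010, 10101, 101010); real values come from the runs above
rmses = np.array([1.31, 1.42, 1.28, 1.36, 1.39])
print(f"{rmses.mean():.2f} ({rmses.std():.2f})")  # -> "1.35 (0.05)"
```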

Although the results I obtained were similar to the reported values, I noticed a discrepancy in the relationship between CT(alpha=0) and CT. Contrary to the findings in the paper, CT(alpha=0) outperformed CT in the counterfactual predictions for all steps. This suggests that the balanced representation may actually impair multi-step counterfactual prediction in the single sliding setting.

I also observed that the results were not very stable when I ran the same command on a different server. Interestingly, the results from this server aligned with the reported relationship between CT(alpha=0) and CT, whereas the results from my previous attempts did not.

| Step | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| CT(alpha=0) | 1.310 (0.211) | 0.98 (0.22) | 1.11 (0.27) | 1.19 (0.29) | 1.26 (0.32) | 1.31 (0.33) |
| CT | 1.323 (0.236) | 0.97 (0.18) | 1.08 (0.22) | 1.17 (0.25) | 1.25 (0.26) | 1.29 (0.26) |

Random Trajectories

I also noticed unstable results in the random trajectories setting. Below are the reported normalized RMSEs of CT(alpha=0) and CT on the tumor cancer simulation with gamma=4:

| Step | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| CT(alpha=0) | 1.300 (0.220) | 1.09 (0.28) | 1.16 (0.37) | 1.14 (0.39) | 1.08 (0.38) | 1.00 (0.36) |
| CT | 1.316 (0.229) | 1.06 (0.27) | 1.12 (0.32) | 1.07 (0.35) | 1.01 (0.34) | 0.93 (0.32) |

I used the following commands to reproduce the results in the random trajectories setting:

```bash
# CT(alpha=0)
PYTHONPATH=. python3 runnables/train_multi.py -m +dataset=cancer_sim +backbone=ct '+backbone/ct_hparams/cancer_sim_alpha0="4_rt"' exp.seed=10,101,1010,10101,101010
# CT
PYTHONPATH=. python3 runnables/train_multi.py -m +dataset=cancer_sim +backbone=ct '+backbone/ct_hparams/cancer_sim_domain_conf="4_rt"' exp.seed=10,101,1010,10101,101010
```

The results are listed in the following table:

| Step | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| CT(alpha=0) | 1.305 (0.234) | 1.11 (0.23) | 1.19 (0.30) | 1.17 (0.32) | 1.12 (0.34) | 1.03 (0.34) |
| CT | 1.320 (0.206) | 1.11 (0.22) | 1.18 (0.26) | 1.16 (0.27) | 1.09 (0.26) | 1.00 (0.24) |

Notably, the difference between CT(alpha=0) and CT in the 5- and 6-step predictions was only around 0.03, which is lower than the reported difference of 0.07. Moreover, this difference is closer to the difference reported for the single sliding setting than to the one reported for the random trajectories setting.

I should also note that when I ran the same commands on a different server, I obtained contradictory results again. This inconsistency further supports the notion that the results are not very stable.

| Step | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| CT(alpha=0) | 1.364 (0.237) | 1.14 (0.30) | 1.20 (0.35) | 1.16 (0.35) | 1.09 (0.34) | 0.99 (0.32) |
| CT | 1.375 (0.272) | 1.16 (0.28) | 1.23 (0.33) | 1.19 (0.33) | 1.11 (0.31) | 1.01 (0.28) |

Again, CT(alpha=0) performs better than CT in the counterfactual predictions at all steps, which contradicts the reported results.

Given the inconsistency I observed in my attempts to reproduce the reported results, I am wondering whether there is a way to stably demonstrate the benefit of the balanced representation from counterfactual domain confusion (i.e., the complete CT) over the variant without it (i.e., CT(alpha=0)). The difference I obtained between the two models was very subtle and prone to variance from sources such as hardware differences or non-deterministic GPU operations.
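One way I could imagine making the comparison more robust is to pair the runs by seed and test whether the per-seed difference between CT(alpha=0) and CT is consistently positive, for example (with hypothetical numbers):

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed normalized RMSEs at a fixed step,
# with both models run on the same seeds
ct_alpha0 = np.array([1.12, 1.05, 1.18, 1.09, 1.14])
ct_full = np.array([1.08, 1.03, 1.15, 1.06, 1.10])

# Paired t-test: does the full CT reduce the RMSE consistently across seeds?
t_stat, p_value = stats.ttest_rel(ct_alpha0, ct_full)
print(f"mean difference = {np.mean(ct_alpha0 - ct_full):.3f}, p = {p_value:.3f}")
```

With only five seeds such a test has little power, so more seeds would likely be needed; a Wilcoxon signed-rank test would be an alternative if normality is in doubt.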

In your opinion, is this to be expected, or are there any hints or insights you could offer on how to obtain stable results that demonstrate the advantage of having balanced representation in CausalTransformer? Thank you in advance for your help.

Valentyn1997 commented 1 year ago

Thank you for your interest in our work! It is amazing to hear that you tried to reproduce our results.

I think that such a discrepancy between reported and reproduced results is to be expected and comes from several sources:

  1. Pseudo-random number generation can differ across hardware, so you will almost never get exactly the same results.
  2. Also, pay attention to the confidence intervals. Some deviation is to be expected; in particular, the top-performing method can differ (e.g., see Figure 2).
  3. The TG simulator is a fairly simple benchmark, with no time-varying covariates and a one-dimensional outcome. It could happen that CT(alpha=0) already has good enough generalization performance, while CT performs worse because it contains extra parameters to estimate (a treatment classifier network). Incidentally, this is standard for ITE estimation: when the outcome prediction model is flexible enough and there is enough data, addressing time-varying confounding is not that important anymore (see https://proceedings.mlr.press/v80/alaa18a.html).
  4. Note that different trajectories have different outcome variability. CT could therefore specifically aim at counterfactual trajectories, whose variability is lower, and thus achieve a lower RMSE on them than on factual ones. This is actually a principal drawback of the current evaluation scheme.

In general, the full CT was more useful for high-dimensional benchmarks, e.g., the semi-synthetic one. I hope you will find this information helpful.