Questions about how to stably reproducing reported results of CT(alpha=0) and CT with Tumor Cancer Simulation data

I found this paper an exciting work in treatment effect estimation over time! But I have encountered some difficulties while attempting to reproduce your reported results of CT(alpha=0) and CT in Figure 2 and Table 9-11. I was wondering if you could kindly provide some guidance or clarification on the methodology or data used in the paper. I am interested in reproducing them since these results are critical to deciding whether the balanced representation is really necessary for CausalTransformer.

As the released code (https://github.com/Valentyn1997/CausalTransformer) did not provide instructions for running experiments with CT(alpha=0), I ran experiments using my own fork, which can be found at https://github.com/mengcz13/CausalTransformer. (You can confirm that my only commit was to add configuration files and scripts for running experiments with CT(alpha=0) and random trajectories.) However, I encountered some difficulties in reproducing the normalized RMSEs reported for CT(alpha=0) and CT with the tumor cancer simulation data (gamma=4) in a stable manner.

I obtained my results on two different servers, one equipped with GTX 1080 Ti and the other with RTX 2080 Ti. However, both servers had identical software setups (Python 3.9, PyTorch 1.12.1, and CUDA 11.3).

Single Sliding

The reported normalized RMSEs of CT(alpha=0) and CT with gamma=4 on the tumor cancer simulation, using a single sliding setting, are as follows:

Step	1	2	3	4	5	6
CT(alpha=0)	1.300(0.220)	1.00(0.21)	1.13(0.28)	1.21(0.32)	1.28(0.34)	1.32(0.34)
CT	1.316(0.229)	1.01(0.23)	1.12(0.27)	1.21(0.30)	1.26(0.31)	1.29(0.29)

I attempted to reproduce these values using the following commands:

# CT(alpha=0)
PYTHONPATH=. python3 runnables/train_multi.py -m +dataset=cancer_sim +backbone=ct '+backbone/ct_hparams/cancer_sim_alpha0="4"' exp.seed=10,101,1010,10101,101010
# CT
PYTHONPATH=. python3 runnables/train_multi.py -m +dataset=cancer_sim +backbone=ct '+backbone/ct_hparams/cancer_sim_domain_conf="4"' exp.seed=10,101,1010,10101,101010

The results obtained from my reproduction attempts are listed in the following table:	Step	1	2	3	4	5	6
CT(alpha=0)	1.351(0.222)	0.99(0.15)	1.10(0.18)	1.19(0.20)	1.27(0.21)	1.30(0.20)
CT	1.353(0.244)	1.02(0.26)	1.14(0.32)	1.23(0.34)	1.30(0.36)	1.34(0.35)

Although the results I obtained were similar to the reported values, I noticed a discrepancy in the relationship between CT(alpha=0) and CT. Contrary to the findings in the paper, CT(alpha=0) outperformed CT in the counterfactual predictions for all steps. This suggests that the balanced representation may actually impair multi-step counterfactual prediction in the single sliding setting.

I also observed that the results were not very stable when I ran the same command on a different server. Interestingly, the results from this server aligned with the reported relationship between CT(alpha=0) and CT, whereas the results from my previous attempts did not.	Step	1	2	3	4	5	6
CT(alpha=0)	1.310(0.211)	0.98(0.22)	1.11(0.27)	1.19(0.29)	1.26(0.32)	1.31(0.33)
CT	1.323(0.236)	0.97(0.18)	1.08(0.22)	1.17(0.25)	1.25(0.26)	1.29(0.26)

Random Trajectories

I also noticed unstable behavior of the results when using the random trajectories setting. Below is a table of the reported normalized RMSEs of CT(alpha=0) and CT on the tumor cancer simulation with gamma=4:

Step	1	2	3	4	5	6
CT(alpha=0)	1.300(0.220)	1.09(0.28)	1.16(0.37)	1.14(0.39)	1.08(0.38)	1.00(0.36)
CT	1.316(0.229)	1.06(0.27)	1.12(0.32)	1.07(0.35)	1.01(0.34)	0.93(0.32)

These are the results I obtained when I used the following commands to reproduce the results with the random trajectories setting:

# CT(alpha=0)
PYTHONPATH=. python3 runnables/train_multi.py -m +dataset=cancer_sim +backbone=ct '+backbone/ct_hparams/cancer_sim_alpha0="4_rt"' exp.seed=10,101,1010,10101,101010
# CT
PYTHONPATH=. python3 runnables/train_multi.py -m +dataset=cancer_sim +backbone=ct '+backbone/ct_hparams/cancer_sim_domain_conf="4_rt"' exp.seed=10,101,1010,10101,101010 (for CT)

And the results are listed in the following table:

Step	1	2	3	4	5	6
CT(alpha=0)	1.305(0.234)	1.11(0.23)	1.19(0.30)	1.17(0.32)	1.12(0.34)	1.03(0.34)
CT	1.320(0.206)	1.11(0.22)	1.18(0.26)	1.16(0.27)	1.09(0.26)	1.00(0.24)

Notably, the difference between CT(alpha=0) and CT in the 5/6-step prediction was only around 0.03, which is lower than the reported difference of 0.07. Moreover, this difference is closer to the reported difference in the single sliding setting, rather than the random trajectories setting.

I should also note that when I ran the same commands on a different server, I obtained contradictory results again. This inconsistency further supports the notion that the results are not very stable.

Step	1	2	3	4	5	6
CT(alpha=0)	1.364(0.237)	1.14(0.30)	1.20(0.35)	1.16(0.35)	1.09(0.34)	0.99(0.32)
CT	1.375(0.272)	1.16(0.28)	1.23(0.33)	1.19(0.33)	1.11(0.31)	1.01(0.28)

Again, the CT(alpha=0) performs better than CT in the counterfactual predictions of all steps, which is contradictory to the reported results.

Given the inconsistency I observed in my attempts to reproduce the reported results, I am wondering if there is a way to stably demonstrate the benefit of balanced representation from counterfactual domain confusion (i.e., the complete CT) over the variant without it (i.e., the CT(alpha=0)). The difference I obtained between the two models was very subtle and prone to variance from sources such as hardware or non-deterministic operations in the GPU.

In your opinion, is this to be expected, or are there any hints or insights you could offer on how to obtain stable results that demonstrate the advantage of having balanced representation in CausalTransformer? Thank you in advance for your help.

Valentyn1997 / CausalTransformer

Questions about how to stably reproducing reported results of CT(alpha=0) and CT with Tumor Cancer Simulation data #5

Single Sliding

Random Trajectories