onlyonewater opened this issue 1 year ago
Hi authors, great work! Now I want to apply the NCM to high-dimensional variables. My command is `python -m src.experiment.experiment1 Experiment1 -G all -t 1 -d 20 --n-epochs 3000 -r 4 --gpu 1`. So here I set the parameter `--dim` to 20 (more than 1) and the parameter `--n-samples` to 10000, but I found that each variable still has size 10000x1. If I understand right, each variable should now have size 10000x20, so which step did I get wrong? Thanks!!!!
Hi, we appreciate the interest in the work! In the code published here, which we used for the experiments of our paper, the `dim` flag applies to all variables except for X and Y. So, in the backdoor graph, for example, it would only apply to Z. Some of the graphs, such as the bow graph, do not have another variable, so the `dim` flag has no effect.
X and Y (which may represent, for example, treatment and outcome) do not change in dimensionality because it would otherwise be unclear how to evaluate the model. For the evaluation of the model, we checked queries such as the average treatment effect (ATE), which can be computed as P(Y = 1 | do(X = 1)) - P(Y = 1 | do(X = 0)). However, if X and Y were multidimensional variables, it would be unclear how to compute this query. In principle, there is nothing wrong with using the NCM for higher-dimensional treatment and outcome, but a different query would have to be evaluated.
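For concreteness, here is a minimal sketch of how an ATE query like this could be estimated from a trained model by Monte Carlo sampling; the `sample_interventional` interface below is hypothetical and not the API of this repository:

```python
def estimate_ate(model, n_samples=10_000):
    # Hypothetical interface: sample_interventional(n, do={...}) returns n
    # Monte Carlo samples of the binary outcome Y under an intervention on X.
    y_do1 = model.sample_interventional(n_samples, do={"X": 1})
    y_do0 = model.sample_interventional(n_samples, do={"X": 0})
    # ATE = P(Y = 1 | do(X = 1)) - P(Y = 1 | do(X = 0)) for binary Y.
    return y_do1.mean() - y_do0.mean()
```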
Moreover, you may find that the maximum likelihood approach implemented here is slow for higher-dimensional cases. We recommend looking at the GAN-NCM implemented here for a faster approach.
Thanks for your detailed answer, @Springo, I got it. Now I have another question. You say: "So, in the backdoor graph for example, it would only apply to Z". Why could the variable Z have a higher dimension than the variables X and Y?
Also, I printed the size of variable Z in the backdoor graph, and the size is still 10000x1, not 10000x20, when I use the command: `python -m src.experiment.experiment1 Experiment1 -G backdoor -t 1 -d 20 --n-epochs 3000 -r 4 --gpu 1`.
> Thanks for your detailed answer, @Springo, I got it. Now I have another question. You say: "So, in the backdoor graph for example, it would only apply to Z". Why could the variable Z have a higher dimension than the variables X and Y?
To give an example of this, imagine you are evaluating the average treatment effect of a drug (X) on a disease (Y). In this case, perhaps X and Y are binary (e.g., you either take the drug or you don't). However, you may have 20 possible covariates that are confounding factors (i.e., Z is 20-dimensional). One of the dimensions might be gender, another might be smoking history, etc. Still, you may want to compute the ATE in this case. If you were to evaluate it analytically, you could calculate P(Y = y | do(X = x)) = \sum_{z} P(Y = y | X = x, Z = z) P(Z = z), which is the well-known adjustment expression for the backdoor graph. With a higher dimensionality of Z, this sum can still be computed, although the number of terms in the sum grows exponentially with respect to the dimensionality of Z.
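To make the cost concrete, a brute-force evaluation of that sum when X, Y, and each component of Z are binary could look like the sketch below (an illustration only, not code from this repository); the loop runs over all 2^d configurations of Z, which is what becomes expensive as d grows:

```python
from itertools import product

def backdoor_effect(p_y1_given_xz, p_z, d, x):
    """P(Y = 1 | do(X = x)) via the backdoor adjustment.

    p_y1_given_xz(x, z): P(Y = 1 | X = x, Z = z), supplied by the user
    p_z(z):              P(Z = z), supplied by the user
    d:                   number of binary covariates in Z
    """
    total = 0.0
    for z in product([0, 1], repeat=d):  # 2**d terms -- exponential in d
        total += p_y1_given_xz(x, z) * p_z(z)
    return total

# ATE = P(Y = 1 | do(X = 1)) - P(Y = 1 | do(X = 0))
# ate = backdoor_effect(p, pz, d=20, x=1) - backdoor_effect(p, pz, d=20, x=0)
```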
> Also, I printed the size of variable Z in the backdoor graph, and the size is still 10000x1, not 10000x20, when I use the command: `python -m src.experiment.experiment1 Experiment1 -G backdoor -t 1 -d 20 --n-epochs 3000 -r 4 --gpu 1`.
Ah, sorry, I see now that increasing dimensionality is not supported by this version of the code. The higher-dimensional data-generating features were on an experimental branch that may have conflicted with existing code. Our team will look into what to do about this. In the meantime, you can look into the code I mentioned earlier. It is a more up-to-date version of our codebase and includes all of the functionality in this repository. You can run the same experiment (i.e., identification in 20 dimensions using the MLE-NCM) with the following command:
```
python -m src.main Experiment1 mle --full-batch --h-size 64 --id-query ATE -r 4 --max-query-iters 3000 --mc-sample-size 10000 -G backdoor -t 1 -n 10000 -d 20 --gpu 1
```
Oh, thanks for your detailed response!!! Now I got it, thanks again!!!
Hi authors, when I use the code in https://github.com/CausalAILab/NCMCounterfactuals/tree/main and run the command you mentioned above, `python -m src.main Experiment1 mle --full-batch --h-size 64 --id-query ATE -r 4 --max-query-iters 3000 --mc-sample-size 10000 -G backdoor -t 1 -n 10000 -d 20 --gpu 1`, the program seems to generate the data and never stops. Is there something wrong?
Hi @onlyonewater, yes, this behavior is normal. For 20 dimensions, the data generator is very slow and will take a lot of time to finish.
TL;DR: The data generator we use is slow because we wanted an unbiased data-generating environment. You could try 16 dimensions instead, or, if you don't care about the unbiased data generator, you can use the flag `--gen xor`, and data generation should be nearly instant for however many dimensions you want.
Long answer: The data generator we use in this first repository is known as the canonical type model, first formalized in this paper. The idea is that, when the data generator parameters are randomized, the behavior of the functions of each variable could be anything, so the experiments work in a completely unbiased environment as opposed to cherry-picking the data-generating model. The issue is that it works by randomly choosing a function out of all possible functions, which means the time complexity grows doubly exponentially w.r.t. the dimensionality.
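The doubly exponential growth is easy to see by counting: a binary function of d binary arguments is a truth table with 2^d entries, so there are 2^(2^d) such functions for the model to choose among. A quick illustration:

```python
# Number of distinct binary functions of d binary arguments is 2**(2**d),
# since each of the 2**d truth-table entries can independently be 0 or 1.
for d in range(1, 6):
    table_size = 2 ** d
    num_functions = 2 ** table_size
    print(f"d = {d}: truth table has {table_size} entries, "
          f"{num_functions} possible functions")
```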
On the other hand, the data generator we use in the repository in https://github.com/CausalAILab/NCMCounterfactuals/tree/main is an updated version of the canonical type model described in this paper. The newer model randomizes some aspects of the older model, reducing the runtime to singly exponential w.r.t. dimensionality instead of doubly exponential. Still, 20 dimensions is a lot, and each additional dimension exponentially increases the runtime of the data generator, which is why it is so slow.
The data generator works relatively quickly for 16 dimensions or fewer, so you could try that. On the other hand, if the bias of the data-generating environment is something you don't care about (e.g., you just want to see the performance of the model in high dimensions), you can try using a different data generator, for example by using the flag `--gen xor`. This changes the data generator to a parametric model that uses XOR functions instead of the canonical type model, which randomly chooses functions. Data generation should be near-instant in this case.
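As a rough illustration of why such a parametric generator is cheap (this is only a sketch of the idea, not the repository's actual `xor` generator), each variable can be produced by XOR-ing its parents with some exogenous noise, so sampling is linear in the dimensionality rather than exponential:

```python
import numpy as np

rng = np.random.default_rng(0)

def xor_mechanism(parents, flip_prob=0.1):
    # Toy XOR-style structural function: the output is the parity (XOR) of
    # the parent bits, flipped with probability flip_prob by exogenous noise.
    noise = rng.random(parents.shape[0]) < flip_prob
    return (parents.sum(axis=1) % 2) ^ noise.astype(int)

# Example: 10,000 samples of a 20-dimensional binary covariate Z,
# then X generated from Z -- no exponential enumeration is needed.
Z = rng.integers(0, 2, size=(10_000, 20))
X = xor_mechanism(Z)
```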
Oh, thanks for your detailed answer, it is very helpful to me. I will take some time to understand what you mean because I am new to this topic. Thanks again!!!!