cdt15 / lingam

Python package for causal discovery based on LiNGAM.
https://sites.google.com/view/sshimizu06/lingam
MIT License

Questions about usage of DirectLiNGAM #78

Closed soya-beancurd closed 1 year ago

soya-beancurd commented 1 year ago

Hello, I’d first like to thank you for this incredible package (along with the interesting papers on LiNGAM you’ve published)!

I’m currently trying to employ this package in my Causal Inference pipeline (causal discovery portion).

More specifically, I am currently using DirectLiNGAM with a prior knowledge matrix (specifically to enforce an edge from the treatment to the outcome variable, and no other outgoing edges from either the treatment or the outcome variable). BottomUpParceLiNGAM would have been the ideal model, but it doesn't scale and immediately runs out of memory.
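For reference, a prior knowledge matrix like the one described can be sketched in plain NumPy, assuming lingam's convention that `prior_knowledge[i, j] = 1` means there is a directed path from `x_j` to `x_i`, `0` means there is none, and `-1` means unknown (the helper name and the `t_idx`/`y_idx` arguments are illustrative, not part of the package):

```python
import numpy as np

def make_treatment_outcome_prior(n_vars, t_idx, y_idx):
    """Prior knowledge matrix in lingam's convention:
    pk[i, j] = 1  -> x_j has a directed path to x_i
    pk[i, j] = 0  -> x_j has no directed path to x_i
    pk[i, j] = -1 -> unknown
    """
    pk = -np.ones((n_vars, n_vars), dtype=int)
    pk[:, t_idx] = 0          # no outgoing edges from the treatment...
    pk[:, y_idx] = 0          # ...or from the outcome
    pk[y_idx, t_idx] = 1      # except the enforced edge T -> Y
    np.fill_diagonal(pk, -1)  # diagonal entries are left as unknown
    return pk
```

The resulting matrix would then be passed as the `prior_knowledge` argument when fitting DirectLiNGAM.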

After running a couple of experiments with DirectLiNGAM, I have 3 questions I’d like to clarify with you if possible:

  1. (Extension of Issue 5) As my dataset has a good mix of continuous and discrete data, are there currently any models in the package that can deal with both continuous and discrete (i.e., binary and encoded categorical) variables, while allowing for prior knowledge matrices?
  2. Are there any scalable alternatives for validating the results and assumptions of LiNGAM models in general, aside from get_error_independence_p_values and bootstrap? The former (specifically during hsic_test_gamma) causes out-of-memory issues for even the smallest dataset (e.g., 250k x 155), while the latter takes too long (i.e., the default fit of DirectLiNGAM with the above-mentioned prior knowledge matrix takes between 20 hours and 5 days on the datasets I currently have).
  3. It seems that regardless of the dataset, DirectLiNGAM consistently outputs a causal graph (adjacency matrix) in which all the non-outcome and non-treatment features are identified as confounders (via the backdoor criterion from dowhy). This differs from ICALiNGAM, which does not necessarily produce results with such a trend. Could this be due to the way DirectLiNGAM identifies the causal order (and therefore the adjacency matrix)?

Thank you very much!

sshimizu2006 commented 1 year ago
  1. I don't think an implementation that satisfies both points is available, even in other packages. One possibility is to modify implementations based on conditional independence, like PC and FCI, by changing the conditional independence test depending on the types of the variables tested (https://link.springer.com/article/10.1007/s41060-018-0097-y). Maybe https://www.jstatsoft.org/article/view/v080i07 might be helpful. But in your case, it seems the treatment and outcome cannot cause the other continuous variables, so your way of using prior knowledge to analyze the mixed data should be ok.

  2. Your dataset has many rows. HSIC seems to get much slower for larger sample sizes. There might be other faster statistical independence tests (e.g., https://arxiv.org/abs/1804.02747), but I haven't used them.

  3. Well, ICA-LiNGAM does not allow prior knowledge to be used. This might have made the difference. If you want more sparseness, this issue might be helpful: https://github.com/cdt15/lingam/issues/68

sshimizu2006 commented 1 year ago

p.s. It would be better to use a different method, e.g., something implemented in dowhy, to compute the causal effects of your continuous variables on the binary treatment based on an estimated causal graph, rather than using the output of DirectLiNGAM. DirectLiNGAM assumes all the variables are continuous when it computes causal effects.

soya-beancurd commented 1 year ago

Thanks for your reply!

With regards to your first reply,

  1. Thanks! I'll have a look at the MXM package and check if there's a Python equivalent (as I can currently only use Python).
  2. I'll read up on the FCIT algorithm from the paper you suggested and try it in place of hsic_test_gamma for independence tests (probably using the fcit Python package from the author of the FCIT paper). Out of curiosity, are there any plans to incorporate other forms of independence tests, such as FCIT, into the lingam package?
  3. Increasing sparsity (i.e., gamma = 2 in adaptive lasso) unfortunately does not seem to alter the above-mentioned trend in the results of DirectLiNGAM.

With regards to your second reply, I am currently only using the output of DirectLiNGAM as our estimated causal graph.

For instance, the adjacency matrix from DirectLiNGAM is converted to a NetworkX DiGraph before being passed directly to DoWhy, which interprets every non-zero entry as an edge and every zero entry as no edge. The backdoor criterion in dowhy is then applied to identify confounders within this graph. So, in a way, I am only relying on the pattern of zero and non-zero values in DirectLiNGAM's adjacency matrix (the magnitude and sign of these values do not matter, I guess), and not on the causal effects that the LiNGAM class provides.
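For what it's worth, that conversion step can be sketched in a few lines, assuming lingam's convention that `adjacency_matrix_[i, j] != 0` means an edge from `x_j` to `x_i` (the helper name and `threshold` knob are illustrative, not part of either library):

```python
import numpy as np

def adjacency_to_edges(B, labels, threshold=0.0):
    # lingam convention: B[i, j] != 0 means an edge x_j -> x_i.
    # Only the zero/non-zero pattern matters here; magnitude and sign are dropped.
    return [(labels[j], labels[i])
            for i in range(B.shape[0])
            for j in range(B.shape[1])
            if abs(B[i, j]) > threshold]

B = np.array([[0.0, 0.0, 0.0],
              [1.2, 0.0, 0.0],
              [0.0, -0.5, 0.0]])
edges = adjacency_to_edges(B, ["T", "Y", "X1"])
```

The resulting edge list can then be handed to `networkx.DiGraph(edges)` and on to DoWhy.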

Were your concerns about the fact that the values in the adjacency matrix (not just the zero/non-zero pattern) are still employed during independence testing (i.e., hsic_test_gamma or even FCIT)?

Thanks once again for the speedy and helpful replies!

sshimizu2006 commented 1 year ago

Ok, then the directed edges from the continuous variables to the binary treatment might not be properly pruned. DirectLiNGAM uses sparse linear regression to prune directed edges, assuming all the variables are continuous. Other methods, such as sparse logistic regression with the continuous variables as explanatory variables and the treatment as the response variable, would be better for estimating the existence of directed edges from those continuous variables to the binary treatment (and outcome), though DirectLiNGAM can still estimate the causal structure among the continuous variables.

soya-beancurd commented 1 year ago

Based on your recommendation, would the scenario below work?

  1. Remove all binary variables, except the treatment (T) and outcome (Y)
  2. Using the same prior knowledge matrix (T —> Y, and no other outgoing edge from T & Y), run DirectLiNGAM
  3. Specifically, for a continuous variable xi, adaptive lasso regression (i.e., predict_adaptive_lasso) is used. We do not worry about the contribution/coefficient (β) of T and Y here, as βT and βY will be 0 according to the prior matrix
  4. When it's time to do the regression for T and Y based on some causal order identified at the start, we instead apply some version of sparse, regularized (L1-based) logistic regression, where, for example, if we're regressing Y on the set of variables that have an earlier causal order (including T):

log(p / (1 - p)) = β1·X1 + β2·X2 + βT·T + ...,   where p = P(Y = 1 | X1, X2, T, ...)

Is it therefore safe to assume the β obtained from the sparse logistic regression above can be used as the values for the adjacency matrix?
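As a rough sketch of step 4 above, a plain L1-penalized logistic regression (scikit-learn here, as one possible implementation; the variable names and synthetic data are purely illustrative) might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X1 = rng.normal(size=n)                    # continuous variables with an earlier causal order
X2 = rng.normal(size=n)
T = (rng.normal(size=n) > 0).astype(int)   # binary treatment (illustrative)

# Synthetic Y in which X1 and T matter but X2 does not.
logits = 2.0 * X1 + 1.5 * T
Y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

Z = np.column_stack([X1, X2, T])
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(Z, Y)
beta = clf.coef_.ravel()   # candidate entries for Y's row of the adjacency matrix
```

Only the zero/non-zero pattern of `beta` would be carried over into Y's row of the adjacency matrix.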

sshimizu2006 commented 1 year ago

My suggestion would be something like

  1. Remove all binary variables, apart from the treatment (T) and outcome (Y).

  2. Run DirectLiNGAM on all the continuous variables to get a causal graph of the continuous variables.

  3. Fit adaptive logistic regression (Section 4.1 of the original adaptive lasso paper: http://users.stat.umn.edu/~zouxx019/Papers/adalasso.pdf) with all the continuous variables as the features and the binary treatment as the target. Do the same for the binary outcome. Draw directed edges from the continuous variables to the treatment and outcome based on the sparsity patterns of the adaptive logistic regression coefficients.

  4. Give the causal graph of the continuous variables, binary treatment, and outcome to DoWhy.
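The adaptive logistic regression step can be sketched with the usual feature-rescaling trick for adaptive lasso (scikit-learn here; the ridge-penalized pilot fit, the `gamma` exponent, and the function name are assumptions of this sketch, not something lingam provides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adaptive_logistic_lasso(X, y, gamma=1.0, C=1.0):
    # Step 1: pilot (ridge-penalized) logistic fit to get initial coefficients.
    pilot = LogisticRegression(penalty="l2", C=10.0, max_iter=1000).fit(X, y)
    beta0 = pilot.coef_.ravel()
    # Adaptive weights: features with small pilot coefficients get penalized more.
    w = np.abs(beta0) ** gamma + 1e-8
    # Step 2: L1 logistic fit on rescaled features, which is equivalent to a
    # weighted L1 penalty on the original features.
    lasso = LogisticRegression(penalty="l1", solver="liblinear",
                               C=C, max_iter=1000).fit(X * w, y)
    # Rescale the coefficients back to the original feature scale.
    return lasso.coef_.ravel() * w
```

Directed edges from the continuous variables to T (and to Y) would then be drawn wherever the returned coefficient is non-zero.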


soya-beancurd commented 1 year ago

Hi Dr. Shimizu, the plan you've suggested seems to be going well so far, and there aren't any downstream issues (e.g., with confounder identification or causal estimation) as of now! Thanks so much for your assistance and quick replies!!

I've also tried to replace HSIC with unconditional FCIT (fcit package), which does not seem to have caused any OOM issue thus far!

However, I'd still like to clarify some doubts about your implementation of HSIC:

Thank you!

sshimizu2006 commented 1 year ago

Hi,

DirectLiNGAM tries to find a DAG that minimizes the dependence between error terms. It does not use HSIC to prune edges; rather, HSIC is used to check whether the error terms of the estimated DAG are independent. This is to detect possible violations of the independence assumption.
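To illustrate what that check is doing, here is a minimal biased HSIC estimate in plain NumPy (a sketch, not the package's hsic_test_gamma; a fixed bandwidth `sigma` is assumed instead of the usual median heuristic, and no p-value is computed):

```python
import numpy as np

def hsic_biased(x, y, sigma=1.0):
    # Biased HSIC estimate trace(K H L H) / n^2 with Gaussian kernels.
    # Larger values indicate stronger dependence between x and y.
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2.0 * sigma ** 2))
    K, L = gram(x), gram(y)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2
```

In DirectLiNGAM's check, `x` and `y` would be pairs of regression residuals (error terms); small HSIC values across all pairs are consistent with the model's independence assumption.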

soya-beancurd commented 1 year ago

Oh I see, thanks for the clarification! Do you then think it's possible to use such independence tests (HSIC or FCIT) to prune edges derived from the adjacency matrix of DirectLiNGAM as described in my question above?

sshimizu2006 commented 1 year ago

Yeah, that could be an alternative way for pruning edges.