BiomedSciAI / causallib

A Python package for modular causal inference analysis and model evaluations
Apache License 2.0

ipw.estimate_population_outcome(Data_X, Data_a, Data_y) has some error. #60

Closed woonjeung closed 1 year ago

woonjeung commented 1 year ago

Hello. I want to check the causal effect.

My code is

from causallib.estimation import IPW
from sklearn.linear_model import LogisticRegression

learner = LogisticRegression(penalty="l1", C=1, max_iter=500, solver="liblinear")
ipw = IPW(learner)
ipw.fit(Data_X, Data_a)
potential_outcomes = ipw.estimate_population_outcome(Data_X, Data_a, Data_y)
causal_effect = ipw.estimate_effect(potential_outcomes[1], potential_outcomes[0])
causal_effect

Up to ipw.fit(Data_X, Data_a) there is no problem. But when I run potential_outcomes = ipw.estimate_population_outcome(Data_X, Data_a, Data_y) I get a warning:

/root/miniforge3/envs/python_3_9/lib/python3.9/site-packages/numpy/lib/function_base.py:550: RuntimeWarning: invalid value encountered in double_scalars
  avg = np.multiply(a, wgt,

I also checked ipw.compute_weights. Running w = ipw.compute_weights(Data_X, Data_a) raises no error, but all the values of w are NaN:

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
..

What is the problem?

ehudkr commented 1 year ago

Hi, thanks for raising this question. The code looks OK, so my guess is that there is some problem with the data and/or the modeling. The error you're getting is probably due to the IP-weights being NaNs. In turn, this can happen when you have very extreme propensity scores (close to 0 or 1), so when you take their inverse you essentially divide by zero, for which NumPy returns a NaN.
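As a minimal NumPy-only illustration of that failure mode (made-up numbers, not causallib internals): a propensity score of exactly 0 makes the inverse weight infinite, and the weighted average then becomes inf/inf = NaN, which is exactly the RuntimeWarning np.average emits:

```python
import numpy as np

# Hypothetical outcomes and propensity scores; the last unit has an
# extreme (zero) propensity score.
y = np.array([1.0, 2.0, 3.0, 4.0])
propensity = np.array([0.5, 0.4, 0.6, 0.0])

with np.errstate(divide="ignore", invalid="ignore"):
    weights = 1.0 / propensity            # 1/0 -> inf
    avg = np.average(y, weights=weights)  # sum(y*w)/sum(w) = inf/inf -> NaN

print(weights)  # last weight is inf
print(avg)      # nan
```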

So the first thing I would do is check whether your data is too separable, resulting in extreme probabilities of being treated. Note that if that's the case you might have a positivity/overlap violation, which might render the resulting associations non-causal. You can use causallib's evaluation tools for that:

from causallib.evaluation import evaluate
ipw_evaluation = evaluate(ipw, Data_X, Data_a, Data_y)
ipw_evaluation.plot_weight_distribution()
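If you want to see what separability does to the propensity model outside of causallib, here is a sketch with synthetic, perfectly separable data (all data here is made up for illustration): the same liblinear learner pushes the estimated treatment probabilities toward 0 and 1, so the inverse weights explode.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, perfectly separable data: treatment is a deterministic
# function of the single covariate, i.e. a positivity violation.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
a = (X[:, 0] > 0).astype(int)

learner = LogisticRegression(penalty="l1", C=1, max_iter=500, solver="liblinear")
propensity = learner.fit(X, a).predict_proba(X)[:, 1]

# Scores pile up near 0 and 1; the tail inverse-weights blow up.
print(propensity.min(), propensity.max())
print((1 / propensity).max())
```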

Alternatively, if you are confident in your data, you can probably solve the extreme weights by clipping/trimming the propensity scores. You can do that by instantiating, for example, IPW(learner, clip_min=0.001, clip_max=0.999). Note that if you also pass verbose=True, you'll get a printout of the proportion of observations that were clipped. Keep an eye on that: ideally, this number should be low. The larger it is, the more observations are affected by the caliper value rather than by their model-estimated probability, and if you clip too much of your dataset then your actual modeling (here, LASSO) has less of an impact.
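To see what the clipping buys you mechanically, a small NumPy sketch (illustrative numbers, not causallib's implementation): bounding the scores away from 0 and 1 caps the largest possible weight at 1/clip_min, so no weight can become infinite.

```python
import numpy as np

# Illustrative propensity scores, including two extreme ones.
propensity = np.array([0.5, 0.3, 1e-12, 0.9999])

clipped = np.clip(propensity, 0.001, 0.999)  # same bounds as suggested above
weights = 1.0 / clipped

print(weights)                         # largest weight is now ~1000, not 1e12
print((propensity != clipped).mean())  # fraction of clipped observations
```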

ehudkr commented 1 year ago

I'll take this nonresponsiveness as a good sign :-) I'm doing some housekeeping so I'm closing this issue for now. Feel free to re-open if needed.