BiomedSciAI / causallib

A Python package for modular causal inference analysis and model evaluations
Apache License 2.0

ipw.estimate_population_outcome(Data_X, Data_a, Data_y) has some error. #60

Closed woonjeung closed 10 months ago

woonjeung commented 1 year ago

Hello. I want to check the causal effect.

My code is

from causallib.estimation import IPW
from sklearn.linear_model import LogisticRegression

learner = LogisticRegression(penalty="l1", C= 1, max_iter=500, solver='liblinear')
ipw = IPW(learner)
ipw.fit(Data_X, Data_a)
potential_outcomes = ipw.estimate_population_outcome(Data_X, Data_a, Data_y) 
causal_effect = ipw.estimate_effect(potential_outcomes[1], potential_outcomes[0])
causal_effect

Up to ipw.fit(Data_X, Data_a) there is no problem, but when I call potential_outcomes = ipw.estimate_population_outcome(Data_X, Data_a, Data_y) I get a warning:

/root/miniforge3/envs/python_3_9/lib/python3.9/site-packages/numpy/lib/function_base.py:550: RuntimeWarning: invalid value encountered in double_scalars
  avg = np.multiply(a, wgt,

I also checked ipw.compute_weights. Running w = ipw.compute_weights(Data_X, Data_a) raises no error, but the values of w are all NaN:

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
..

What is the problem?

ehudkr commented 1 year ago

Hi, thanks for raising this question. The code looks OK, so my guess is that there is some problem with the data and/or modeling. The error you're getting is probably due to the IP-weights being NaNs. In turn, this can happen when you have very extreme propensity scores (close to 0 or 1), so when you take their inverse you essentially divide by zero, for which NumPy returns a NaN.
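The divide-by-zero mechanism above can be reproduced with plain NumPy, independent of causallib (the propensity values here are made up for illustration):

```python
import numpy as np

# Hypothetical propensity scores; one is exactly 0 (e.g., perfect separation).
p = np.array([0.5, 0.9, 0.0, 0.3])

# Inverse-propensity weights: 1/0 -> inf, which then propagates
# NaNs through weighted averages like np.average.
w = 1.0 / p
print(np.isinf(w).any())  # True

# Clipping the scores away from 0 and 1 keeps every weight finite:
p_clipped = np.clip(p, 0.001, 0.999)
w_clipped = 1.0 / p_clipped
print(np.isfinite(w_clipped).all())  # True
```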

So the first thing I would do is check whether your data is too separable, resulting in extreme predicted probabilities of being treated. Note that if that's the case you might have a positivity/overlap violation, which might render your resulting associations non-causal. You can use causallib's evaluation tools for that:

from causallib.evaluation import evaluate
ipw_evaluation = evaluate(ipw, Data_X, Data_a, Data_y)
ipw_evaluation.plot_weight_distribution()

Alternatively, if you are confident in your data, you can probably solve the extreme weights by clipping/trimming the propensity scores. You can do that by adding, for example, IPW(learner, clip_min=0.001, clip_max=0.999). Note that if you also add verbose=True, you'll get a printout of the proportion of observations that were clipped. Keep an eye on that: ideally this number should be low. The larger it is, the more observations are governed by the caliper value rather than by their model-estimated probability, and if you clip too much of your dataset then your actual model (here, LASSO) has less of an impact.
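To make the "too separable" scenario concrete, here is a minimal sketch using only scikit-learn and synthetic data (the dataset and thresholds are illustrative, not your actual data): a covariate that nearly determines treatment pushes the fitted propensity scores toward 0 and 1, which is exactly when clipping kicks in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy covariate that almost perfectly determines treatment
# (a positivity/overlap violation): controls cluster near -3, treated near +3.
X = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 200)]).reshape(-1, 1)
a = np.repeat([0, 1], 200)

# Fitted propensity scores end up extremely close to 0 or 1.
ps = LogisticRegression().fit(X, a).predict_proba(X)[:, 1]
print("min propensity:", ps.min(), "max propensity:", ps.max())

# Fraction of observations that a clip_min/clip_max of 0.001/0.999 would affect:
clipped_frac = np.mean((ps < 0.001) | (ps > 0.999))
print(f"fraction outside [0.001, 0.999]: {clipped_frac:.0%}")
```

If that fraction is large, clipping is papering over an overlap problem rather than fixing a modeling glitch.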

ehudkr commented 10 months ago

I'll take this nonresponsiveness as a good sign :-) I'm doing some housekeeping so I'm closing this issue for now. Feel free to re-open if needed.