Use case for categorical datasets

BiomedSciAI / causallib

A Python package for modular causal inference analysis and model evaluations

Apache License 2.0


Use case for categorical datasets #55

Closed jgdpsingh closed 1 year ago

jgdpsingh commented 1 year ago

I am trying to use the library for a survey dataset where no entry is numerical and all the responses are categorical in nature. On using Causal Inference 360's evaluation plots, the results were not very encouraging, i.e. wide chasms between weighted and unweighted variables in propensity plots.

Also, the Boolean Rules via Column Generation (BRCG) method didn't return any rule, presumably because no entry was numerical. The result was this:

Learning DNF rule with complexity parameters lambda0=0.001, lambda1=0.001

Initial LP solved Iteration: 1, Objective: 0.2203 Accuracy: 0.7797356828193832 AUC: 0.5 ['']

Can this library be used to find out causal relationships between categorical variables? If yes, can you share any notebook or example for the same?

ehudkr commented 1 year ago

Hi, I'm sorry to hear you encountered some setbacks during your analysis. I'll need some more details to be able to be helpful. Do you have a sample of the data I could use to reproduce the problem (or can you synthesize a minimal example causing problems)? How was the propensity model defined? How do the evaluation plots look exactly?

On the face of it, there shouldn't be any constraints for using categorical variables. [Except if you have >=3 treatment levels, in which case most evaluations are not well-defined, but estimation should still be valid.]

As for the BRCG method, it is not part of causallib, so I can't speak for it. However, as far as I know, this method first binarizes continuous features, so having no continuous features in the first place shouldn't be an obstacle.
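To illustrate the point that categorical covariates are not an obstacle in themselves, here is a minimal sketch of fitting a propensity model on an all-categorical survey. It is not taken from the thread or from causallib's examples: the column names, the synthetic data, and the choice of one-hot encoding with pandas and scikit-learn's `LogisticRegression` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical survey in which every answer is categorical (made-up columns).
survey = pd.DataFrame({
    "age_group": rng.choice(["18-29", "30-49", "50+"], size=n),
    "region": rng.choice(["north", "south", "east", "west"], size=n),
    "employed": rng.choice(["yes", "no"], size=n),
})

# Hypothetical binary treatment whose probability depends on one answer.
p_treat = np.where(survey["employed"] == "yes", 0.7, 0.3)
a = pd.Series(rng.binomial(1, p_treat), name="treated")

# One-hot encode the categories; drop_first avoids redundant dummy columns.
X = pd.get_dummies(survey, drop_first=True, dtype=float)

# Fit a logistic-regression propensity model on the encoded covariates.
model = LogisticRegression(max_iter=1000)
model.fit(X, a)
propensity = model.predict_proba(X)[:, 1]  # estimated P(treated = 1 | X)
```

The encoded matrix `X` is fully numeric, so it can be passed to any estimator that expects a numeric design matrix.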

jgdpsingh commented 1 year ago

sample data.csv

I have attached the results and the sample data. I used a logistic regression classifier on the training set, then identified the treatment variable and applied the same code as in your biomed example.

Thanks for your help!

ehudkr commented 1 year ago

Hey @jgdpsingh , these plots actually look fairly good! 🙂 What do you think is the problem here?

jgdpsingh commented 1 year ago

Actually, in the Bank Marketing example, it was suggested that the mean differences in the covariate balance Love plot should be minimised. In my categorical dataset, the mean differences seemed a bit large. So I applied the BRCG method along the same lines as in the example to find rules that would minimise those differences, but none were displayed. So I just wanted to see if the library actually works for survey datasets.

But going by your feedback, I guess it does work well enough for surveys. Will test it further. Thanks a lot!

ehudkr commented 1 year ago

I apologize for the confusion in the Bank Marketing example. I will revisit it to see if I can revise the wording to make it better.

For the sake of completeness, the Love plot shows the absolute standardized mean difference (ASMD) between the treatment groups for each covariate, both unweighted and inverse-probability weighted. Large ASMD values can hint that the treatment groups are different and their covariate distributions are imbalanced. Large unweighted ASMD values are expected in non-randomized settings, because individuals self-select into treatment groups. However, if the weighting process is successful, the weighted ASMD values should decrease, ideally to as close to zero as possible.
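The comparison described above can be sketched with plain NumPy. This is an illustrative computation, not causallib's implementation: the `asmd` helper is hypothetical, the data are synthetic, the true propensity is assumed known, and standardizing by the pooled unweighted standard deviation is only one common convention.

```python
import numpy as np

def asmd(x, a, w=None):
    """Absolute standardized mean difference of x between a == 1 and a == 0.

    Hypothetical helper: group means are (optionally) weighted, and the
    difference is scaled by the pooled unweighted standard deviation.
    """
    w = np.ones(len(x)) if w is None else np.asarray(w, dtype=float)
    m1 = np.average(x[a == 1], weights=w[a == 1])
    m0 = np.average(x[a == 0], weights=w[a == 0])
    s = np.sqrt((x[a == 1].var(ddof=1) + x[a == 0].var(ddof=1)) / 2)
    return abs(m1 - m0) / s

rng = np.random.default_rng(0)
n = 20000

# A binary (dummy-coded categorical) covariate that drives self-selection.
x = rng.binomial(1, 0.5, size=n).astype(float)
e = np.where(x == 1, 0.8, 0.2)   # assumed-known propensity P(a = 1 | x)
a = rng.binomial(1, e)

# Inverse probability weights: 1/e for treated, 1/(1 - e) for controls.
w = np.where(a == 1, 1 / e, 1 / (1 - e))

unweighted = asmd(x, a)      # imbalance before weighting
weighted = asmd(x, a, w)     # imbalance after weighting
print(f"unweighted ASMD: {unweighted:.3f}, weighted ASMD: {weighted:.3f}")
```

In this synthetic setting the unweighted value is large because treatment assignment depends strongly on `x`, while the weighted value should fall close to zero because the weights come from the true propensity.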

Thanks again for bringing this up, and good luck!