akelleh / causality

Tools for causal analysis
MIT License
1.06k stars, 128 forks

categorical attributes #73

Closed wangqianwen0418 closed 2 years ago

wangqianwen0418 commented 5 years ago

In the given example, all the attributes are continuous numbers.
What should I do if I have categorical attributes that are expressed as strings?

morenoh149 commented 5 years ago

Have you tried one-hot encoding? https://devdocs.io/scikit_learn/modules/generated/sklearn.preprocessing.onehotencoder#sklearn.preprocessing.OneHotEncoder
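As a rough illustration of the suggestion, here is a minimal sketch that one-hot encodes a string column. It uses `pandas.get_dummies` rather than the scikit-learn encoder linked above, purely for brevity. The toy `gender` and `income` columns are made up for the example and are not from the package's documentation.

```python
import pandas as pd

# Toy data (hypothetical): one categorical string column, one continuous column.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "income": [41.0, 52.5, 38.2, 47.9],
})

# Replace the string column with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Note that the single `gender` column becomes one indicator column per category, so the encoded frame has more columns than the original.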

bhavyaghai commented 5 years ago

One-hot encoding will increase the number of columns in the dataset and hence the number of nodes in the causal DAG. For example, if there is a gender column in the dataset, one-hot encoding will create two new columns, male and female. In the causal DAG, we will then see two nodes instead of a single gender node. Do you have any answers for this issue?

akelleh commented 2 years ago

This is a tricky problem! In general, mixed data types are hard because conditional independence tests don't work well on them. The discrete tests handle discrete (categorical) variables out of the box when you specify data types correctly, but there isn't currently a test in the package which can handle mixed data types.

A common recommendation is to discretize the continuous variables, but note that this comes with information loss, and in general will result in the discretized variables failing to d-separate nodes they would otherwise separate!
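For concreteness, here is a minimal sketch of quantile-based discretization using only NumPy. The helper name `discretize` and the choice of four equal-frequency bins are assumptions made for illustration, not something the package prescribes.

```python
import numpy as np

def discretize(values, n_bins=4):
    """Map continuous values to integer bin labels 0..n_bins-1 using quantile edges."""
    values = np.asarray(values, dtype=float)
    # Interior quantile edges, e.g. the 25th, 50th and 75th percentiles for n_bins=4.
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, edges)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x_binned = discretize(x, n_bins=4)

print(np.bincount(x_binned))  # number of samples that fell into each bin
```

Every value inside a bin maps to the same label, so whatever variation `x` had within a bin is discarded. That is the information loss mentioned above.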

I've written an experimental test which bootstraps from the original data, discretizes, and derives a test statistic for the discretized variable. It works well in practice. I had it in the package at one point, but removed the implementation since it was too experimental. In particular, it depended on a KDE estimate to bootstrap from, which suffers from the curse of dimensionality in any reasonably-sized data set. Instead, I'll implement a version which can bootstrap by sampling the observed data (which will still have that problem, but should at least give better performance for larger data sets).
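To make the idea of resampling the observed data, discretizing, and computing a statistic more concrete, here is a rough sketch. It is not the removed implementation. The function names, the plug-in mutual information statistic, and the quantile binning are all assumptions chosen for illustration. It also only measures marginal dependence between one continuous and one discrete variable, not conditional independence.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in mutual information (in nats) between two integer-coded arrays."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)               # contingency table of counts
    joint /= joint.sum()                      # joint probabilities
    pa = joint.sum(axis=1, keepdims=True)     # marginal of a, shape (k, 1)
    pb = joint.sum(axis=0, keepdims=True)     # marginal of b, shape (1, m)
    nz = joint > 0                            # skip empty cells to avoid log(0)
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def bootstrap_statistic(x_cont, y_disc, n_bins=4, n_boot=200, seed=0):
    """Bootstrap distribution of the statistic (hypothetical helper, not a package API)."""
    rng = np.random.default_rng(seed)
    n = len(x_cont)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample observed rows with replacement
        xb, yb = x_cont[idx], y_disc[idx]
        # Discretize the resampled continuous variable at its own quantiles.
        edges = np.quantile(xb, np.linspace(0, 1, n_bins + 1)[1:-1])
        stats[i] = mutual_information(np.digitize(xb, edges), yb)
    return stats

# Synthetic check: y is binary, x_dep depends on y, x_ind does not.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
x_dep = y + rng.normal(scale=0.5, size=500)
x_ind = rng.normal(size=500)

dep = bootstrap_statistic(x_dep, y)
ind = bootstrap_statistic(x_ind, y)
print(dep.mean(), ind.mean())
```

Resampling rows directly avoids fitting a KDE, but the bootstrap can only reproduce combinations of values that were actually observed, so sparse high-dimensional data remains a problem.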