BiomedSciAI / causallib

A Python package for modular causal inference analysis and model evaluations
Apache License 2.0
728 stars 97 forks

Dependency housekeeping: sklearn>=1.2, numpy>2, pandas #72

Closed ehudkr closed 3 months ago

ehudkr commented 3 months ago

Adjust causallib and its tests for scikit-learn>=1.2.0, and do additional housekeeping to keep up with the ever-evolving changes in causallib's dependencies.

Starting with version 1.2.0, scikit-learn requires named inputs (i.e., dataframes) to have all their column names be of a single type. Namely, you cannot fit an sklearn model on a dataframe that has both an integer column name (e.g., 0) and a string column name (e.g., "x1"). This caused causallib to crash depending on the provided input (especially the inputs used in its tests). The reason is that causallib often joins the covariates X with the treatment a (Standardization) and with some propensity-based features (PropensityFeatureStandardization). If the name given to a and/or the propensity feature is, for example, a string while X's columns are integers, sklearn would crash. Note that 1) non-string column names in dataframes seem to be an anti-pattern, and if the user provided X with string columns and a with a string name, no error should have arisen, and 2) the proposed solution will tend to stringify columns/names.

The proposed solution is mostly a "safe join" in which X's columns and a's name are evaluated for their types. If X's columns are of mixed types, they are converted to strings. If the types of X.columns and a.name do not agree, the preference is to convert them all to strings. After this name sanitization, X and a are concatenated.
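
The logic above could be sketched roughly as follows. This is a simplified illustration and not causallib's actual implementation: the function name `safe_join` and its signature are assumptions, and it assumes `a` is a named pandas Series.

```python
import pandas as pd


def safe_join(X: pd.DataFrame, a: pd.Series) -> pd.DataFrame:
    """Hypothetical sketch: sanitize column-name types, then concatenate."""
    X = X.copy()  # avoid mutating the caller's dataframe
    column_types = {type(c) for c in X.columns}

    # Mixed column-name types within X: convert all of them to strings.
    if len(column_types) > 1:
        X.columns = [str(c) for c in X.columns]
        column_types = {str}

    # X's column names and a's name disagree: prefer strings for both.
    if type(a.name) not in column_types:
        X.columns = [str(c) for c in X.columns]
        a = a.rename(str(a.name))

    return pd.concat([X, a], axis=1)


X = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0]})  # integer column names
a = pd.Series([0, 1], name="a")                   # string name
joined = safe_join(X, a)
print(list(joined.columns))
```

When the names already agree (for example, integer columns in X and an integer name for a), the sketch leaves them untouched and only concatenates.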

In addition to the scikit-learn>=1.2 adjustments, this PR also includes changes that adapt the package to numpy>=2 and pandas>=2.1.