A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
In the internal valid_ifthens function, feature names are hard-coded in 2 places, and the code fails when the data does not match those assumptions. Specifically:
First, there is a reference to an age column of X, which is also assumed to be of pd.Interval dtype. If this column does not exist or is not of Interval dtype, this part of the code throws an error.
Example to reproduce:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from aif360.sklearn.datasets.openml_datasets import fetch_german
from aif360.sklearn.detectors.facts import FACTS
X, y = fetch_german()
assert (X.index == y.index).all()
X.reset_index(drop=True, inplace=True)
y = y.reset_index(drop=True).map({"bad": 0, "good": 1})
# split into train-test data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, stratify=y)
categorical_features = X.select_dtypes(include=["object", "category"]).columns.to_list()
categorical_features_onehot_transformer = ColumnTransformer(
    transformers=[
        ("one-hot-encoder", OneHotEncoder(), categorical_features)
    ],
    remainder="passthrough"
)
model = Pipeline([
    ("one-hot-encoder", categorical_features_onehot_transformer),
    ("clf", LogisticRegression(max_iter=1500))
])
# train the model
model = model.fit(X_train, y_train)
detector = FACTS(
    clf=model,
    prot_attr="sex",
    feature_weights={f: 1 for f in X.columns},
    feats_not_allowed_to_change=[]
)
detector = detector.fit(X_test)
The last command fails with AttributeError: 'numpy.float64' object has no attribute 'left'
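The mechanism behind this error can be shown in isolation, without aif360. The following is only a minimal sketch: the series names and bin edges are made up for illustration, and it does not reproduce the library's internal code. It shows that a numeric age value has no left attribute, while a value binned with pd.cut is a pd.Interval and does.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a numeric "age" column, as in the German credit data.
numeric_age = pd.Series([23.0, 41.0, 67.0], name="age")

# The same column binned into intervals (bin edges are arbitrary here).
binned_age = pd.cut(numeric_age, bins=[0, 30, 50, 100])

# Elements of the binned column are pd.Interval objects with .left and .right.
first_bin = binned_age.iloc[0]
left_edge = first_bin.left

# Elements of the numeric column are numpy.float64 and have no .left attribute.
try:
    numeric_age.iloc[0].left
except AttributeError as exc:
    error_message = str(exc)

print(type(first_bin).__name__, left_edge)
print(error_message)
```

So any code path that unconditionally reads .left from the age column will raise on data where age has not been binned beforehand.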
Second, at another point in valid_ifthens, the recIsValid function is used. This function also references hard-coded feature names. It does check whether these features exist, so the code does not fail when they are absent. However, when such a feature does exist, it is sometimes assumed to be of a certain type or to carry certain semantics.
I do not currently have a reproducible example for this one, because whether the problem appears depends on the exact test data. I believe, however, that this is clearly also a bug: if we want to enforce constraints like the ones this part of the code is trying to enforce, it should be done in some other, more robust way.
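As one possible direction, constraints could be passed in by the caller instead of being tied to fixed feature names. The sketch below is purely hypothetical: Constraint, lower_bound, non_decreasing and change_is_valid are names invented here, not part of aif360, and the dictionaries stand in for whatever representation the library uses for the "if" and "then" parts of a rule.

```python
from typing import Any, Callable, Mapping

import pandas as pd

# Hypothetical type: a constraint receives the "if" value and the "then" value
# of a single feature and returns True when the change is acceptable.
Constraint = Callable[[Any, Any], bool]


def lower_bound(value: Any) -> Any:
    """Return the left edge of a pd.Interval, or the value itself otherwise."""
    return value.left if isinstance(value, pd.Interval) else value


def non_decreasing(before: Any, after: Any) -> bool:
    """Accept a change only if the feature does not decrease."""
    return lower_bound(after) >= lower_bound(before)


def change_is_valid(
    if_clause: Mapping[str, Any],
    then_clause: Mapping[str, Any],
    constraints: Mapping[str, Constraint],
) -> bool:
    """Check user-supplied constraints, skipping features absent from the rule."""
    for feature, constraint in constraints.items():
        if feature in if_clause and feature in then_clause:
            if not constraint(if_clause[feature], then_clause[feature]):
                return False
    return True


# The caller decides which features are constrained and how.
constraints = {"age": non_decreasing}

numeric_ok = change_is_valid({"age": 25.0}, {"age": 30.0}, constraints)
numeric_bad = change_is_valid({"age": 30.0}, {"age": 25.0}, constraints)
interval_ok = change_is_valid(
    {"age": pd.Interval(20, 30)}, {"age": pd.Interval(30, 40)}, constraints
)
unconstrained = change_is_valid({"sex": "male"}, {"sex": "female"}, constraints)
```

With this shape, a dataset without an age column, or with a numeric one, would not need special handling inside the library, and the type check lives in a single helper instead of being implied by a feature name.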