Trusted-AI / AIF360

A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
https://aif360.res.ibm.com/
Apache License 2.0
2.42k stars 827 forks

Port pre-processing algorithms to sklearn-compatible API #154

Open hoffmansc opened 4 years ago

hoffmansc commented 4 years ago
InterferencePattern commented 4 years ago

Will this make it easier/unnecessary to convert back and forth between Pandas DataFrames and Binary Label Datasets, for example? I've been having issues with Reweighing, as AIF360 tends to only work with numerical data but does not provide instructions for dummification while storing the metadata mappings to de-dummify later on.

hoffmansc commented 4 years ago

Yes, this will allow DataFrames to be used directly with the algorithms. Reweighing is already implemented so you can try it out if you're comfortable using the master branch from GitHub. It should be released in the latest stable version soon as well.

Do you mind explaining exactly what issues you were facing? Was it with convert_to_dataframe()?

InterferencePattern commented 4 years ago

Hi @hoffmansc, yes, I've been having trouble understanding how to use convert_to_dataframe() after creating my own BinaryLabelDataset. Perhaps it's my own fault, but I can't find the documentation that describes how to dummify the data in a way that retains the mappings so it can be reversed after using a PreProcessing tool such as Reweighing.

hoffmansc commented 4 years ago

@jimbudarz, if you encode your categorical data with pd.get_dummies(), or use StandardDataset, you will end up with feature_names that look like, e.g., [..., native-country=United-States, native-country=Vietnam, native-country=Yugoslavia, ...]. Then, if you do convert_to_dataframe(de_dummy_code=True), you will get a DataFrame that looks something like:

  ... native-country
0 ... United-States
1 ... United-States
2 ... Vietnam
...

with the columns magically mapped from one-hot to categories.
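To make the naming convention concrete, here is a minimal pure-Python sketch of the idea behind de-dummy coding. It is an illustration only, not AIF360's implementation: the helper `de_dummy` and the sample rows are hypothetical. It assumes one-hot columns are named `<feature><sep><category>` and collapses them back into one column per feature.

```python
def de_dummy(rows, sep="="):
    """Collapse one-hot columns named '<feature><sep><category>' into one column per feature.

    rows: list of dicts mapping column name -> value (1/0 for one-hot columns).
    Hypothetical helper for illustration; not part of AIF360 or pandas.
    """
    decoded = []
    for row in rows:
        out = {}
        for name, value in row.items():
            if sep in name:
                # Split only on the first separator so categories may contain it.
                feature, category = name.split(sep, 1)
                if value == 1:
                    out[feature] = category
            else:
                out[name] = value  # not a dummy column: copy through unchanged
        decoded.append(out)
    return decoded


rows = [
    {"age": 39, "native-country=United-States": 1, "native-country=Vietnam": 0},
    {"age": 50, "native-country=United-States": 1, "native-country=Vietnam": 0},
    {"age": 38, "native-country=United-States": 0, "native-country=Vietnam": 1},
]
decoded = de_dummy(rows)

# Columns that use a different separator are not recognized as dummies.
untouched = de_dummy([{"native-country_Vietnam": 1}])
```

Note that a column such as `native-country_Vietnam` passes through unchanged, because the sketch only recognizes names containing the chosen separator.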

You can also include maps for the labels and protected attributes manually (since these are encoded differently) by supplying them when creating the BinaryLabelDataset (note: protected_attribute_maps should be in the same order as protected_attribute_names):

metadata = {
    'label_maps': [{1.0: '>50K', 0.0: '<=50K'}],
    'protected_attribute_maps': [{1.0: 'White', 0.0: 'Non-white'},
                                 {1.0: 'Male', 0.0: 'Female'}]
}
BinaryLabelDataset(..., metadata=metadata)

otherwise they will just be 0/1 which is probably also fine.
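As a rough illustration of what such maps are for, the sketch below applies two of the dictionaries from the metadata above to encoded values by hand. The sample values and variable names are made up for the example, and this is plain Python rather than anything AIF360 runs internally.

```python
# Maps copied from the metadata example; the sample values are hypothetical.
label_map = {1.0: '>50K', 0.0: '<=50K'}
sex_map = {1.0: 'Male', 0.0: 'Female'}

labels = [1.0, 0.0, 0.0]  # encoded labels
sex = [1.0, 1.0, 0.0]     # encoded protected attribute

# Translate each encoded value back to its human-readable category.
readable = [(sex_map[s], label_map[y]) for s, y in zip(sex, labels)]
```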

InterferencePattern commented 4 years ago

Thanks for the help. This led me to a resolution: pandas' get_dummies() uses prefix_sep="_" by default, while convert_to_dataframe() uses sep="=" by default.
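A short sketch of the separator mismatch, assuming pandas is installed. The sample DataFrame is made up for illustration. Passing `prefix_sep="="` to `pd.get_dummies()` produces column names in the `feature=category` form described earlier in the thread.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "native-country": ["United-States", "United-States", "Vietnam"],
})

# Default separator is "_", which does not match the "=" convention.
default_cols = list(pd.get_dummies(df, columns=["native-country"]).columns)

# Overriding prefix_sep yields names like "native-country=Vietnam".
aif360_style = pd.get_dummies(df, columns=["native-country"], prefix_sep="=")
aif360_cols = list(aif360_style.columns)

print(default_cols)
print(aif360_cols)
```

Alternatively, the default pandas names could be kept and the matching separator passed to `convert_to_dataframe()` through its `sep` argument.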

It might be helpful to explain what the sep attribute does in the https://aif360.readthedocs.io/en/latest/modules/datasets.html documentation.

hoffmansc commented 4 years ago

> It might be helpful to explain what the sep attribute does in the https://aif360.readthedocs.io/en/latest/modules/datasets.html documentation.

That's a good point. Would you be willing to write a quick PR to that effect?

InterferencePattern commented 4 years ago

I've gladly submitted a PR.

It looks like reversing dummy-encoding could soon become a part of pandas itself, which AIF360 may be able to leverage for scikit-learn compatibility: https://github.com/pandas-dev/pandas/pull/31795

razvanh commented 3 years ago

Is convert_to_dataframe() supposed to return the original DataFrame? I am using Reweighing and get back a BinaryLabelDataset, which I would like to convert back to a DataFrame (with the weights applied).

theBull commented 2 years ago

convert_to_dataframe() seems to return a tuple for me, which doesn't seem right. The docs for the aif360 Adult dataset state that this method should:

Convert the StructuredDataset to a pandas.DataFrame.

However, it doesn't appear to do so.

from aif360.datasets import AdultDataset
ad = AdultDataset(
    protected_attribute_names=['sex'],
    privileged_classes=[['Male']],
    categorical_features=[],
    features_to_keep=['age', 'education-num']
)
df = ad.convert_to_dataframe()
print(type(df))
# <class 'tuple'>

theBull commented 2 years ago

Ah. convert_to_dataframe() returns two values (a tuple), as such:

dataframe, dictionary = dataset.convert_to_dataframe()

print(type(dataframe))
print(type(dictionary))
#<class 'pandas.core.frame.DataFrame'>
#<class 'dict'>

Silly mistake. I forgot that when a function returns multiple values in Python, they come back as a single tuple. I'm still new to the language. HTH.
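The same behavior can be reproduced without AIF360. In the sketch below, `convert` is a hypothetical stand-in that returns two values, the way convert_to_dataframe() returns a DataFrame and a dictionary. Assigning the result to one name gives a tuple, while unpacking gives the two objects separately.

```python
def convert(values):
    """Hypothetical stand-in for a method that returns two values."""
    frame = list(values)
    attributes = {"n_rows": len(frame)}
    return frame, attributes  # Python packs these into one tuple


# Assigning to a single name keeps the whole tuple.
result = convert([1, 2, 3])

# Unpacking into two names separates the values.
frame, attributes = convert([1, 2, 3])

print(type(result))
print(type(frame), type(attributes))
```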