Trusted-AI / AIF360

A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
https://aif360.res.ibm.com/
Apache License 2.0

Using BinaryLabelDataset with pyspark dataframe #303

Closed by ilirosmanaj 2 years ago

ilirosmanaj commented 2 years ago

Hi,

I am already using BinaryLabelDataset to generate fairness metrics, and it works fine with average-size dataframes. Now, due to some preprocessing steps in one of my pipelines, I need much more memory and have to support large CSV files (e.g. 10GB+), so I switched to pyspark.

My question is: does BinaryLabelDataset also work with a pyspark dataframe, or do I need to convert the pyspark dataframe to a pandas dataframe (which basically loses the distributed property of pyspark and still risks running out of memory)?

Thanks in advance

nrkarthikeyan commented 2 years ago

BinaryLabelDataset does not work natively with pyspark dataframes. You have to convert the dataframe to pandas manually and then use it.
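
For reference, the conversion could look roughly like the sketch below. It assumes the label and protected attribute columns are already numeric with no missing values, and it selects only the columns the metrics need before calling `toPandas()`, because that call collects everything onto the driver. The helper names and the column names (`sex`, `label`) are made up for illustration and are not part of AIF360 or pyspark.

```python
def spark_to_pandas_for_aif360(spark_df, label_col, protected_cols, feature_cols=()):
    """Hypothetical helper: prune columns in Spark, then collect to pandas.

    Only the listed columns are brought to the driver, which keeps the
    pandas copy as small as possible. The result must still fit in memory.
    """
    # De-duplicate while preserving order: features, protected attributes, label.
    cols = list(dict.fromkeys([*feature_cols, *protected_cols, label_col]))
    return spark_df.select(*cols).toPandas()


def to_binary_label_dataset(pdf, label_col, protected_cols):
    """Hypothetical helper: wrap a numeric pandas DataFrame for AIF360.

    Assumes 1.0 is the favorable label and 0.0 the unfavorable one.
    """
    # Imported lazily so the Spark-side helper works without AIF360 installed.
    from aif360.datasets import BinaryLabelDataset

    return BinaryLabelDataset(
        favorable_label=1.0,
        unfavorable_label=0.0,
        df=pdf,
        label_names=[label_col],
        protected_attribute_names=list(protected_cols),
    )


# Example usage (assumes an existing pyspark DataFrame named `sdf`):
# pdf = spark_to_pandas_for_aif360(sdf, "label", ["sex"])
# dataset = to_binary_label_dataset(pdf, "label", ["sex"])
```

This does not remove the memory limit. It only reduces what gets collected, so a very large dataframe may still need to be sampled or aggregated in Spark first.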