Using BinaryLabelDataset with pyspark dataframe

Trusted-AI / AIF360

A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.

Apache License 2.0

Hi,

I am already using BinaryLabelDataset to generate fairness metrics, and it works fine with average-sized dataframes. Now, because of some preprocessing steps in one of my pipelines, I need much more memory and have to support large CSV files (e.g. 10GB+), so I switched to pyspark.

My question is: does BinaryLabelDataset also work with a pyspark dataframe, or do I need to convert the pyspark dataframe to a pandas dataframe (basically losing the distributed property of pyspark and still risking running out of memory)?

Thanks in advance
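
For context, a rough sketch of one way around the conversion: as far as I can tell, BinaryLabelDataset is built from a pandas dataframe (its `df` argument), so a pyspark dataframe would have to be collected first. The basic group fairness metrics only need per-group label counts, though, so the heavy aggregation could stay in Spark and only a handful of rows would reach the driver. The helper `group_fairness_from_counts`, the dataframe name `sdf`, and the column names `sex` and `label` are all hypothetical, and this does not go through AIF360 at all; it just recomputes statistical parity difference and disparate impact by hand.

```python
from collections import defaultdict


def group_fairness_from_counts(rows, privileged_value, favorable_label):
    """Compute two group fairness metrics from aggregated counts.

    rows: iterable of (protected_value, label, count) tuples, e.g. the
    collected result of a Spark groupBy(...).count() (hypothetical usage).
    """
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for protected_value, label, count in rows:
        group = "privileged" if protected_value == privileged_value else "unprivileged"
        totals[group] += count
        if label == favorable_label:
            favorable[group] += count

    if totals["privileged"] == 0 or totals["unprivileged"] == 0:
        raise ValueError("both the privileged and unprivileged groups must be non-empty")

    rate_priv = favorable["privileged"] / totals["privileged"]
    rate_unpriv = favorable["unprivileged"] / totals["unprivileged"]
    return {
        # P(favorable | unprivileged) - P(favorable | privileged)
        "statistical_parity_difference": rate_unpriv - rate_priv,
        # P(favorable | unprivileged) / P(favorable | privileged)
        "disparate_impact": rate_unpriv / rate_priv if rate_priv else float("nan"),
    }


# With pyspark, the rows could come from something like (assumed names):
#   rows = [(r["sex"], r["label"], r["count"])
#           for r in sdf.groupBy("sex", "label").count().collect()]
# Here a small hand-written table stands in for that result.
example_rows = [(1, 1, 60), (1, 0, 40), (0, 1, 30), (0, 0, 70)]
metrics = group_fairness_from_counts(example_rows, privileged_value=1, favorable_label=1)
print(metrics)
```

If the full BinaryLabelDataset object is really needed (for example for the mitigation algorithms), another option under the same assumptions would be to select only the label and protected attribute columns in Spark before calling `toPandas()`, which keeps the collected frame much smaller than the original CSV.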

Using BinaryLabelDataset with pyspark dataframe #303