Trusted-AI / AIF360

A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
https://aif360.res.ibm.com/
Apache License 2.0
2.46k stars 840 forks

Design and implement a mechanism to download datasets from Kaggle #341

Open anupamamurthi opened 2 years ago

anupamamurthi commented 2 years ago

Category

Datasets

Why is this issue important

AIF360 currently has no easy way to download datasets from sources like Kaggle. Users must download the data manually before running experiments.

How to go about this issue

To understand how datasets are used, we would recommend poking around https://github.com/Trusted-AI/AIF360/tree/master/aif360/data/raw and modules in https://github.com/Trusted-AI/AIF360/tree/master/aif360/datasets

The datasets are used in notebooks as seen here: https://github.com/Trusted-AI/AIF360/blob/master/examples/demo_meta_classifier.ipynb (This is one example but feel free to poke around various examples)

Once you have a sense of what is going on with the data and how the datasets are used, it will be easy to proceed with the implementation proposed below.

Implementation / proposal

    import abc
    import sys

    # Pick an abstract base class that works on both old and new Python versions.
    if sys.version_info >= (3, 4):
        ABC = abc.ABC
    else:
        ABC = abc.ABCMeta(str('ABC'), (), {})

    from datasets_lib.utils import get_logger
    logging = get_logger(__name__)


    class Store(ABC):
        """Base class that every dataset store must implement."""

        @abc.abstractmethod
        def __init__(self, **kwargs):
            pass

        @abc.abstractmethod
        def validate_store(self, **kwargs):
            pass

        @abc.abstractmethod
        def download(self, **kwargs):
            pass

        @abc.abstractmethod
        def upload(self, **kwargs):
            pass


- [ ] Implement a KaggleStore that overrides the methods of the base class

Overall, the idea is to have the ability to download data from Kaggle using this helper.

There is no need to stick to the above definition. 
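As a rough illustration of the proposal, here is a minimal sketch of what a KaggleStore could look like. It is not a settled design. The `client` parameter, the credential checks in `validate_store`, and the choice to leave `upload` unimplemented are all assumptions made for this example. The only Kaggle calls it relies on are `KaggleApi.authenticate` and `KaggleApi.dataset_download_files` from the official `kaggle` package, which is imported lazily so the module still loads when the package or an API key is missing.

```python
import abc
import os


class Store(abc.ABC):
    """Minimal stand-in for the base class proposed in this issue."""

    @abc.abstractmethod
    def __init__(self, **kwargs):
        pass

    @abc.abstractmethod
    def validate_store(self, **kwargs):
        pass

    @abc.abstractmethod
    def download(self, **kwargs):
        pass

    @abc.abstractmethod
    def upload(self, **kwargs):
        pass


class KaggleStore(Store):
    """Hypothetical store that fetches datasets through the Kaggle API."""

    def __init__(self, client=None, **kwargs):
        # Assumption: a client can be injected, which keeps tests offline.
        self._client = client

    def _get_client(self):
        if self._client is None:
            # Lazy import: only needed when a real download is requested.
            from kaggle.api.kaggle_api_extended import KaggleApi

            client = KaggleApi()
            client.authenticate()
            self._client = client
        return self._client

    def validate_store(self, **kwargs):
        """Return True if Kaggle credentials appear to be configured.

        Assumption: credentials live in KAGGLE_USERNAME / KAGGLE_KEY or in
        a kaggle.json file under KAGGLE_CONFIG_DIR (default ~/.kaggle).
        """
        has_env = bool(
            os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY")
        )
        config_dir = os.environ.get(
            "KAGGLE_CONFIG_DIR", os.path.join(os.path.expanduser("~"), ".kaggle")
        )
        return has_env or os.path.isfile(os.path.join(config_dir, "kaggle.json"))

    def download(self, dataset, path, unzip=True, **kwargs):
        """Download the dataset given as 'owner/name' into the directory `path`."""
        os.makedirs(path, exist_ok=True)
        self._get_client().dataset_download_files(dataset, path=path, unzip=unzip)
        return path

    def upload(self, **kwargs):
        # Placeholder: the issue does not specify what upload should do.
        raise NotImplementedError("upload is not defined for KaggleStore yet")
```

With credentials in place, usage might look like `KaggleStore().download("owner/dataset-name", "aif360/data/raw/example")`, where both the dataset slug and the target directory are placeholders.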

How to test?

Adding a good unit test will certainly help test the above logic/code.
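One way to unit-test the download path without network access or an API key is to pass in a mock client and assert on the calls it receives. The sketch below is hypothetical: `download_dataset` stands in for whatever `KaggleStore.download` ends up being, and the dataset slug is a placeholder. The only Kaggle method assumed is `dataset_download_files` from the official `kaggle` package.

```python
import os
import tempfile
import unittest
from unittest import mock


def download_dataset(client, dataset, path):
    """Hypothetical stand-in for KaggleStore.download."""
    os.makedirs(path, exist_ok=True)
    client.dataset_download_files(dataset, path=path, unzip=True)
    return path


class DownloadDatasetTest(unittest.TestCase):
    def test_creates_target_dir_and_calls_client(self):
        client = mock.Mock()
        with tempfile.TemporaryDirectory() as tmp:
            target = os.path.join(tmp, "raw", "example")
            result = download_dataset(client, "owner/dataset-name", target)
            self.assertEqual(result, target)
            self.assertTrue(os.path.isdir(target))
        # The client must be asked exactly once, with the expected arguments.
        client.dataset_download_files.assert_called_once_with(
            "owner/dataset-name", path=target, unzip=True
        )

    def test_propagates_client_errors(self):
        client = mock.Mock()
        client.dataset_download_files.side_effect = RuntimeError("forbidden")
        with tempfile.TemporaryDirectory() as tmp:
            with self.assertRaises(RuntimeError):
                download_dataset(client, "owner/dataset-name", tmp)
```

Saved as a test module, this can be run with `python -m unittest`.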

If all goes well, it would be nice to go over the available datasets to see whether we can download them directly using KaggleStore instead of hardcoding the location of the data as seen here: https://github.com/Trusted-AI/AIF360/tree/1de824717be15e2a0ebabe9bd8a718787196af73/aif360/datasets

hoffmansc commented 2 years ago

Can you explain these methods a little more? What does upload do?

hoffmansc commented 2 years ago

We can also look into: https://www.kaggle.com/docs/api

Looks like you need an API key, though