Libr-AI / fairlib

A framework for assessing and improving classification fairness.
https://hanxudong.github.io/fairlib/
Apache License 2.0

Issues reproducing for Bios #26

Open AntoineGourru opened 1 year ago

AntoineGourru commented 1 year ago

Dear Xudong,

First, many thanks for your work; it is of great value for people working in fair classification. Kudos!

Second, I have some issues reproducing the results for the Bios dataset.

I used your code to download and preprocess the data: datasets.prepare_dataset("bios", "data/bios")

After that, in src/dataloaders/loaders/Bios.py, I had to comment out:

if self.args.protected_task in ["economy", "both"] and self.args.full_label:

if self.args.protected_task in ["gender", "economy", "both", "intersection"] and self.args.full_label:

Otherwise it couldn't build the dataloader (because the data built with prepare_dataset does not contain economy_label).

Finally, I ran this code:

##############
args = {
    "dataset": "Bios_gender",
    "emb_size": 768,
    "num_classes": 28,
    "batch_size": 16,
    "data_dir": "data/bios",
    "device_id": 0,
    "exp_id": "fcl",
}

debias_options = fairlib.BaseOptions()
debias_state = debias_options.get_state(args=args, silence=True)

fairlib.utils.seed_everything(2022)

debias_model = fairlib.networks.get_main_model(debias_state)

debias_model.train_self()
##############

Everything runs well, except that the model gets random results and the loss does not improve over the epochs. Do you have a clue about what is happening?

For Moji, it works perfectly.

Best regards, and thank you again for your work,

Antoine

HanXudong commented 1 year ago

Hi Antoine,

Thanks for reaching out!

Regarding the Bios dataset, the augmented Bios dataset with economy labels was recently released, and I will revise the preprocessing script to include it soon.

For the Bios experiments, I noticed that the batch size is set to 16 ("batch_size": 16), which might be too small given the default learning rate ("lr": 0.003). Could you please test with larger batch sizes or smaller learning rates? Hopefully this will help. Otherwise, feel free to share your code; I am more than happy to help!
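As a hedged sketch of that suggestion, the two settings could be adjusted in the same args dict shown earlier in the thread. The "lr" key is an assumption here (it matches the default "lr": 0.003 mentioned above), and the specific values 64 and 3e-4 are illustrative, not the author's recommendation:

```python
# Sketch: same setup as before, with a larger batch size and a smaller
# learning rate. "lr" is assumed to be the name of the learning-rate option.
args = {
    "dataset": "Bios_gender",
    "emb_size": 768,
    "num_classes": 28,
    "batch_size": 64,   # larger than the original 16
    "lr": 3e-4,         # smaller than the default 0.003
    "data_dir": "data/bios",
    "device_id": 0,
    "exp_id": "fcl",
}
```

The rest of the training code (BaseOptions, get_state, get_main_model, train_self) would stay as in the snippet above.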

Best, Xudong

AntoineGourru commented 1 year ago

I reduced the learning rate and increased the batch size, and it seems to work much better. Thank you very much.

Many thanks for this work and for your kind answer,

Antoine

AntoineGourru commented 1 year ago

Dear Xudong,

Thanks again for your kind and prompt answer :-). We still do not manage to reach the results you report in your articles (we get close, but not exactly).

For example, for the CE baseline, we reach 79.05 as the maximum accuracy on the BiasInBios dataset.

Could you possibly share the parameters you used for BiasInBios and Moji (the optimal ones leading to the results in your papers)? And similarly for the other methods?

Best regards,

Antoine and Thibaud (@LetenoThibaud)

HanXudong commented 1 year ago

Hi Antoine and Thibaud,

Once again, thanks for reaching out!

Please be aware that we used fixed encoder models (e.g., BERT) in our previous experiments and only trained an MLP to make predictions. In our recent experiments, we tried to fine-tune the whole model to further improve the results. To fine-tune the whole BERT model, could you please:

  1. set n_freezed_layers = 0 in your BERT model class (or in https://github.com/HanXudong/fairlib/blob/909f95237e26ed41d15f5777ba54ce4863e1f0c8/fairlib/src/networks/classifier.py#L147)
  2. set batch_size = 32
  3. set learning rate lr = 5e-6
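Put together, the three settings above might look like the following in the args dict from earlier in the thread. Whether n_freezed_layers can be passed through the args dict, rather than set directly in the BERT model class as noted in step 1, is an assumption:

```python
# Sketch of the suggested fine-tuning configuration. It is assumed here
# that "n_freezed_layers" is accepted as an option; if not, set it in the
# BERT model class (see classifier.py linked above) instead.
args = {
    "dataset": "Bios_gender",
    "emb_size": 768,
    "num_classes": 28,
    "data_dir": "data/bios",
    "device_id": 0,
    "exp_id": "fcl",
    "n_freezed_layers": 0,  # fine-tune all BERT layers
    "batch_size": 32,
    "lr": 5e-6,
}
```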

In terms of the hyperparameters of each debiasing method, we used the same batch size and learning rate as the vanilla method, and only searched for the best trade-off hyperparameters for each debiasing method. The corresponding results can be downloaded. I have attached a Jupyter notebook demonstrating the process, which can be run in Google Colab. Please have a look, and feel free to message me for any further information.

Reproduce_Results.zip

Best, Xudong

LetenoThibaud commented 1 year ago

Hi Xudong,

Thank you for your quick answer,

Based on your code, and using the data downloaded via the notebook you sent, we managed to reproduce your vanilla results.

This will be very helpful for our work; thanks again.

Best regards,

Thibaud