MLforHealth / CXR_Fairness

Improving the Fairness of Chest X-ray Classifiers

Cannot reproduce the results of Balanced ERM? #1

Closed ys-zong closed 2 years ago

ys-zong commented 2 years ago

Hi, thanks for the nice work and code!

When trying to reproduce the results, I found that Balanced ERM does not always perform better than plain ERM. For example, in Figure D.5 of the paper, Balanced ERM outperforms ERM for every corresponding subgroup. In my runs, however, Balanced ERM performs worse than ERM on the minority group (see the pictures below, on the CheXpert dataset), which seems to contradict the results presented in the paper.

I followed your exact preprocessing steps and ran the experiments with python -m cxr_fairness.train --xxx. Is there a step I'm missing, or have you encountered this situation before? Thanks!

image

hzhang0 commented 2 years ago

Thank you for your interest in our work!

I wouldn't say that the results shown here contradict what we have in the paper, as all of the AUROCs presented here seem to fall within the 95% CIs from Figure D5. If possible, I would recommend training 5 models with different data splits, and then using bootstrapping to generate confidence intervals, as we did in the paper. Without confidence intervals, it is hard to claim that one model performs better or worse than another, because performance varies considerably with the training procedure, data splits, and evaluation cohort.
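As a rough illustration of the bootstrapping idea, here is a minimal sketch of a percentile bootstrap CI for the AUROC of a single model on one evaluation set. It is not necessarily the exact procedure used in the paper (which also pools models trained on different splits); the function names `auroc` and `bootstrap_auroc_ci`, the default of 1000 resamples, and the 0/1 label encoding are assumptions made for this example.

```python
import random

def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; tied scores get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    point = auroc(labels, scores)  # also raises early if a class is missing
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper
```

Running this per subgroup (for example, once per age bin) gives one interval per group; small subgroups will typically produce wide intervals, which is why point estimates alone can be misleading.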

For age, looking at the comparator plot (Figure D6), we found that Balanced ERM does not significantly outperform ERM on AUROC for any age group. This seems to concur with your results, given that the CIs for age tend to be quite large.

For sex in Figure D6, Balanced ERM seems to barely outperform ERM with significance, though the gain in AUROC is less than 1%, and there is still a large overlap in their CIs in Figure D5. I would conjecture that the differences you observe are simply within normal variance.
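To check whether one model is significantly better than another on the same evaluation set, one common approach is a paired bootstrap on the AUROC difference. The sketch below is an assumption about how such a comparison could be done, not a description of how Figure D6 was produced; `paired_bootstrap_diff` and its parameters are hypothetical names chosen for this example.

```python
import random

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_diff(labels, scores_a, scores_b,
                          n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for AUROC(a) - AUROC(b), resampling the same rows for both."""
    auroc(labels, scores_a)  # raises early if a class is missing
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # one resample, shared by a and b
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            a = auroc(ys, [scores_a[i] for i in idx])
            b = auroc(ys, [scores_b[i] for i in idx])
            diffs.append(a - b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

If the returned interval contains 0, the data do not support the claim that either model is better at the chosen level. Because both models are scored on the same resampled rows, this test can detect a small but consistent gap even when the two individual CIs overlap heavily.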

Regardless, I would say that the main result from our paper was that none of the other benchmarked minimax fairness methods outperform Balanced ERM, not necessarily that Balanced ERM outperforms ERM. In the chest X-ray setting, I would not expect data balancing to make a huge difference (compared to the datasets evaluated in e.g. [1]), though we did find isolated cases where it does do significantly better.

I hope that answers your question, and let me know if you run into any other issues!

[1] https://arxiv.org/pdf/2110.14503.pdf