Cannot reproduce the results of Balanced ERM?

Thank you for your interest in our work!

I wouldn't say that the results shown here contradict what we have in the paper, as all of the AUROCs presented here seem to fall within the 95% CIs from Figure D5. If possible, I would recommend training 5 models with different data splits and then bootstrapping to generate confidence intervals, as we did in the paper. Without confidence intervals, it is hard to claim that one model performs better or worse than another, since performance varies considerably with the training procedure, data splits, and evaluation cohort.
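As a rough illustration of the bootstrapping step, here is a minimal sketch of a percentile bootstrap CI for AUROC on one model's test-set predictions. It is not the code from the repository: the function names `auroc` and `bootstrap_auroc_ci`, the number of resamples, and the percentile method are all assumptions made for the example, and the exact procedure in the paper (for instance, how the 5 splits are combined) may differ.

```python
import random

def auroc(labels, scores):
    """AUROC as the Mann-Whitney statistic; ties count as 0.5.

    O(n_pos * n_neg), which is fine for a sketch but slow on large cohorts.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap.

    Hypothetical helper: resamples test examples with replacement.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined when a resample has only one class
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return auroc(labels, scores), lower, upper

if __name__ == "__main__":
    y = [0, 0, 1, 1, 0, 1, 1, 0]
    s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5]
    print(bootstrap_auroc_ci(y, s, n_boot=500))
```

Running this per subgroup (e.g. per age bin or per sex) gives intervals that can be compared against the width of the CIs in Figure D5.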

For age, looking at the comparator plot (Figure D6), we found that Balanced ERM does not significantly outperform ERM on AUROC for any age group. This seems to concur with your results, given that the CIs for age tend to be quite large.

For sex, Figure D6 shows Balanced ERM barely outperforming ERM with statistical significance, though the gain in AUROC is less than 1%, and their CIs still overlap substantially in Figure D5. I would conjecture that your observations fall within this normal variance.
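If you want to check whether a gap between two models is distinguishable from noise on your own runs, one option is a paired bootstrap on the AUROC difference, resampling the same test examples for both models. The sketch below is only an assumption about a reasonable test, not necessarily how Figure D6 was produced; `paired_bootstrap_diff` and its parameters are hypothetical names chosen for the example.

```python
import random

def auroc(labels, scores):
    """AUROC as the Mann-Whitney statistic; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_diff(labels, scores_a, scores_b,
                          n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for AUROC(model A) - AUROC(model B).

    Hypothetical helper: both models are scored on the same resampled
    indices, so variance shared across models cancels in the difference.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples where AUROC is undefined
        a = auroc(ys, [scores_a[i] for i in idx])
        b = auroc(ys, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

If the resulting interval contains 0, the data do not support saying that either model is better on that subgroup. Note that this only captures evaluation-set variance, not variance from retraining with different splits or seeds.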

Regardless, I would say that the main result from our paper was that none of the other benchmarked minimax fairness methods outperform Balanced ERM, not necessarily that Balanced ERM outperforms ERM. In the chest X-ray setting, I would not expect data balancing to make a huge difference (compared to the datasets evaluated in e.g. [1]), though we did find isolated cases where it does do significantly better.

I hope that answers your question, and let me know if you run into any other issues!

[1] https://arxiv.org/pdf/2110.14503.pdf
