Closed by brave-cattle 2 years ago
Hello. Yes, you are correct: taking the smallest class size (9 for heavy_makeup and 182 for blond_hair) for every attribute pair is how we interpreted the original setup as well. That setup, including the baseline results, is taken from https://arxiv.org/pdf/2007.02561.pdf.
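For readers who want to reproduce this kind of balanced evaluation subset, here is a minimal sketch of subsampling every attribute combination down to the size of the smallest one. It is not taken from the repository or from the LfF code: the function name `balanced_subsample`, the seed, and the integer group ids are all hypothetical, and the group sizes are simply the valid-set counts quoted in this thread.

```python
import random
from collections import Counter, defaultdict


def balanced_subsample(group_ids, seed=0):
    """Return sorted sample indices with an equal count per group.

    Hypothetical helper: `group_ids[i]` is the (target, bias) combination
    of sample i, encoded as any hashable value.
    """
    rng = random.Random(seed)

    # Collect the sample indices belonging to each group.
    by_group = defaultdict(list)
    for idx, group in enumerate(group_ids):
        by_group[group].append(idx)

    # Every group is cut down to the size of the smallest group.
    n_per_group = min(len(indices) for indices in by_group.values())

    picked = []
    for group in sorted(by_group):
        picked.extend(rng.sample(by_group[group], n_per_group))
    return sorted(picked)


# Group sizes reported in this thread for (heavy_makeup, gender).
sizes = [3667, 8449, 7742, 9]
group_ids = [g for g, size in enumerate(sizes) for _ in range(size)]

subset = balanced_subsample(group_ids)
counts = Counter(group_ids[i] for i in subset)
print(len(subset), dict(counts))
```

Because three of the four groups are sampled at random, the resulting accuracy depends on the seed, which is worth fixing and reporting.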
Thank you so much for the quick and kind reply. I checked the LfF code. It seems that they calculate the accuracy separately for each combination and then take the average, which is consistent in principle with the practice you described. However, I wonder whether such a setting over-amplifies the impact of the combination with the fewest samples, since its sample count is far lower than that of the other combinations. I'd like to hear your thoughts on this issue.
Yes, I agree that such a small size is not ideal and can lead to greater variability. You could definitely use the full valid set to obtain more precise results, or even build a new test set (however, you would need to re-test all of the techniques).
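To put a rough number on that variability, the sketch below computes how much one flipped prediction moves the score when a group holds only 9 samples, along with the binomial standard error of a group accuracy. The true accuracy of 0.5 is a hypothetical worst-case assumption, not a measured value, and 3667 is just one of the larger group sizes mentioned in this thread, used for comparison.

```python
import math


def binomial_std_error(p, n):
    """Standard error of an accuracy estimate from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


n_small = 9    # smallest group size discussed in this thread
n_groups = 4   # number of (target, bias) combinations

# One prediction flipping in the 9-sample group changes that group's
# accuracy by 1/9, and the unweighted 4-group average by 1/36.
step_group = 1.0 / n_small
step_avg = step_group / n_groups

# Assumed true accuracy of 0.5 (worst case for the standard error).
se_small = binomial_std_error(0.5, n_small)
se_large = binomial_std_error(0.5, 3667)

print(f"per-group step: {step_group:.4f}, averaged step: {step_avg:.4f}")
print(f"std error with n=9: {se_small:.4f}, with n=3667: {se_large:.4f}")
```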
Hello, thank you for generously sharing your code. I did not find anything about CelebA in the code, so I would like to confirm your experimental settings. Taking heavy_makeup as the target and gender as the bias attribute as an example, the numbers of samples for the four combinations of (heavy_makeup, gender) in the valid set are 3667, 8449, 7742, and 9. When the article says that the same number of samples is taken, does it mean that only 9 samples are selected in each case? I trained a ResNet18 baseline model to classify heavy_makeup. The numbers of correctly classified samples for the four combinations of (heavy_makeup, gender) are 2829/3667, 8444/8449, 6998/7742, and 4/9. This result is significantly higher than the baseline in the article, so I want to confirm whether I have misunderstood the description of the experimental setup in the article.
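The gap between the two ways of scoring can be illustrated directly from the counts above. The following sketch compares the pooled accuracy (all samples weighted equally) with the unweighted mean of the four per-combination accuracies. The group labels `group_0` to `group_3` are hypothetical and only follow the order in which the counts were listed; the post does not say which (heavy_makeup, gender) values each one corresponds to.

```python
# (correct, total) per combination, in the order listed in the post.
results = {
    "group_0": (2829, 3667),
    "group_1": (8444, 8449),
    "group_2": (6998, 7742),
    "group_3": (4, 9),
}

# Accuracy computed separately for each combination.
per_group_acc = {name: c / t for name, (c, t) in results.items()}

# Unweighted mean over combinations: every group counts equally,
# so the 9-sample group carries a quarter of the final score.
group_avg_acc = sum(per_group_acc.values()) / len(per_group_acc)

# Pooled accuracy: every sample counts equally,
# so the 9-sample group is almost invisible.
pooled_acc = (
    sum(c for c, _ in results.values()) / sum(t for _, t in results.values())
)

for name, acc in per_group_acc.items():
    print(f"{name}: {acc:.4f}")
print(f"group-averaged: {group_avg_acc:.4f}, pooled: {pooled_acc:.4f}")
```

The two numbers differ noticeably for these counts, so it matters which one is being compared against the baseline reported in the article.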