Out of curiosity, we took the 0.25 datasets (train and val) and ran them through the ResNet34 model and weights trained by the authors.
This 0.25 dataset is the one that the face-alignment step in predict.py creates (so we assume it is equivalent to the one used and referred to in the paper for the 7 race classes).
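For reference, this is roughly how we ran the check; it is a minimal sketch rather than the authors' exact script. The weight file name, CSV name, image paths and column names are assumptions (adjust to your local layout), and we assume the fc head outputs 18 logits (7 race + 2 gender + 9 age) as in the repo's predict.py:

```python
# Minimal inference sketch (assumptions marked below), not the authors' exact code.
import pandas as pd
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torchvision.models.resnet34(weights=None)
model.fc = nn.Linear(model.fc.in_features, 18)  # assumed: 7 race + 2 gender + 9 age logits
model.load_state_dict(torch.load("res34_fair_align_multi_7_20190809.pt",  # assumed weight file
                                 map_location=device))
model.to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

labels = pd.read_csv("fairface_label_val.csv")  # assumed CSV with a "file" column
preds = []
with torch.no_grad():
    for path in labels["file"]:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        race_logits = model(img)[0, :7]          # first 7 logits taken as the race scores
        preds.append(int(race_logits.argmax()))
labels["pred_race_idx"] = preds
```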
Interestingly, the accuracy (match between the labels in the original csv and the classes predicted by the published model) differs between the train and val (test) datasets, and both are lower than presented in the paper:
Train:
  full set: 85% (73386 / 86744)
  service_test == True: 84% (33794 / 40252)
Val:
  full set: 78% (8511 / 10954)
  service_test == True: 77% (3955 / 5162)
BTW, as fellow commenters previously discovered, the filter service_test == True defines a subset in which the labels are balanced in terms of race and gender. We therefore calculated the metrics both for the full set and for this subset.
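The metric itself is just a label match rate. Here is a sketch of that calculation, assuming a merged CSV with hypothetical column names (race for the original label, pred_race for the model output, and the boolean service_test):

```python
import pandas as pd

# Sketch of the accuracy check; file and column names are assumptions.
df = pd.read_csv("val_with_predictions.csv")  # hypothetical file: labels merged with predictions
df["correct"] = df["pred_race"] == df["race"]

full_acc = df["correct"].mean()
subset = df[df["service_test"] == True]
subset_acc = subset["correct"].mean()

print(f"full set:             {full_acc:.2%} ({df['correct'].sum()} / {len(df)})")
print(f"service_test == True: {subset_acc:.2%} ({subset['correct'].sum()} / {len(subset)})")
```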
We would have expected higher and more consistent percentages.
The paper presents comparison tables in which the model's accuracy is 94%, which is not supported by these findings.
The drop in validation accuracy might suggest that the ResNet34 model (a simple one with the fc head replaced) was indeed trained on this dataset, but it evidently does not reproduce the published results.
BTW, if one looks deeper into the images, some are very challenging (low resolution, profile views, back of the head, low light, etc.). Whether and what this means for the data quality and balance of the dataset (level, consistency, distribution, etc.) needs further consideration. For example, if lower-quality images (with the term still to be defined) are present in a higher proportion within one class than within the others, that can put the dataset out of balance regardless of what the label distribution suggests, because in that case some labels carry less, or confusing, information concentrated in a specific class. We did not perform such an analysis; we only flag the potential issue here.
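For illustration only, one such check could compare a simple quality proxy per class, e.g. a blur heuristic (variance of the Laplacian) plus the raw resolution, grouped by race label. The sketch below uses assumed file and column names and is not something we actually ran:

```python
import cv2
import pandas as pd

# Hypothetical per-class image-quality probe (NOT performed by us).
df = pd.read_csv("fairface_label_train.csv")  # assumed CSV with "file" and "race" columns

def quality_stats(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    blur = cv2.Laplacian(img, cv2.CV_64F).var()  # low variance -> likely blurry image
    h, w = img.shape
    return pd.Series({"blur": blur, "pixels": h * w})

stats = df["file"].apply(quality_stats)
per_class = pd.concat([df["race"], stats], axis=1).groupby("race").median()
print(per_class)  # large gaps between classes would hint at a quality imbalance
```

A noticeably lower median blur score or resolution for one class would be a first hint that the balance suggested by the label counts does not hold at the information level.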
Please feel free to correct any inaccuracy or misinterpretation above or provide an explanation.