LAION-AI / CLIP_benchmark

CLIP-like model evaluation

add fairness-related evaluations #15

Open mitchellnw opened 1 year ago

mitchellnw commented 1 year ago

Could be good to have some fairness-related datasets, e.g., from https://arxiv.org/abs/2108.02818. Curious how LAION CLIP compares to OAI CLIP.

rom1504 commented 1 year ago

Yes, sounds great.


mehdidc commented 1 year ago

Also https://github.com/facebookresearch/vissl/blob/main/projects/fairness_indicators/README.md

mehdidc commented 1 year ago

FairFace: https://github.com/joojs/fairface. Will add a FairFace-based evaluation, following the excellent notebook https://colab.research.google.com/drive/13f8B2698YWcbCmApe8IdlAm3oAoGZ4D8?usp=sharing#scrollTo=5b1VxTcfYhSz from @Rijgersberg.
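
For reference, a minimal sketch of the zero-shot setup such a notebook would use, assuming open_clip and a local copy of the FairFace validation images; the model name, class list, and prompt template here are illustrative choices, not the notebook's exact code:

import torch
import open_clip
from PIL import Image

# Load an OpenAI-pretrained CLIP via open_clip (any CLIP-like model works here).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14-336", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14-336")

# FairFace race labels; the prompt template is an assumption.
races = ["White", "Black", "Indian", "East Asian", "Southeast Asian", "Middle Eastern", "Latino_Hispanic"]
prompts = [f"a photo of a {r.replace('_', ' ')} person" for r in races]

with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))
    text_features /= text_features.norm(dim=-1, keepdim=True)

def predict_race(image_path):
    # Return the race whose prompt embedding is closest to the image embedding.
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)
    return races[(image_features @ text_features.T).argmax().item()]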

Rijgersberg commented 1 year ago

Ah yes it would be very interesting to see how other CLIP-like models behave on this dataset. Especially since I was unable to replicate the results from the CLIP paper and never got a response from OpenAI on that.

mehdidc commented 1 year ago

@Rijgersberg I could reproduce your numbers with ViT-L-14-336, so yes, I am not sure why there is such a big difference from the results reported in the CLIP paper. I tried playing a bit with the prompts, but it does not change much, especially for the non-human categories, which stay very low.

mehdidc commented 1 year ago

On the other hand, for results on gender or race prediction only (Table 3 from the CLIP paper), accuracy is not exactly the same but close: on race prediction I get 59.2% (CLIP reports 58.3%), and on gender prediction I get 96.2% (CLIP reports 95.9%).

mehdidc commented 1 year ago

The setup they had feels weird to me anyway. I am not sure why crime-related classes are added to the existing (gender/race) classes so that the classifier is forced to choose between gender/race and crime-related classes; I think it should be more like multi-label classification. Maybe retrieve images from FairFace with crime-related prompts above a certain similarity threshold, then just plot the race/gender distribution of the retrieved images?
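
A rough sketch of that retrieval idea, assuming precomputed L2-normalized image embeddings for FairFace and its metadata dataframe; the function name and the 0.2 threshold are placeholders, not values from any paper:

import torch

def retrieved_race_distribution(image_feats, meta, text_feat, threshold=0.2):
    # image_feats: (N, D) tensor of L2-normalized FairFace image embeddings
    # meta: DataFrame with a 'race' column aligned with the embedding rows
    # text_feat: (D,) L2-normalized embedding of a crime-related prompt
    sims = image_feats @ text_feat              # cosine similarity per image
    hits = meta[(sims > threshold).numpy()]     # images "retrieved" by the prompt
    return hits["race"].value_counts(normalize=True)

Comparing that distribution against the base rates in the dataset would show whether a given prompt retrieves some groups disproportionately.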

groovy-lizard commented 2 months ago

On the other hand, for results on gender or race prediction only (Table 3 from the CLIP paper), accuracy is not exactly the same but close: on race prediction I get 59.2% (CLIP reports 58.3%), and on gender prediction I get 96.2% (CLIP reports 95.9%).

Hello @mehdidc, how are you?

I'm trying to replicate these results using the notebook, but to no avail. Could you share some details on how you did it?

Here is a snippet of how I was trying to achieve this:


import pandas as pd
from sklearn.metrics import accuracy_score

# `fairface_labels` (ground-truth race strings) and `predictions` (zero-shot
# CLIP predictions) are produced in earlier cells of the notebook.
fface_df = pd.read_csv("./data/fairface/fface_val.csv")
fface_df.drop(columns=['service_test'], inplace=True)
fface_df['race_labels'] = fairface_labels
fface_df['race_preds'] = predictions

# Split into 'White' and all other races, mirroring the White/Non-White
# grouping from the CLIP paper.
white_df = fface_df[fface_df['race'] == 'White']
non_white_df = fface_df[fface_df['race'] != 'White']

print("** FairFace dataset validation split 0.25 **")
print("Race accuracy of 'White' race images:")
print(round(accuracy_score(white_df['race_labels'], white_df['race_preds']), 4))
print("Race accuracy of all other races grouped as 'Non-White':")
print(round(accuracy_score(non_white_df['race_labels'], non_white_df['race_preds']), 4))
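
One thing that may be missing: `fairface_labels` and `predictions` have to be built the same way the notebook builds them. A hypothetical sketch, assuming cached L2-normalized image and text embeddings (`image_feats`, `text_feats`) and the `races` list from the earlier sketch:

import numpy as np

def zero_shot_predict(image_feats, text_feats, class_names):
    # image_feats: (N, D) and text_feats: (C, D), both L2-normalized
    idx = (image_feats @ text_feats.T).argmax(axis=1)
    return [class_names[i] for i in idx]

predictions = zero_shot_predict(image_feats, text_feats, races)
fairface_labels = fface_df['race'].tolist()  # ground truth straight from the CSV

If the predicted label strings don't match the CSV exactly (e.g. 'Latino_Hispanic' vs. 'Latino Hispanic'), the accuracy will be silently wrong, which is one possible reason for failing to reproduce the numbers.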