dccuchile / wefe

WEFE: The Word Embeddings Fairness Evaluation Framework. WEFE is a framework that standardizes the bias measurement and mitigation in Word Embeddings models. Please feel welcome to open an issue in case you have any questions or a pull request if you want to contribute to the project!
https://wefe.readthedocs.io/
MIT License

WEAT effect size: Different values #11

Closed: santoshbs closed this issue 3 years ago

santoshbs commented 3 years ago

I get slightly different values when using WEFE's weat.run_query() than when doing the calculations manually, as below:

def getCosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

list_avg_diffs= []

avg_t1_a1= 0
avg_t1_a2= 0
avg_t1_diff= 0

for t in t1: #t1= ['brother', 'father', 'uncle', 'grandfather', 'son']
    x= w2v[t]
    for a in a1: #a1= ['science', 'technology', 'physics', 'chemistry', 'Einstein', 'NASA', 'experiment', 'astronomy']
        y= w2v[a]
        avg_t1_a1= avg_t1_a1 + getCosine(x,y)
    avg_t1_a1= avg_t1_a1/len(a1)

    for a in a2: #a2= ['poetry', 'art', 'Shakespeare', 'dance', 'literature', 'novel', 'symphony', 'drama']
        y= w2v[a]
        avg_t1_a2= avg_t1_a2 + getCosine(x,y)
    avg_t1_a2= avg_t1_a2/len(a2)

    avg_t1_diff= avg_t1_diff + (avg_t1_a1 - avg_t1_a2)
    list_avg_diffs.append(avg_t1_a1 - avg_t1_a2)

avg_t1_diff= avg_t1_diff/len(t1)

avg_t2_a1= 0
avg_t2_a2= 0
avg_t2_diff= 0

for t in t2: #t2= ['sister', 'mother', 'aunt', 'grandmother', 'daughter']
    x= w2v[t]
    for a in a1: #a1= ['science', 'technology', 'physics', 'chemistry', 'Einstein', 'NASA', 'experiment', 'astronomy']
        y= w2v[a]
        avg_t2_a1= avg_t2_a1 + getCosine(x,y)
    avg_t2_a1= avg_t2_a1/len(a1)

    for a in a2: #a2= ['poetry', 'art', 'Shakespeare', 'dance', 'literature', 'novel', 'symphony', 'drama']
        y= w2v[a]
        avg_t2_a2= avg_t2_a2 + getCosine(x,y)
    avg_t2_a2= avg_t2_a2/len(a2)

    avg_t2_diff= avg_t2_diff + (avg_t2_a1 - avg_t2_a2)
    list_avg_diffs.append(avg_t2_a1 - avg_t2_a2)

avg_t2_diff= avg_t2_diff/len(t2)

diff_of_diff= avg_t1_diff - avg_t2_diff
sd= np.std(list_avg_diffs) #use ddof=1 in case you need same sd as in R
weat_effect_size= diff_of_diff/sd

The weat_effect_size above is 1.6132258039368363, whereas the value obtained using weat.run_query() is 1.674.

I am not sure why there is a difference of about 0.06 in effect size. Any help would be appreciated.

santoshbs commented 3 years ago

@pbadillatorrealba - I was wondering if you had a chance to examine the difference in WEAT effect sizes obtained as above. Many thanks.

pbadillatorrealba commented 3 years ago

Hello @santoshbs

I am currently looking into what you have reported. I will have an answer for you later today.

Best regards, Pablo.

pbadillatorrealba commented 3 years ago

Hello,

I found a problem in your implementation. As I understand it, you are using the variables avg_t1_a1, avg_t1_a2, avg_t2_a1 and avg_t2_a2 to accumulate the averages.

The problem is that, when calculating the average for each group, you are not resetting the accumulator variables at the start of each iteration. As a result, only the first average in each loop over t1 and t2 is calculated correctly; every later one is contaminated by the residual value left over from the previous iterations.
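To make the effect of the missing reset concrete, here is a minimal sketch with made-up numbers instead of embeddings. The function names `per_row_means_buggy` and `per_row_means_fixed` are hypothetical and exist only for this illustration.

```python
def per_row_means_buggy(rows):
    """Accumulator initialised once, outside the outer loop (the bug)."""
    acc = 0.0
    means = []
    for row in rows:
        for value in row:
            acc = acc + value
        acc = acc / len(row)  # this mean is still in acc when the next row starts
        means.append(acc)
    return means


def per_row_means_fixed(rows):
    """Accumulator reset at the start of every outer iteration."""
    means = []
    for row in rows:
        acc = 0.0  # reset, so only this row's values are averaged
        for value in row:
            acc = acc + value
        means.append(acc / len(row))
    return means


rows = [[1.0, 3.0], [10.0, 20.0]]
buggy = per_row_means_buggy(rows)
fixed = per_row_means_fixed(rows)
print(buggy)  # [2.0, 16.0]: the second mean includes the leftover 2.0
print(fixed)  # [2.0, 15.0]
```

The first mean is the same in both versions, which matches the observation that only the first word of each target set is handled correctly.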

Here is the code with the accumulator initializations moved inside the loops. It yields a value very close to the one returned by WEFE: 1.4170707838163572 versus 1.41707071.

# I assumed you were using word2vec loaded from the gensim interface.
import numpy as np
import gensim.downloader as api
w2v = api.load("word2vec-google-news-300")

def getCosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

list_avg_diffs = []

avg_t1_diff = 0
# deleted
# avg_t1_a1 = 0
# avg_t1_a2 = 0

for t in t1:  #t1= ['brother', 'father', 'uncle', 'grandfather', 'son']

    # -------------------------------------------------------------------------
    # These lines reset the accumulators at the start of each iteration so
    # that each average is calculated only from the values of that iteration.
    # -------------------------------------------------------------------------

    avg_t1_a1 = 0
    avg_t1_a2 = 0

    # ----------------------------------------------------------------

    x = w2v[t]
    for a in a1:  #a1= ['science', 'technology', 'physics', 'chemistry', 'Einstein', 'NASA', 'experiment', 'astronomy']
        y = w2v[a]
        avg_t1_a1 = avg_t1_a1 + getCosine(x, y)
    avg_t1_a1 = avg_t1_a1 / len(a1)

    for a in a2:  #a2= ['poetry', 'art', 'Shakespeare', 'dance', 'literature', 'novel', 'symphony', 'drama']
        y = w2v[a]
        avg_t1_a2 = avg_t1_a2 + getCosine(x, y)
    avg_t1_a2 = avg_t1_a2 / len(a2)

    avg_t1_diff = avg_t1_diff + (avg_t1_a1 - avg_t1_a2)
    list_avg_diffs.append(avg_t1_a1 - avg_t1_a2)

avg_t1_diff = avg_t1_diff / len(t1)

avg_t2_diff = 0

for t in t2:  #t2= ['sister', 'mother', 'aunt', 'grandmother', 'daughter']

    # ----------------------------------------------------------------
    # Same case here.
    # ----------------------------------------------------------------

    avg_t2_a1 = 0
    avg_t2_a2 = 0

    # ----------------------------------------------------------------

    x = w2v[t]
    for a in a1:  #a1= ['science', 'technology', 'physics', 'chemistry', 'Einstein', 'NASA', 'experiment', 'astronomy']
        y = w2v[a]
        avg_t2_a1 = avg_t2_a1 + getCosine(x, y)
    avg_t2_a1 = avg_t2_a1 / len(a1)

    for a in a2:  #a2= ['poetry', 'art', 'Shakespeare', 'dance', 'literature', 'novel', 'symphony', 'drama']
        y = w2v[a]
        avg_t2_a2 = avg_t2_a2 + getCosine(x, y)
    avg_t2_a2 = avg_t2_a2 / len(a2)

    avg_t2_diff = avg_t2_diff + (avg_t2_a1 - avg_t2_a2)
    list_avg_diffs.append(avg_t2_a1 - avg_t2_a2)

avg_t2_diff = avg_t2_diff / len(t2)

diff_of_diff = avg_t1_diff - avg_t2_diff
sd = np.std(list_avg_diffs)  #use ddof=1 in case you need same sd as in R
weat_effect_size = diff_of_diff / sd
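The same calculation can also be written without explicit accumulators, which removes the opportunity for this kind of bug. The following is a sketch, not WEFE's internal code: the function `weat_effect_size` and its `ddof` parameter are names chosen for this example, and random vectors stand in for the real embeddings so that it runs without downloading a model.

```python
import numpy as np


def weat_effect_size(T1, T2, A1, A2, ddof=0):
    """Effect size as in the manual calculation above.

    T1, T2, A1, A2 are 2-D arrays with one embedding per row.
    """
    def unit(M):
        # Normalise rows so that a dot product equals the cosine similarity.
        M = np.asarray(M, dtype=np.float64)
        return M / np.linalg.norm(M, axis=1, keepdims=True)

    T1, T2, A1, A2 = unit(T1), unit(T2), unit(A1), unit(A2)

    # For every target word: mean cosine to A1 minus mean cosine to A2.
    s1 = (T1 @ A1.T).mean(axis=1) - (T1 @ A2.T).mean(axis=1)
    s2 = (T2 @ A1.T).mean(axis=1) - (T2 @ A2.T).mean(axis=1)

    # Standard deviation pooled over both target sets, like list_avg_diffs.
    sd = np.std(np.concatenate([s1, s2]), ddof=ddof)
    return (s1.mean() - s2.mean()) / sd


# Random stand-ins for 5 + 5 target words and 8 + 8 attribute words.
rng = np.random.default_rng(0)
T1, T2 = rng.normal(size=(5, 50)), rng.normal(size=(5, 50))
A1, A2 = rng.normal(size=(8, 50)), rng.normal(size=(8, 50))
effect = weat_effect_size(T1, T2, A1, A2)
print(effect)
```

With the real model, the inputs could be built as `np.array([w2v[w] for w in t1])` and likewise for the other three word lists. Passing `ddof=1` gives the sample standard deviation mentioned in the comment about R.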

For comparison, this is the WEFE code I ran:

import gensim.downloader as api

from wefe.word_embedding_model import WordEmbeddingModel
from wefe.metrics import WEAT
from wefe.query import Query

w2v = api.load("word2vec-google-news-300") 
model = WordEmbeddingModel(w2v, 'w2v')

t1= ['brother', 'father', 'uncle', 'grandfather', 'son']
a1= ['science', 'technology', 'physics', 'chemistry', 'Einstein', 'NASA', 'experiment', 'astronomy']
a2= ['poetry', 'art', 'Shakespeare', 'dance', 'literature', 'novel', 'symphony', 'drama']
t2= ['sister', 'mother', 'aunt', 'grandmother', 'daughter']

q = Query([t1, t2],[a1, a2],['Male terms', 'Female terms'], ['Science', 'Arts'])
WEAT().run_query(q, model, warn_not_found_words=True)

I noticed that the two results do not match exactly because the calculations are done in float32 rather than float64. In a future release I may include changes so that the calculations are done with higher precision.
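The size of the float32 effect can be checked in isolation. This is a small sketch that uses random 300-dimensional vectors as an assumption in place of real word2vec vectors, and computes the same cosine similarity in both precisions.

```python
import numpy as np


def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


rng = np.random.default_rng(0)
a64 = rng.normal(size=300)  # float64 by default
b64 = rng.normal(size=300)
a32 = a64.astype(np.float32)  # same vectors, rounded to float32
b32 = b64.astype(np.float32)

cos64 = cosine(a64, b64)
cos32 = cosine(a32, b32)
gap = abs(float(cos32) - float(cos64))

print("float64:", cos64)
print("float32:", cos32)
print("absolute gap:", gap)
print("float32 machine epsilon:", np.finfo(np.float32).eps)
```

The gap should be tiny compared with the 0.06 discrepancy from the original report, which is consistent with the corrected manual value and the WEFE value agreeing to roughly seven or eight significant digits.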

I hope I have helped you!

Regards, Pablo.

santoshbs commented 3 years ago

@pbadillatorrealba - Many thanks for the very kind help. Really appreciate you taking the time to identify the issue with my implementation of WEAT effect size.