equality of opportunity fairness criteria

Hamedloghmani commented 1 year ago

This issue breaks down the required steps to implement equality of opportunity fairness criteria in Adila.

[x] 1. Build a matrix that shows which members have each skill:
       teamids, skillvecs, membervecs = teamsvecs['id'], teamsvecs['skill'], teamsvecs['member']
       skill_member = skillvecs.transpose() @ membervecs
   In skill_member, rows are skills and columns are members.
[x] 2. For each team, we find the required skills
[x] 3. Find the qualified set (the set of members that have all the required skills for the team)
[x] 4. Split the qualified set into popular vs. non-popular members and pass them to the re-ranker
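
The four steps above can be sketched on toy data as follows. This is only a minimal illustration, not the Adila implementation: it assumes small dense NumPy arrays in place of the sparse teamsvecs matrices, and the `popular` labels and both helper functions (`qualified_members`, `split_by_popularity`) are hypothetical names introduced here.

```python
import numpy as np

# Toy stand-ins for teamsvecs['skill'] and teamsvecs['member'] (assumed dense here).
# Each row is one team.
skillvecs = np.array([[1, 1, 0],    # team 0 requires s0, s1
                      [0, 1, 1],    # team 1 requires s1, s2
                      [1, 0, 0]])   # team 2 requires s0
membervecs = np.array([[1, 1, 0, 0],   # team 0 has m0, m1
                       [0, 1, 1, 0],   # team 1 has m1, m2
                       [1, 0, 0, 1]])  # team 2 has m0, m3

# Step 1: skills x members matrix of co-occurrence counts.
skill_member = skillvecs.T @ membervecs

# Hypothetical popularity labels, one per member.
popular = np.array([True, False, False, True])

def qualified_members(team_idx):
    # Step 2: column indexes of the skills the team requires.
    required = np.flatnonzero(skillvecs[team_idx])
    # Step 3: members with a non-zero count for every required skill.
    has_skill = skill_member[required] > 0
    return np.flatnonzero(has_skill.all(axis=0))

def split_by_popularity(team_idx):
    # Step 4: popular vs. non-popular members of the qualified set.
    q = qualified_members(team_idx)
    return q[popular[q]].tolist(), q[~popular[q]].tolist()

print(split_by_popularity(2))
```

Using `all(axis=0)` corresponds to the intersection reading of "qualified"; replacing it with `any(axis=0)` would give the union reading.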

Questions before the final implementation:

Which part of the pipeline is the right place for this implementation (obviously it should come before the rerank function)?
Does the row id in the .pred files correspond to the team ids in the teamsvecs file?

hosseinfani commented 1 year ago

Based on splits.log, we know the row id of a test instance, for example:

rowid = 5

Then teamsvecs['skill'][5] = {set of skills for team #5} = {s12, s15, s3}

You then look up rows {12, 15, 3} of skill_member (these are the skill column indexes in teamsvecs['skill']):

skill_member[12] = [0, 1, 3, 0, 0, ..., 2, 0], whose non-zero entries mark the members who have participated in at least one team with s12.

You find the column indexes of the non-zero entries:

s12: [0, 1, 3, 0, 0, ..., 2, 0] ==> {m1, m2, ..., m{|member|-2}}
s15: [2, 0, 0, 0, 0, ..., 2, 0] ==> {m0, ..., m{|member|-2}}
s3:  [0, 1, 0, 0, 0, ..., 0, 0] ==> {m1}

We can take either the intersection or the union of these sets as the qualified set for team #5.
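
To make the intersection/union choice concrete, here is a small sketch. The count vectors are made-up five-member rows, not the elided rows from the example above, and the `rows` dictionary is a hypothetical stand-in for the selected rows of skill_member.

```python
import numpy as np

# Hypothetical rows of skill_member for the three skills of one team.
# Each entry counts the teams in which that member appeared with the skill.
rows = {
    12: np.array([0, 1, 3, 0, 2]),
    15: np.array([2, 0, 1, 0, 2]),
    3:  np.array([0, 1, 1, 0, 0]),
}

# Column indexes of the non-zero entries, one set of members per skill.
member_sets = [set(np.flatnonzero(r).tolist()) for r in rows.values()]

# Intersection: members who have every required skill.
intersection = set.intersection(*member_sets)
# Union: members who have at least one required skill.
union = set.union(*member_sets)

print(intersection, union)
```

In this toy case the intersection keeps a single member while the union keeps four of the five, which shows how much stricter the intersection is as a definition of "qualified".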

Hamedloghmani commented 1 year ago

This is my code block for the description you kindly mentioned above.

import pickle
import pandas as pd

with open('teamsvecs.pkl', 'rb') as f: teamsvecs = pickle.load(f)
teamids, skillvecs, membervecs = teamsvecs['id'], teamsvecs['skill'], teamsvecs['member']
skill_member = skillvecs.transpose() @ membervecs
popularity = pd.read_csv('popularity.csv')

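ratios = list()
for i in range(skillvecs.shape[0]):
    # column indexes of the skills required by team i (.rows assumes a scipy lil_matrix)
    skills = skillvecs[i].rows[0]
    qualified = list()
    for skill in skills:
        # members with a non-zero count for this skill
        qualified.append(skill_member[skill].nonzero()[1])
    # members who have every required skill
    intersect = set(qualified[0]).intersection(*qualified)

    # popularity label (True/False) of each qualified member
    labels = list()
    for member in intersect:
        labels.append(popularity.loc[popularity['memberidx']==member, 'popularity'].tolist()[0])
    # share of non-popular members; raises ZeroDivisionError if intersect is empty
    ratios.append(labels.count(False) / len(intersect))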

hosseinfani commented 1 year ago

@Hamedloghmani how about this:

skill_indexes = teamsvecs['skill'][5].nonzero()  # or cols()
members = np.array(skill_member[skill_indexes])  # ==> this raises an error: please fix it
intersect = reduce(lambda x, y: x & y, members).nonzero()
union = reduce(lambda x, y: x | y, members).nonzero()

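from functools import reduce
import numpy as np

a = np.array([[1,0,0],[0,1,0],[0,0,1],[1,1,1]])
reduce(lambda x, y: x & y, a).nonzero()  # AND across rows: columns set in every row
reduce(lambda x, y: x | y, a).nonzero()  # OR across rows: columns set in any row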
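
One caveat with the `&` reduction is that skill_member holds counts, not 0/1 flags, and `&` on integers is bitwise (for example `1 & 2 == 0`). The sketch below, on made-up counts, binarizes the rows first and then reduces with NumPy's logical ufuncs. It assumes the selected rows are already a dense array; with a SciPy sparse skill_member they would presumably need converting first, e.g. via `.toarray()`.

```python
from functools import reduce
import numpy as np

# Hypothetical dense counts for three required skills (rows) and five members (columns).
counts = np.array([[0, 1, 3, 0, 2],
                   [2, 2, 0, 0, 2],
                   [0, 1, 1, 0, 1]])

# Bitwise AND on raw counts loses members: 1 & 2 == 0 and 2 & 1 == 0.
raw = reduce(lambda x, y: x & y, counts)

# Binarize first, then combine the boolean rows across skills.
mask = counts > 0
intersect = np.flatnonzero(np.logical_and.reduce(mask, axis=0))  # every skill
union = np.flatnonzero(np.logical_or.reduce(mask, axis=0))       # any skill

print(raw, intersect, union)
```

Because every boolean row has one entry per member, the rows all have the same length and can be stacked without padding.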

Hamedloghmani commented 1 year ago

Hi @hosseinfani, today I tried multiple variations of both our implementations. The following is the most efficient implementation I came up with. It combines our code, and I tried to keep it close to your coding style. I also measured the runtime of the different variations to be sure. I would be happy to hear your opinion on this.

ratios = list()
for i in range(skillvecs.shape[0]):

    skill_indexes = skillvecs[i].nonzero()[1].tolist()
    members = [skill_member[idx].nonzero()[1] for idx in skill_indexes]
    intersect = set(members[0]).intersection(*members)
    labels = [popularity.loc[popularity['memberidx']==member, 'popularity'].tolist()[0] for member in intersect]
    ratios.append(labels.count(False) / len(intersect))

hosseinfani commented 1 year ago

@Hamedloghmani please go ahead with the results; we'll have time for better implementations later. Also, are you sure whether intersection or union is the better choice? Intersection may produce empty results, so you need to skip the teams that end up with an empty set of qualified members.
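
Skipping teams with an empty qualified set could look roughly like the sketch below. It is one possible approach, not the code used in Adila: `non_popular_ratio`, the `is_popular` dictionary, and the sample sets are all hypothetical.

```python
def non_popular_ratio(qualified, is_popular):
    """Return the share of non-popular members, or None for an empty set."""
    if not qualified:
        # Nothing to divide by; the caller decides how to treat this team.
        return None
    non_popular = sum(1 for m in qualified if not is_popular[m])
    return non_popular / len(qualified)

# Hypothetical popularity labels and per-team qualified sets.
is_popular = {0: True, 1: False, 2: False, 3: True}
qualified_per_team = [{0, 1}, set(), {1, 2}, {0, 3}]

# Keep only teams whose qualified set is non-empty.
ratios = [
    r
    for r in (non_popular_ratio(q, is_popular) for q in qualified_per_team)
    if r is not None
]
print(ratios)
```

The filter checks `is not None` rather than truthiness so that a legitimate ratio of 0.0 is not dropped along with the empty teams.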

Hamedloghmani commented 1 year ago

@hosseinfani Thanks for the feedback. My initial response somehow got lost and was not sent; I apologize for that.

Regarding the empty set, that's a solid point. I'm not sure yet what the best way to handle it is, because even with logical & the result might be all zeros. I went with set intersection since it was time-efficient and because .nonzero() returns results of different lengths, which would otherwise require padding (e.g., for skill 1 it returns [12, 16, 43] and for skill 12 it returns [12, 44, 67, 88, 95, 99]). I'll keep looking for a solution that addresses both issues.
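
For reference, index arrays of different lengths can be combined without padding by folding NumPy's set routines over them. The sketch below reuses the two example arrays from this comment; the `members` list and the emptiness check are illustrative assumptions rather than a settled solution.

```python
from functools import reduce
import numpy as np

# Member indexes returned by .nonzero() for two skills (different lengths).
members = [np.array([12, 16, 43]), np.array([12, 44, 67, 88, 95, 99])]

# Fold pairwise set operations over the list; no padding is required.
intersect = reduce(np.intersect1d, members)
union = reduce(np.union1d, members)

# An empty intersection shows up as a zero-size array and can be skipped.
has_qualified = intersect.size > 0

print(intersect, union, has_qualified)
```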

fani-lab / Adila

equality of opportunity fairness criteria #63