fani-lab / OpeNTF

Neural machine learning methods for Team Formation problem.

Static Curriculum Learning Using Popularity Labels #228

Open rezaBarzgar opened 11 months ago

rezaBarzgar commented 11 months ago

To define a static difficulty measurer for the task of neural team formation, we can use the popularity labels of the teams. Assuming that we have a popularity label for each team, we can use torch.utils.data.SubsetRandomSampler to customize the proportion of popular and non-popular teams in each batch. There can be two different approaches to applying CL to this task.

Currently, we only have popularity labels for individual experts, not teams. One possible solution is to assign a popularity label to a team based on the number of popular/non-popular experts in it; for example, a team with a majority of popular experts can be considered a popular team.

@hosseinfani, since the epoch-based approach is more common in the CL literature, I'm starting with that. I'm posting this here to confirm the team popularity labeling and the static CL approach with you.
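As a minimal sketch of the sampling idea above: assuming team-level popularity labels are already available, SubsetRandomSampler can restrict each epoch to a chosen subset of team indices, starting with mostly popular (easy) teams and gradually mixing in non-popular ones. The names (curriculum_indices) and the toy data below are illustrative, not OpeNTF code.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Hypothetical team popularity labels (1 = popular, 0 = non-popular);
# in practice these would come from the team labeling step discussed above.
team_labels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
features = torch.randn(len(team_labels), 4)  # toy feature vectors
dataset = TensorDataset(features, torch.tensor(team_labels))

popular_idx = np.flatnonzero(team_labels == 1)
nonpopular_idx = np.flatnonzero(team_labels == 0)
rng = np.random.default_rng(0)

def curriculum_indices(epoch, total_epochs):
    # Static curriculum: all popular teams are always included; the fraction
    # of non-popular (hard) teams grows linearly from 0 to 1 over the epochs.
    frac_nonpopular = epoch / max(1, total_epochs - 1)
    n_non = int(frac_nonpopular * len(nonpopular_idx))
    chosen = np.concatenate([popular_idx, rng.choice(nonpopular_idx, n_non, replace=False)])
    return chosen.tolist()

for epoch in range(3):
    sampler = SubsetRandomSampler(curriculum_indices(epoch, 3))
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)
    for x, y in loader:
        pass  # training step would go here
```

With this scheme, epoch 0 sees only popular teams and the last epoch sees the full dataset; a batch-level variant would instead fix the popular/non-popular ratio inside each batch.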

hosseinfani commented 11 months ago

@rezaBarzgar "a team with a majority of popular experts" >> you need to specify what "majority" means, i.e., 60%, ..., 90%, 100% of a team? It may also depend on the domain: in a paper, a team may have 1-2 popular authors out of 4-5 (the average team size), while in movies, a popular movie's cast and crew are almost all (90-100%) popular.

Anyways, you need to specify a reasonable percentage and see the results.

rezaBarzgar commented 11 months ago

I calculate a popularity label for each team in IMDb based on the proportion of popular experts in the team. If the proportion of popular experts in a team is greater than the specified threshold, the team is labelled popular; otherwise, it is labelled not popular.

Here is the code (I'll also push it with my next updates):

import pandas as pd
import numpy as np
import pickle

def label_generator(vecs_path, expert_popularity_label_path, proportion):
    # teamsvecs['member'] is a sparse (lil) teams-by-experts occurrence matrix
    with open(vecs_path + '/teamsvecs.pkl', 'rb') as file:
        teamsvecs = pickle.load(file)
    experts_popularity_label = pd.read_csv(expert_popularity_label_path, index_col='memberidx').to_numpy().squeeze()
    team_popularity_label = []
    for team in teamsvecs['member']:
        experts = team.rows[0]  # column indices of this team's members
        populars_count = experts_popularity_label[experts].sum()
        # a team is popular iff its share of popular experts exceeds the threshold
        team_popularity_label.append(populars_count / len(experts) > proportion)

    team_popularity_label = np.array(team_popularity_label)
    print(f'percentage of popular teams: {(team_popularity_label.sum() / len(team_popularity_label)) * 100:.1f}')
    return team_popularity_label

if __name__ == '__main__':
    vecs_pth = './data/preprocessed/imdb/title.basics.tsv.filtered.mt75.ts3'
    expert_popularity_label_pth = './data/preprocessed/imdb/popularity.imdb.mt75.csv'
    for proportion in [0.1, 0.3, 0.5, 0.7, 0.8, 0.9]:
        print(f'proportion: {proportion}')
        label_generator(vecs_pth, expert_popularity_label_pth, proportion)

Here are the results for different proportions:

proportion: 0.1 >> percentage of popular teams: 86.4
proportion: 0.3 >> percentage of popular teams: 82.7
proportion: 0.5 >> percentage of popular teams: 66.8
proportion: 0.7 >> percentage of popular teams: 52.7
proportion: 0.8 >> percentage of popular teams: 42.8
proportion: 0.9 >> percentage of popular teams: 40.7

hosseinfani commented 11 months ago

So, go ahead with 0.7, but schedule the runs for all other proportions; also include 0.0 and 1.0 for testing purposes.