fani-lab / OpeNTF

Neural machine learning methods for Team Formation problem.

Static Curriculum Learning Using Popularity Labels #228

Open rezaBarzgar opened 11 months ago

rezaBarzgar commented 11 months ago

To define a static difficulty measurer for the task of neural team formation, we can use the popularity labels of the teams. Assuming that we have a popularity label for each team, we can use torch.utils.data.SubsetRandomSampler to customize the proportion of popular and non-popular teams in each batch. There can be two different approaches to applying CL to this task.

Currently, we only have popularity labels for individual experts, not teams. One possible solution is to assign a popularity label to a team based on the number of popular/non-popular experts in it; for example, a team with a majority of popular experts can be considered a popular team.

@hosseinfani, since the epoch-based approach is more common in the CL literature, I'm starting with that. I'm posting this here to confirm the team popularity labeling and the static CL approach with you.
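As a minimal sketch of the sampling idea above: assuming team-level popularity labels are already available, SubsetRandomSampler can restrict each epoch to a chosen subset of team indices, starting with mostly popular (easy) teams and gradually mixing in non-popular ones. The names (curriculum_indices) and the toy data below are illustrative, not OpeNTF code.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Hypothetical team popularity labels (1 = popular, 0 = non-popular);
# in practice these would come from the team labeling step discussed above.
team_labels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
features = torch.randn(len(team_labels), 4)  # toy feature vectors
dataset = TensorDataset(features, torch.tensor(team_labels))

popular_idx = np.flatnonzero(team_labels == 1)
nonpopular_idx = np.flatnonzero(team_labels == 0)
rng = np.random.default_rng(0)

def curriculum_indices(epoch, total_epochs):
    # Static curriculum: all popular teams are always included; the fraction
    # of non-popular (hard) teams grows linearly from 0 to 1 over the epochs.
    frac_nonpopular = epoch / max(1, total_epochs - 1)
    n_non = int(frac_nonpopular * len(nonpopular_idx))
    chosen = np.concatenate([popular_idx, rng.choice(nonpopular_idx, n_non, replace=False)])
    return chosen.tolist()

for epoch in range(3):
    sampler = SubsetRandomSampler(curriculum_indices(epoch, 3))
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)
    for x, y in loader:
        pass  # training step would go here
```

With this scheme, epoch 0 sees only popular teams and the last epoch sees the full dataset; a batch-level variant would instead fix the popular/non-popular ratio inside each batch.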

hosseinfani commented 11 months ago

@rezaBarzgar "a team with a majority of popular experts" >> you need to specify what "majority" means, i.e., 60%, ..., 90%, 100% of a team? It may also depend on the domain: in a paper, a team may have 1-2 popular authors out of 4-5 (the average team size), while in movies, a popular movie's cast and crew are almost all (90-100%) popular.

Anyways, you need to specify a reasonable percentage and see the results.

rezaBarzgar commented 11 months ago

I calculate a popularity label for each team in IMDb based on the proportion of popular experts in the team. If the proportion of popular experts in a team is greater than the specified threshold, the team is labelled popular; otherwise, it is labelled not popular.

Here is the code (I'll also push it with my next updates):

import pandas as pd
import numpy as np
import pickle

def label_generator(vecs_path, expert_popularity_label_path, proportion):
    # teamsvecs['member'] is a sparse (lil) teams-by-experts occurrence matrix
    with open(vecs_path + '/teamsvecs.pkl', 'rb') as file:
        teamsvecs = pickle.load(file)
    experts_popularity_label = pd.read_csv(expert_popularity_label_path, index_col='memberidx').to_numpy().squeeze()
    team_popularity_label = []
    for team in teamsvecs['member']:
        experts = team.rows[0]  # column indices of this team's members
        populars_count = experts_popularity_label[experts].sum()
        # a team is popular iff its share of popular experts exceeds the threshold
        team_popularity_label.append(populars_count / len(experts) > proportion)

    team_popularity_label = np.array(team_popularity_label)
    print(f'percentage of popular teams: {(team_popularity_label.sum() / len(team_popularity_label)) * 100:.1f}')
    return team_popularity_label

if __name__ == '__main__':
    vecs_pth = './data/preprocessed/imdb/title.basics.tsv.filtered.mt75.ts3'
    expert_popularity_label_pth = './data/preprocessed/imdb/popularity.imdb.mt75.csv'
    for proportion in [0.1, 0.3, 0.5, 0.7, 0.8, 0.9]:
        print(f'proportion: {proportion}')
        label_generator(vecs_pth, expert_popularity_label_pth, proportion)

Here are the results for different proportions:

proportion: 0.1 >> percentage of popular teams: 86.4
proportion: 0.3 >> percentage of popular teams: 82.7
proportion: 0.5 >> percentage of popular teams: 66.8
proportion: 0.7 >> percentage of popular teams: 52.7
proportion: 0.8 >> percentage of popular teams: 42.8
proportion: 0.9 >> percentage of popular teams: 40.7

hosseinfani commented 11 months ago

So, go ahead with 0.7, but schedule the runs for all other proportions; also include 0.0 and 1.0 for testing purposes.