How to ensure that people in the test set have been seen in the train set?

littlebeanbean7 commented 2 weeks ago

Hello Fani Lab team,

I hope you are well! I would like to use OpeNTF to run a baseline model. Ideally, your function would take the ids of my train/test sets (e.g., paper ids in the dblp data) as an input parameter, which would be very handy, but I can't find such an option.

I found in main.py that, if I don't do a time split, the train/test split calls sklearn's train_test_split().

Then, my question is: do you ensure that people (e.g., authors in the dblp data) in the test set have appeared in the train set? If yes, could you please point me to the code where you do this? If not, could you please explain why we don't need to do that?

Thank you very much, Lingling

hosseinfani commented 2 weeks ago

Hi @littlebeanbean7, there is no guarantee. The split is based on team instances, so there is a chance that an expert, or even all experts of a team, have not been seen during training.
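To see how often this happens on a given split, one can compare the experts in the train and test teams directly. The following is a minimal sketch, not OpeNTF code: the toy `teams` list and the helper `unseen_experts` are hypothetical names used only for illustration, and only sklearn's `train_test_split()` is a real library call.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): each team is a set of expert ids.
# This is NOT OpeNTF's internal data format.
teams = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"d"}, {"a", "d"}, {"e", "b"}]


def unseen_experts(teams, train_idx, test_idx):
    """Return experts that occur in test teams but in no train team."""
    train_experts = set().union(*(teams[i] for i in train_idx))
    test_experts = set().union(*(teams[i] for i in test_idx))
    return test_experts - train_experts


# Split over team indices, i.e., over team instances rather than experts.
indices = list(range(len(teams)))
train_idx, test_idx = train_test_split(indices, test_size=0.33, random_state=0)

# Any expert printed here is zero-shot for this particular split.
print(unseen_experts(teams, train_idx, test_idx))
```

An empty set means every test expert was seen in training for that split; a non-empty set lists the zero-shot experts.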

However, there is a filtering step in our pipeline that removes sparse experts, that is, experts who appear in fewer than a minimum number of teams. This way, each remaining expert has at least that many teams in the dataset. Hence, when you split, there is a chance that the expert ends up in both the train and test sets through different teams.
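The idea behind such a filter can be sketched as follows. This is an illustrative single-pass version under assumed names (`filter_sparse_experts`, `min_teams`), not the actual OpeNTF implementation, which may differ, for example in how it handles teams that lose members.

```python
from collections import Counter


def filter_sparse_experts(teams, min_teams):
    """Drop experts that appear in fewer than `min_teams` teams.

    `teams` is a list of sets of expert ids (hypothetical format).
    Counts are computed once on the input; teams left empty are removed.
    """
    counts = Counter(expert for team in teams for expert in team)
    pruned = [{e for e in team if counts[e] >= min_teams} for team in teams]
    return [team for team in pruned if team]


teams = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"d"}]
# "d" appears in only one team, so it is filtered out along with its team.
print(filter_sparse_experts(teams, min_teams=2))
```

Note that a higher threshold lowers the chance of unseen test experts but does not remove it, since all teams of an expert can still land on the same side of the split.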

Also, since the evaluation is n-fold and the final result is the average over the n models, each trained on its own fold, there is an even lower chance of an expert being zero-shot.

littlebeanbean7 commented 2 weeks ago

Thank you for your kind reply @hosseinfani! I will add a chunk of code to main.py that loads my own train and test sets, to ensure a fair comparison with my experiment.
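One possible shape for such a chunk is sketched below. It assumes that the teams are held in a list whose order matches a list of team ids (e.g., dblp paper ids); the function name `split_from_ids` and its arguments are hypothetical and do not correspond to anything in main.py.

```python
def split_from_ids(team_ids, train_ids, test_ids):
    """Map predefined train/test team ids to row indices.

    `team_ids` lists the id of each team in dataset order (assumed layout).
    Raises if the two sets overlap or reference an unknown id.
    """
    train_ids, test_ids = set(train_ids), set(test_ids)
    overlap = train_ids & test_ids
    if overlap:
        raise ValueError(f"ids in both train and test: {sorted(overlap)}")

    position = {tid: i for i, tid in enumerate(team_ids)}
    missing = (train_ids | test_ids) - set(position)
    if missing:
        raise KeyError(f"ids not found in dataset: {sorted(missing)}")

    train_idx = sorted(position[t] for t in train_ids)
    test_idx = sorted(position[t] for t in test_ids)
    return train_idx, test_idx


# Example with made-up paper ids.
train_idx, test_idx = split_from_ids(
    ["p1", "p2", "p3", "p4"], train_ids=["p1", "p3"], test_ids=["p4"]
)
print(train_idx, test_idx)
```

The resulting index lists could then stand in for the output of train_test_split(), provided the rest of the pipeline only needs train and test indices.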

fani-lab / OpeNTF

How to ensure that people in the test set have been seen in the train set? #265