Open VaghehDashti opened 2 years ago
@hosseinfani Following Stanford's CS224W course on machine learning with graphs, I have studied through week 9 so far. I need to study through week 13 (the community detection lecture), which I will do by next week's meeting. At the same time, I'm learning how to code GNNs through exercises from the course.
The first paper I want to read is the following paper as I believe I can get ideas for our problem: 2021 - ACM - Learning Dynamic Embeddings for Temporal Knowledge Graphs
Finally, the main question I need to figure out is how best to represent our data as a (temporal) graph, i.e., how to phrase our problem in a GNN architecture and how to train, validate, and test our model with graphs. It's more complicated than simple neural networks. At this point in my learning, I feel it cannot be phrased as an end-to-end GNN problem. I need to learn more :)
@VaghehDashti please create a summary of the paper when you read it.
@hosseinfani I finished watching the course through week 13. Unfortunately, the community detection lecture formulates the problem as an unsupervised ML problem (clustering), which makes it different from our problem definition. Moreover, I summarized the paper mentioned above and created an issue. Their proposed method is cumbersome, and I need some help. We can discuss this further in the weekly meeting.
@VaghehDashti thanks for the update. I don't think unsupervised ML is different from our task. We'll discuss it today.
Hi @VaghehDashti, Please see my commit on imdb branch 9d3b05c
Please double-check the logic and flow. We should continue the experiments on this flow.
Basically, I create a folder for each year, then call the base models to train on that year's samples, initialized with the previous year's weights, and store the results in that year's folder. Please see:
https://github.com/fani-lab/OpeNTF/blob/imdb/src/mdl/tntf.py
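The year-by-year flow above can be sketched as follows. This is a minimal self-contained sketch, not the actual `tntf.py` code: `DummyModel`, `train_streaming`, and the per-year sample lists are hypothetical stand-ins.

```python
import os
import tempfile

class DummyModel:
    """Hypothetical stand-in for a base model (bnn, lstm, ...)."""
    def __init__(self):
        self.weights = [0.0]
    def set_weights(self, w):
        self.weights = list(w)
    def get_weights(self):
        return list(self.weights)
    def fit(self, samples):
        # pretend training nudges the weights by the number of samples
        self.weights = [w + len(samples) for w in self.weights]

def train_streaming(years, samples_per_year, output_root):
    """One model per year, warm-started from the previous year's weights."""
    prev_weights, final = None, {}
    for year in years:
        year_dir = os.path.join(output_root, str(year))
        os.makedirs(year_dir, exist_ok=True)   # a folder per year, as above
        model = DummyModel()
        if prev_weights is not None:
            model.set_weights(prev_weights)    # warm start from last year
        model.fit(samples_per_year[year])
        prev_weights = model.get_weights()
        final[year] = prev_weights[0]
        # metrics/checkpoints would be written into year_dir here
    return final

out = train_streaming([2000, 2001], {2000: ['t1', 't2'], 2001: ['t3']},
                      tempfile.mkdtemp())
# weights accumulate across years: 2000 -> 2.0, 2001 -> 3.0
```

The point of the warm start is that each year's model begins from the knowledge learned on all earlier years instead of from scratch.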
Non-temporal baselines:
Results of experiments on DBLP show that the streaming scenario improves model performance for bnn. Adding time as input also improves bnn's performance. lstm and transformer perform poorly both with and without the streaming scenario.
Here are the AUCROC and NDCG@10 scores for the finished experiments:

| model | scenario | AUCROC | NDCG@10 |
|---|---|---|---|
| bnn_emb (unigram_b) | no streaming | 0.668093 | 0.2397 |
| lstm | no streaming | 0.5010 | 0.3087 |
| transformer | no streaming | 0.5010 | 0.2928 |
| tbnn_emb (unigram_b) | streaming | 0.746918 | 0.4916 |
| bnn_dt2v_emb (unigram_b) | streaming + time as input | 0.77006 | 0.7465 |
| lstm | streaming | 0.4999 | 0.1231 |
| transformer | streaming | 0.4999 | 0.124 |
The code is being run for IMDB. I will update here afterwards.
@VaghehDashti please put them in a chart so we can compare easily.
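One way to chart the comparison would be a grouped bar plot per model. A sketch with matplotlib, using the DBLP numbers reported above (the filename `dblp_comparison.png` is illustrative):

```python
import matplotlib
matplotlib.use('Agg')                       # headless backend, save to file
import matplotlib.pyplot as plt
import numpy as np

labels = ['bnn_emb\n(no stream)', 'lstm\n(no stream)', 'transformer\n(no stream)',
          'tbnn_emb\n(stream)', 'bnn_dt2v_emb\n(stream+time)',
          'lstm\n(stream)', 'transformer\n(stream)']
aucroc = [0.668093, 0.5010, 0.5010, 0.746918, 0.77006, 0.4999, 0.4999]
ndcg10 = [0.2397, 0.3087, 0.2928, 0.4916, 0.7465, 0.1231, 0.124]

x = np.arange(len(labels))
width = 0.35
fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(x - width / 2, aucroc, width, label='AUCROC')
ax.bar(x + width / 2, ndcg10, width, label='NDCG@10')
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=8)
ax.set_title('DBLP: streaming vs. non-streaming variants')
ax.legend()
fig.tight_layout()
fig.savefig('dblp_comparison.png')
```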
@VaghehDashti lstm/transformer are based on the nmt, right?
Using streaming on lstm/transformer has no positive/negative effect? Please debug and make sure.
bnndt2v should be tbnndt2v, right?
When running for imdb, note that the best model for imdb was different from dblp's.
How about temporal ir or recsys baselines?
@VaghehDashti lstm/transformer are based on the nmt, right?
Yes
Using streaming on lstm/transformer has no positive/negative effect? Please debug and make sure.
NDCG@10 decreases with the streaming scenario so there's no problem in the code.
bnndt2v should be tbnndt2v, right?
tbnn_dt2v_emb would add an extra 1 (the time vector) to the input, so it should be bnn_dt2v_emb, since time is already included as an aspect there.
When running for imdb, note that the best model for imdb was different from dblp's.
Yes
How about temporal ir or recsys baselines?
I will be working on it this week
@VaghehDashti we don't have the result of tbnn_dt2v_emb yet?
@VaghehDashti we don't have the result of tbnn_dt2v_emb yet?
bnn_dt2v_emb already has time as an aspect embedded in it through doc2vec training by adding the time into the input.
not sure I understood, can you please shortly (one line) explain each variation? thanks.
tbnn_emb learns the input embeddings with doc2vec for skills only and then goes through the streaming-scenario learning; it's the same as bnn_emb but with the streaming scenario. bnn_dt2v_emb learns the input embeddings with doc2vec for (skills + time) and then goes through the streaming-scenario learning.
why don't you add the letter "t" to bnn_dt2v_emb then?
we have tbnn_emb and tbnn_dt2v_emb
I just double-checked the code to see why I didn't have 't' at the beginning. Having 't' at the beginning would add an extra 1 (the time vector) to the input vector, so my previous explanation of tbnn_emb was incorrect.
Here is the updated definition of our models (I will shortly update the code):
In summary:
With the new definitions, here are the results for dblp. As you can see, I need to run tbnn_emb because the previous results were actually for tbnn_emb_a1.
Here are the final results on dblp, including tbnn_emb with unigram_b. As can be seen, except for the nmt-based models, the streaming scenario improves model performance. Adding time as an aspect also improves performance, but only if done through doc2vec learning, and even then the gain is less significant. The _a1 method, which adds an extra 1 to every input vector, slightly decreases performance.
@VaghehDashti Now it's clearer. Thank you.
Here are the results on imdb. The results of tfnn and tfnn_a1 are strange. I will re-run the code, but I don't think that will change the results unless I change the learning rate or some other hyperparameter. What do you think @hosseinfani?
Also, checking the results from our previous paper, fnn without negative sampling does better on the IR metrics @2 and @5, but bnn_emb with unigram_b has the best performance on the IR metrics @10 and on AUCROC. Here we can see that tbnn_emb, tbnn_emb_a1, and tbnn_dt2v_emb (all with unigram_b negative sampling) outperform the non-temporal bnn_emb, just like on dblp, which is great. I have started running the pipeline on uspt and will update here when the results are ready.
@VaghehDashti I was debugging the team2vec for dt2v and saw this. Is that ok? I thought we concatenated the year to the skill to make it temporal.
@VaghehDashti I pushed a few lines of code to use the i2y index only and drop 'i2dt', 'dt2i', and 'i2tdt'. When you debug the above post, use i2y to generate the year stamp or year index. Let me know if we need to discuss this more.
hi @hosseinfani, Thank you for finding the bug. I don't know why I decided to use the index of the datetime instead of the actual datetime! Also, I shouldn't have appended the datetime each time; instead, I should have overwritten the previous datetime_doc. I fixed the bug and pushed the code. I commented out your new code because, with the fix, we don't need the "if" in each iteration. I didn't remove your code, so when you review, you can compare the datetime_doc with the year_doc I created using your code: they are always the same, without the "if". I used the i2tdt indices to create the temporal skills, so now each instance's input will look like ['s1','s2','dt2000'] and the self.docs variable will not contain extraneous datetimes.
I will re-run tbnn_dt2v_emb on all datasets with the new code.
@VaghehDashti
I'm wondering if we should do ['dt2000_s1', 'dt2000_s2'] instead; not sure though. Look at the base paper for diachronic word embeddings.
I know we have an extra 'if' check, but please do that. I want to remove 'i2dt', 'dt2i', and 'i2tdt'. I think team2vec is the only place we use them.
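The two tokenizations under discussion differ in where the year attaches to the document. A sketch of both forms (the function name and tokens are illustrative; whether the prefixed form matches the diachronic embedding paper should be checked against that paper):

```python
def to_temporal_doc(skills, year, prefix_skills=False):
    """Build a doc2vec document for a team, in one of two temporal forms."""
    if prefix_skills:
        # year fused into every skill token: ['dt2000_s1', 'dt2000_s2']
        return [f'dt{year}_{s}' for s in skills]
    # single trailing year token: ['s1', 's2', 'dt2000']
    return skills + [f'dt{year}']

suffix = to_temporal_doc(['s1', 's2'], 2000)                        # current form
prefixed = to_temporal_doc(['s1', 's2'], 2000, prefix_skills=True)  # proposed form
```

With the trailing token, all skills in a year share one time symbol; with the prefixed form, each skill gets a year-specific vocabulary entry, which is closer in spirit to diachronic word embeddings but grows the vocabulary by a factor of the number of years.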
- Sequence-based vs. temporal alignment-based: temporal alignment considers the difference in time, while sequence-based only considers the order.
- Graph-based vs. non-graph-based.

TODO: