fani-lab / OpeNTF

Neural machine learning methods for Team Formation problem.

Exploring various temporal approaches for neural team formation #66

Open VaghehDashti opened 2 years ago

VaghehDashti commented 2 years ago

Two directions to explore:

- Sequence-based vs. temporal alignment-based: temporal alignment considers the difference in time, while sequence-based only considers the order.
- Graph-based vs. non-graph-based

TODO:

VaghehDashti commented 2 years ago
  1. temporal toy dataset
  2. naive baselines:
     2.1. time as a feature input (one-hot encoded; subset of skill and time) (see the sketch after this list)
     2.2. temporal vector representation learning (transfer learning)
     2.3. knn in vector space
  3. end-to-end GNN
  4. how far back in history should be used to predict the future
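
A minimal sketch of what baseline 2.1 could look like, assuming a dense skill-occurrence vector and a fixed list of dataset years (the function and variable names are hypothetical, not from the repo):

```python
import numpy as np

def temporal_input(skill_vec: np.ndarray, year: int, years: list) -> np.ndarray:
    """Append a one-hot-encoded year to the skill-occurrence vector."""
    year_onehot = np.zeros(len(years))
    year_onehot[years.index(year)] = 1.0
    return np.concatenate([skill_vec, year_onehot])

# e.g., skills {s0, s2} observed in 2019, with dataset years 2018-2020
x = temporal_input(np.array([1., 0., 1.]), 2019, [2018, 2019, 2020])
# -> array([1., 0., 1., 0., 1., 0.])
```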
VaghehDashti commented 2 years ago

@hosseinfani Following Stanford's CS224W course on Machine Learning with Graphs, I have studied up to week 9 so far. I need to study up to week 13 (the community detection lecture), which I will do by next week's meeting. At the same time, I'm learning how to code GNNs through exercises from the course.

The first paper I want to read is the following, as I believe I can get ideas for our problem from it: 2021 - ACM - Learning Dynamic Embeddings for Temporal Knowledge Graphs.

Finally, the main thing I need to figure out is the best way to represent our data as a (temporal) graph, how to phrase our problem in a GNN architecture, and how to train, validate, and test our model with graphs. It's more complicated than simple neural networks. At this point in my learning, I feel like it cannot be phrased as an end-to-end GNN problem. I need to learn more :)

hosseinfani commented 2 years ago

@VaghehDashti please create a summary of the paper when you read it.

VaghehDashti commented 2 years ago

@hosseinfani I finished watching the course up to week 13. Unfortunately, the community detection lecture formulates the problem as an unsupervised ML problem (clustering), which makes it different from our problem definition. Moreover, I summarized the paper mentioned above and created an issue for it. Their proposed method is cumbersome, and I need some help. We can discuss this further in the weekly meeting.

hosseinfani commented 2 years ago

@VaghehDashti thanks for the update. I don't think unsupervised ml is different from our task. we'll discuss it today.

hosseinfani commented 2 years ago

Hi @VaghehDashti, please see my commit 9d3b05c on the imdb branch.

Please double-check the logic and flow. We should continue the experiments on this flow.

Basically, I create a folder for each year, then call the base models to train on that year's samples, initialized on the previous year's weights, and put the results in the year's folder. Please see:

https://github.com/fani-lab/OpeNTF/blob/imdb/src/mdl/tntf.py
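
For reference, here is a rough sketch of that per-year flow as I understand it (the helpers `build_model` and `train_one_year` and the folder layout are placeholders, not the actual tntf.py code):

```python
import os, copy, torch

def streaming_train(splits_by_year, output_dir, build_model, train_one_year):
    """Train one model per year, warm-started from the previous year's weights."""
    prev_state = None
    for year in sorted(splits_by_year):
        year_dir = os.path.join(output_dir, str(year))
        os.makedirs(year_dir, exist_ok=True)

        model = build_model()
        if prev_state is not None:
            model.load_state_dict(prev_state)  # initialize on last year's weights, not randomly

        train_one_year(model, splits_by_year[year], year_dir)  # results go into the year's folder
        torch.save(model.state_dict(), os.path.join(year_dir, 'model.pt'))
        prev_state = copy.deepcopy(model.state_dict())
```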

VaghehDashti commented 2 years ago

Non-temporal baselines:

VaghehDashti commented 2 years ago

Results of experiments on DBLP show that the streaming scenario improves model performance for bnn. Adding time as input also improves model performance for bnn. lstm and transformer perform poorly both with and without the streaming scenario.

Here are the AUC-ROC scores for the finished experiments:

No streaming scenario:
- bnn_emb with unigram_b: 0.668093
- lstm: 0.5010
- transformer: 0.5010

Streaming scenario:
- tbnn_emb with unigram_b: 0.746918
- bnn_dt2v_emb with unigram_b: 0.77006 (streaming + time as input)
- lstm: 0.4999
- transformer: 0.4999

The following are the NDCG@10 scores:

No streaming scenario:
- bnn_emb with unigram_b: 0.2397
- lstm: 0.3087
- transformer: 0.2928

Streaming scenario:
- tbnn_emb with unigram_b: 0.4916
- bnn_dt2v_emb with unigram_b: 0.7465 (streaming + time as input)
- lstm: 0.1231
- transformer: 0.124

The code is being run for IMDB. I will update here afterwards.

hosseinfani commented 2 years ago

@VaghehDashti please put them in a chart so we can compare easily.

VaghehDashti commented 2 years ago

image

hosseinfani commented 2 years ago

@VaghehDashti lstm/transformer are based on the nmt, right?

Using streaming on the lstm/transformer has no pos/neg effect? pls debug and make sure.

bnndt2v should be tbnndt2v, right?

When running for imdb, note that the best model for imdb was different from dblp.

How about temporal ir or recsys baselines?

VaghehDashti commented 2 years ago

> @VaghehDashti lstm/transformer are based on the nmt, right?

Yes

> Using streaming on the lstm/transformer has no pos/neg effect? pls debug and make sure.

NDCG@10 decreases with the streaming scenario so there's no problem in the code.

> bnndt2v should be tbnndt2v, right?

tbnn_dt2v_emb would add one 1 (the time vector) to the input, so it should stay bnn_dt2v_emb since time is already included as an aspect there.

> When running for imdb, note that the best model for imdb was different from dblp.

Yes

> How about temporal ir or recsys baselines?

I will be working on it this week.

hosseinfani commented 2 years ago

@VaghehDashti we don't have the result of tbnn_dt2v_emb yet?

VaghehDashti commented 2 years ago

> @VaghehDashti we don't have the result of tbnn_dt2v_emb yet?

bnn_dt2v_emb already has time embedded as an aspect, because the time is added to the input during doc2vec training.

hosseinfani commented 2 years ago

not sure I understood, can you please shortly (one line) explain each variation? thanks.

VaghehDashti commented 2 years ago

tbnn_emb learns the input embeddings with doc2vec over skills only and then goes through streaming-scenario learning; it's the same as bnn_emb but with the streaming scenario. bnn_dt2v_emb learns the input embeddings with doc2vec over (skills + time) and then goes through streaming-scenario learning.
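
To make the distinction concrete, here is a toy sketch of the two embedding setups with gensim's Doc2Vec (the variable names and the 'dt<year>' token format are illustrative assumptions, not the exact team2vec code):

```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

teams = [(['s1', 's2'], 2000), (['s2', 's3'], 2001)]  # toy (skills, year) pairs

# skills-only documents (as in bnn_emb / tbnn_emb)
skill_docs = [TaggedDocument(words=skills, tags=[str(i)])
              for i, (skills, _) in enumerate(teams)]

# skills + time documents (as in bnn_dt2v_emb): the year is just another token
dt2v_docs = [TaggedDocument(words=skills + [f'dt{year}'], tags=[str(i)])
             for i, (skills, year) in enumerate(teams)]

d2v_skills = Doc2Vec(skill_docs, vector_size=8, min_count=1, epochs=10)
d2v_time = Doc2Vec(dt2v_docs, vector_size=8, min_count=1, epochs=10)
```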

hosseinfani commented 2 years ago

why don't you add letter "t" to the bnn_dt2v_emb then?

we have tbnn_emb and tbnn_dt2v_emb

VaghehDashti commented 2 years ago

I just double-checked the code to see why I didn't have 't' at the beginning. Having 't' at the beginning would add one 1 (the time vector) to the input vector, so my previous explanation of tbnn_emb was incorrect.

Here is the updated definition of our models (I will shortly update the code):

In summary:

With the new definitions here are the results for dblp: image As you can see, I need to run tbnn_emb because the previous results were actually for tbnn_emb_a1.

VaghehDashti commented 2 years ago

Here are the final results on dblp, including tbnn_emb with unigram_b: image As can be seen, except for the nmt-based models, the streaming scenario improves model performance. Adding time as an aspect also improves performance, but only if it is done through doc2vec learning, and even then the gain is less significant. The _a1 method, which adds one 1 to every input vector, slightly decreases model performance.

hosseinfani commented 2 years ago

@VaghehDashti Now, it's more clear. Thank you.

VaghehDashti commented 2 years ago

Here are the results on imdb: image The results of tfnn and tfnn_a1 are strange. I will re-run the code, but I don't think that will change the results unless I change the learning rate or some other hyperparameter. What do you think @hosseinfani?

Also, checking the results from our previous paper, fnn without negative sampling does better on the IR metrics @2 and @5, but bnn_emb with unigram_b has the best performance on the IR metrics @10 and also on AUC-ROC. And here we can see that tbnn_emb, tbnn_emb_a1, and tbnn_dt2v_emb (all with unigram_b negative sampling) outperform the non-temporal bnn_emb, just like on dblp, which is great. I have started running the pipeline on uspt and will update here when the results are ready.

hosseinfani commented 1 year ago

@VaghehDashti I was debugging the team2vec for dt2v and saw this. Is that ok? I thought we concatenate the year to the skills to make it temporal.

image

hosseinfani commented 1 year ago

@VaghehDashti I pushed a few lines of code to use the i2y index only and drop 'i2dt', 'dt2i', and 'i2tdt'. When you debug the issue in the post above, use i2y to generate the year stamp or year index. Let me know if we need to discuss this more.

https://github.com/fani-lab/OpeNTF/blob/b30aff972346b6e2fc1d4e3f56d2999fcd3070bd/src/mdl/team2vec.py#L43

VaghehDashti commented 1 year ago

hi @hosseinfani, thank you for catching the bug. I don't know why I had decided to use the index of the datetime instead of the actual datetime! Also, I shouldn't have appended the datetime each time; instead, I should have just overwritten the previous datetime_doc. I fixed the bug and pushed the code. I commented out your new code because, with the fixed code, we don't need the "if" in each iteration. I didn't remove your code, so when you review it you can compare the datetime_doc and the year_doc created with your code; they are always the same without the "if". I used the i2tdt indices to create the temporal skills, so now each instance's input looks like ['s1','s2','dt2000'] and the self.docs variable no longer contains extraneous datetimes.

I will re-run tbnn_dt2v_emb on all datasets with the new code.
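
For clarity, here is a simplified sketch of what the fixed doc construction is meant to produce (the dictionary shapes and the year lookup are illustrative assumptions, not the actual team2vec.py code): each team contributes one document made of its skills plus a single 'dt<year>' token, with nothing carried over from previous iterations.

```python
def build_temporal_docs(team_skills, team_year):
    """team_skills: {team_idx: ['s1', 's2', ...]}, team_year: {team_idx: year}"""
    docs = []
    for i, skills in team_skills.items():
        year_doc = skills + [f'dt{team_year[i]}']  # e.g., ['s1', 's2', 'dt2000']
        docs.append(year_doc)  # one fresh doc per team; no stale datetimes appended
    return docs

docs = build_temporal_docs({0: ['s1', 's2'], 1: ['s3']}, {0: 2000, 1: 2001})
# -> [['s1', 's2', 'dt2000'], ['s3', 'dt2001']]
```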

hosseinfani commented 1 year ago

@VaghehDashti