fani-lab / OpeNTF

Neural machine learning methods for the Team Formation problem.

GNN for Graph based embedding generation #217

Open jamil2388 opened 8 months ago

jamil2388 commented 8 months ago

Starting this issue to track my learning progress on GNNs.

jamil2388 commented 8 months ago

I started a new project where I am trying to run a sample node-level classification task on the 'Cora' dataset. I created the model and proceeded with the forward pass properly; I understand how GCNConv layers perform one hop of message passing per layer. But something is wrong with the backward pass: the weights are not being updated, so no learning is happening. I am trying to address this issue.
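A minimal sketch of the kind of setup I am debugging (layer sizes and hyperparameters below are placeholders, not my exact code); if the zero_grad / backward / step sequence is broken, the weights never update, which matches the symptom:

```python
# Minimal sketch (not the project code): a 2-layer GCN on Cora with PyG.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root='data/Cora', name='Cora')
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_features, 16)  # first hop of message passing
        self.conv2 = GCNConv(16, dataset.num_classes)   # second hop

    def forward(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

model = GCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()   # without backward() + step(), no weights are ever updated
    optimizer.step()
```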

Meanwhile, I also tried to understand how PyG stores the graph structure for different datasets. It turns out there is a unified structure for all graph types, torch_geometric.data.Data, and I learned how the attributes x, y, and edge_index collectively store the graph information.
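As a small illustration of that unified structure (toy numbers, not a real dataset):

```python
# Toy torch_geometric.data.Data: 3 nodes, 2 undirected edges (stored as 4 directed ones).
import torch
from torch_geometric.data import Data

x = torch.tensor([[1.0], [2.0], [3.0]])        # node features: [num_nodes, num_node_features]
y = torch.tensor([0, 1, 0])                    # node labels
edge_index = torch.tensor([[0, 1, 1, 2],       # source nodes
                           [1, 0, 2, 1]])      # target nodes
data = Data(x=x, y=y, edge_index=edge_index)
print(data)  # Data(x=[3, 1], edge_index=[2, 4], y=[3])
```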

jamil2388 commented 8 months ago

I can now create simple homogeneous graph data. Next I will build a larger dataset for a sample model, and then aim to learn heterogeneous graph data generation.
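From the PyG docs, the heterogeneous counterpart looks roughly like this (toy node/edge types and sizes, just to note what I am aiming for):

```python
# Toy HeteroData with skill / member / team node types and two edge types.
import torch
from torch_geometric.data import HeteroData

data = HeteroData()
data['skill'].x = torch.randn(5, 1)     # 5 skill nodes
data['member'].x = torch.randn(3, 1)    # 3 member (expert) nodes
data['team'].x = torch.randn(2, 1)      # 2 team nodes

# which skills / members belong to which teams
data['skill', 'to', 'team'].edge_index = torch.tensor([[0, 1, 2], [0, 0, 1]])
data['member', 'to', 'team'].edge_index = torch.tensor([[0, 1, 2], [0, 1, 1]])
print(data)
```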

jamil2388 commented 8 months ago

@hosseinfani

jamil2388 commented 8 months ago

The sample GCN now runs with proper learning and the expected accuracy. Changes performed:

Next, I need to implement proper connections between these purpose-built functions.

jamil2388 commented 8 months ago
hosseinfani commented 8 months ago

Hi @jamil2388 thanks for the update.

Can you post a link to the exact line of code? I'm not sure I understand the logic of that line (we shouldn't have that line).

jamil2388 commented 8 months ago

Sorry @hosseinfani, I couldn't find a way to post a link to the exact line of code, but here is its location: it is line 84 of the Publication class on the cikm22 branch.

hosseinfani commented 8 months ago

@jamil2388

you mean this line? https://github.com/fani-lab/OpeNTF/blob/b9357eb8b89af43333ed15218fe20d3dfa77ba62/src/cmn/publication.py#L84

Now I remember it :D I wanted to run the pipeline for only the first nrow rows of the sparse matrix (kind of a hidden feature for us, not for general users :D).

By the way, if you click next to a code line on GitHub, you'll get the option to create a permalink to that exact line.

jamil2388 commented 8 months ago

Yes, exactly this line. I just needed to read the data without generating the sparse matrix initially. So, should I comment this line out, or just work with this modification for the time being? "if 'nrow' in settings['data']['domain']['dblp'].keys() and len(teams) > settings['data']['domain']['dblp']['nrow']: break"

jamil2388 commented 8 months ago

Mentioning @mahdis-saeedi in this thread to keep her posted on updates in this GNN line of work.

hosseinfani commented 8 months ago

@jamil2388 No, the settings should have data > domain > {domain name} in param.py, but the existence of nrow is optional.
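i.e., roughly like the sketch below (placeholder values, not the exact param.py content):

```python
# Sketch of the relevant part of the settings (placeholder values).
settings = {
    'data': {
        'domain': {
            'dblp': {
                'nrow': 1000,  # optional: cap on how many teams are read; omit to read everything
            },
        },
    },
}

# The quoted guard then simply never fires when 'nrow' is absent:
teams = []
for raw_team in range(5000):  # stand-in for iterating over the raw dblp records
    teams.append(raw_team)
    dblp = settings['data']['domain']['dblp']
    if 'nrow' in dblp and len(teams) > dblp['nrow']:
        break
```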

jamil2388 commented 8 months ago

@hosseinfani please correct me if I am wrong and let me know what I am missing https://github.com/fani-lab/OpeNTF/blob/b9357eb8b89af43333ed15218fe20d3dfa77ba62/src/cmn/team.py#L94C9-L94C9

hosseinfani commented 7 months ago

I'll be in lab after 4pm. We can review it together. thanks

@jamil2388

jamil2388 commented 7 months ago
jamil2388 commented 7 months ago

Troubleshooting

jamil2388 commented 7 months ago
hosseinfani commented 7 months ago

@jamil2388 thanks for the update. Can you direct me to the code?

jamil2388 commented 7 months ago

@hosseinfani As I am working on my own fork, I am including the link to the main function of gnn_emb.py that is attempting to do the job : https://github.com/jamil2388/OpeNTF_Jamil/blob/7f81801760de926d1ade24b432c84d8483a5dd00/src/mdl/gnn_emb.py#L55

hosseinfani commented 7 months ago

@jamil2388 I'll be in lab 12-2pm. Let's do a quick code review. Thanks.

jamil2388 commented 7 months ago
jamil2388 commented 7 months ago
hosseinfani commented 7 months ago

@jamil2388 Thanks for the update. Please do the following:

Will talk to you on Wed for a quick code review.

jamil2388 commented 7 months ago

@hosseinfani

jamil2388 commented 7 months ago

@hosseinfani I have refactored the code. You can take a look at the following portions

hosseinfani commented 7 months ago

Hi @jamil2388 I made a substantial change to your code. It's still not complete, though; I'll probably finish it by Monday. Please have a look while I'm finishing the refactor.

You need to code more efficiently as we are going to work with large-scale graphs.

I'll talk to you soon.

jamil2388 commented 7 months ago

@hosseinfani Thank you so much for the changes. I am looking into them in the meantime.

jamil2388 commented 6 months ago

@hosseinfani Some updates regarding the changes that have been incorporated, and also the ones ready to deploy from my local copy to this repo:

Here, I did not use any test phase; the losses are from the training phase. Also, because of the mini-batching issue over multiple edge types that I mentioned, I had to use the entire training data in one go (unbatched); only then did the model run successfully. I have the test loss calculated, but it definitely needs some correction.

Issues :

I will update some learnings of mine on a later post. Thanks!

hosseinfani commented 6 months ago

Hi @jamil2388 Thanks for the update. Please integrate them into our pipeline asap. So, for now, we can run the pipeline for the homogeneous graphs for available gnn methods.

hosseinfani commented 6 months ago

@jamil2388 I'm thinking that at the gnn phase, we give the entire graph for training, so there is no need for train/test splits.

However, later, when we create the graphs, we do the split at the graph generation phase.

jamil2388 commented 6 months ago

@hosseinfani, I uploaded the gs_layer class, which holds the GraphSAGE layer definitions. In my local experiment, I kept separate structures for separate models, like this: GCN (init_model, train, learn) and GCN_Layer (the layers for the GCN model);

GS (init_model, train, learn) and GS_Layer (the layers for the GraphSAGE model).

Right now, though, the GS and GCN classes have exactly the same implementation. I did not add a GS class file because the GCN class might still be undergoing refactoring by you. I created separate layer class files because those classes need to inherit from torch.nn.Module.

One thing I feel is that the common parts of the GCN and GS classes (and possibly of other GNN models) could be moved into the existing GNN class, which is already designed to hold the shared work. Then, from the GNN class, we would create specific instances of the models (gs_layer or gcn_layer) based on the parameters.
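A rough sketch of what I mean (class and parameter names are placeholders, and the loss here is a stand-in, not the real objective):

```python
# Sketch of the proposed factoring (placeholder names; the loss is a stand-in).
import torch
from torch_geometric.nn import GCNConv, SAGEConv

class GCNLayers(torch.nn.Module):  # model-specific layer stack (inherits torch.nn.Module)
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1, self.conv2 = GCNConv(in_dim, hidden_dim), GCNConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        return self.conv2(torch.relu(self.conv1(x, edge_index)), edge_index)

class GSLayers(torch.nn.Module):   # GraphSAGE counterpart
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1, self.conv2 = SAGEConv(in_dim, hidden_dim), SAGEConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        return self.conv2(torch.relu(self.conv1(x, edge_index)), edge_index)

class GNN:  # shared driver: holds the common init_model / train / learn logic
    def __init__(self, model_name, in_dim, hidden_dim, out_dim):
        layers = {'gcn': GCNLayers, 'gs': GSLayers}
        self.model = layers[model_name](in_dim, hidden_dim, out_dim)

    def learn(self, data, epochs=100, lr=0.01):
        opt = torch.optim.Adam(self.model.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            emb = self.model(data.x, data.edge_index)
            loss = emb.norm()  # placeholder objective; the real loss depends on the task
            loss.backward()
            opt.step()
        return emb.detach()
```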

hosseinfani commented 6 months ago

@jamil2388 Let's have a meeting and do some pair programming. I'll be available this week in lab.

jamil2388 commented 6 months ago

@hosseinfani that would be great.

jamil2388 commented 6 months ago

@hosseinfani, I am adding the following feature summary of the models we currently have ready to run:

| Model | Homogeneous | Heterogeneous | Undirected | Directed | Duplicated edges |
|---|---|---|---|---|---|
| Node2vec | Yes | No | Need to confirm | Need to confirm | Yes |
| Metapath2vec | No | Yes | Need to confirm | Need to confirm | Yes |
| GCN | Yes | No | Yes | No | Yes |
| GraphSAGE | Yes | Yes | Yes | No | Yes |
| GAT | Yes | Yes | Yes | No | Yes |
| GIN | Yes | Yes | Yes | No | Yes |
mahdis-saeedi commented 6 months ago

Here, some properties of different GNN models are categorized, and the models that support heterogeneous graphs are noted: https://pytorch-geometric.readthedocs.io/en/latest/cheatsheet/gnn_cheatsheet.html

jamil2388 commented 6 months ago

@hosseinfani I updated the repo with the GAT and GIN model classes (gat_layer and gin_layer); they work on both homogeneous and heterogeneous graphs, and the table above is updated accordingly. I also found that with negative_sampling disabled, the models produce much better test loss values than the previous (very bad) test losses with negative_sampling enabled.

I need some direction and help from you. As you mentioned previously regarding node2vec and metapath2vec, you were editing some portions of node2vec in the main pipeline. I was planning to refactor my gnn classes according to the structure you set up in the pipeline for node2vec or metapath2vec; if I wrote the same code independently, it would likely end up inconsistent and error-prone. Right now, my gnn classes are waiting to be included in the pipeline (main.py), and I was also planning to modify gnn.py to hold the generalized portions shared by all the gnn models. A pair programming or discussion session would be great for me, if possible on your side. Thanks!

hosseinfani commented 6 months ago

@jamil2388 Thank you for the update. I'm busy finalizing my courses, but we can meet early next week, Monday ....

jamil2388 commented 6 months ago

@hosseinfani, the preprocessed folder now contains embeddings for dblp, imdb, and uspt (GS, GCN, GAT, GIN). Except for GAT, all of these models were run on CUDA; GAT is causing a CUDA out-of-memory error, which I am still trying to figure out (probably the computation cost blows up when we set heads = 8, the standard value taken from the paper). Other than that, the timings and losses for the trainings are in the "Emb" sheet of this file:

https://docs.google.com/spreadsheets/d/1pz86JQ0a8XeX0AeXt07ayOVE7cat3Qw0FapuqIrzRR0/edit?usp=sharing

jamil2388 commented 5 months ago

Currently, the results of test runs of OpeNTF with different sets of generated embeddings are logged in this Google Sheets document: https://docs.google.com/spreadsheets/d/1AA5QCAVnKOjTAj2lNqHzO53mxQnM25zZlcBEprShOlM/edit?usp=sharing

jamil2388 commented 5 months ago
| Train/Test Split | Prediction File Size | Evaluation Status |
|---|---|---|
| 0.85 | 22 GB | Never completes |
| 0.95 | 7.6 GB | Sometimes gets killed, sometimes holds on |
| 0.99 | 1.5 GB | Completes |
jamil2388 commented 5 months ago

@hosseinfani, while trying to produce gnn->fnn results from the imdb mt5.ts2 dataset, I failed many times because of the loading times of the prediction files in the eval phase (mentioned in the earlier comment). Now that I have some outcomes for the mt75.ts3 dataset (with split_ratio 0.85) (you can check some roc_auc scores at the previously mentioned link https://docs.google.com/spreadsheets/d/1AA5QCAVnKOjTAj2lNqHzO53mxQnM25zZlcBEprShOlM/edit?usp=sharing), the prediction file size has come down drastically to ~40 MB! But I am concerned about a few things:

Thanks!

jamil2388 commented 5 months ago

Notes on a finding about the GNN batching issue:

Problem: while generating mini-batches from a dataset with the LinkNeighborLoader (meant specifically for link prediction), a single mbatch (mini-batch) has several components. For example, from the train_data split of graph type stm (Skill - Team, Member - Team), a generated mbatch contains the following parts:

 HeteroData(
  member={
    x=[3, 1],
    n_id=[3],
  },
  team={
    x=[9, 1],
    n_id=[9],
  },
  skill={
    x=[5, 1],
    n_id=[5],
  },
  (skill, to, team)={
    edge_index=[2, 3],
    edge_attr=[3],
    edge_label=[4],
    edge_label_index=[2, 4],
    e_id=[3],
    input_id=[2],
  },
  (member, to, team)={
    edge_index=[2, 3],
    edge_attr=[3],
    edge_label=[16],
    edge_label_index=[2, 16],
    e_id=[3],
  },
  # reverse edge types
  (team, rev_to, skill)={
    edge_index=[2, 6],
    edge_attr=[6],
    e_id=[6],
  },
  (team, rev_to, member)={
    edge_index=[2, 0],
    edge_attr=[0],
    e_id=[0],
  }
)

We can see that n_id and edge_label_index get generated, and if negative sampling is enabled, there will be negative edges in edge_label_index with corresponding edge_labels of 0. The problem is this: one would expect the node ids mentioned in edge_label_index to be present in the n_id list of the corresponding node types. But unfortunately, if we try to map the ids in edge_label_index to their respective node types' n_id lists, they do not match.

Solution: as discussed by the pyg team in this Slack thread https://torchgeometricco.slack.com/archives/C01DN0B3B1N/p1701860291891099?thread_ts=1701354805.778269&cid=C01DN0B3B1N and in another small reference about "mapping n_id back" here https://github.com/pyg-team/pytorch_geometric/discussions/7797#discussioncomment-6549639

The values in the edge_label_index of an mbatch are locally generated indices: they point to positions within the global n_id lists, unlike the global n_ids stored per node type.

Example: in the mbatch above we have:

| | n_id of the nodes | edge_label_index (local indices) | edge_label_index (mapped to n_ids) |
|---|---|---|---|
| skill nodes | 3, 5, 7, 9, 1 | 1, 0, 3, 2 | 5, 3, 9, 7 |
| team nodes | 3, 7, 25, 2, 16, 17, 13, 10, 29 | 2, 0, 1, 0 | 25, 3, 7, 3 |

In order to work with the edge_label_index values as in this example, I had to incorporate a mapping like the one below:

mbatch['skill'].n_id[mbatch['skill','to','team'].edge_label_index[0]]  # row 0: source (skill) side
mbatch['team'].n_id[mbatch['skill','to','team'].edge_label_index[1]]   # row 1: destination (team) side

This produces the edge_label_index of edge type skill_to_team with the actual n_ids of the skill and team nodes, as shown in the last column.
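Putting it together, the loop I ended up with looks roughly like this (the toy graph, loader arguments, and batch size are placeholders):

```python
# Toy end-to-end sketch: build a small stm-style HeteroData, create a LinkNeighborLoader,
# and map each mini-batch's local edge_label_index back to global node ids via n_id.
import torch
from torch_geometric.data import HeteroData
from torch_geometric.loader import LinkNeighborLoader

train_data = HeteroData()
train_data['skill'].x = torch.randn(5, 1)
train_data['team'].x = torch.randn(9, 1)
train_data['member'].x = torch.randn(3, 1)
train_data['skill', 'to', 'team'].edge_index = torch.tensor([[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]])
train_data['member', 'to', 'team'].edge_index = torch.tensor([[0, 1, 2], [0, 1, 2]])

loader = LinkNeighborLoader(
    train_data,
    num_neighbors=[2, 2],                 # neighbors sampled per hop
    edge_label_index=(('skill', 'to', 'team'),
                      train_data['skill', 'to', 'team'].edge_index),
    neg_sampling_ratio=1.0,               # negatives get edge_label 0
    batch_size=2,
)

for mbatch in loader:
    local = mbatch['skill', 'to', 'team'].edge_label_index  # local positions, not global ids
    global_skill = mbatch['skill'].n_id[local[0]]            # row 0 = source (skill) side
    global_team = mbatch['team'].n_id[local[1]]              # row 1 = destination (team) side
    # global_skill / global_team now hold the actual node ids, as in the last column above
```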

jamil2388 commented 3 months ago

@hosseinfani I am trying to confirm whether my approach to testing with random data is correct.

https://github.com/fani-lab/OpeNTF/blob/9f00fd021a9da12cce7ccd2f59df940b45361161/src/main.py#L183

Here, these conditions are only satisfied when I set emb_model (to any gnn model) and emb_random to a value from 0 to 3. Below, consider emb to be the gnn embedding and emb_skill to be the embedding of only the skills.

- emb_random = 0 -> dot product of vecs['skill'] with emb_skill as it is (no randomness); output shape (n_teams x n_dimensions)
- emb_random = 1 -> dot product of vecs['skill'] with emb_skill, where emb_skill holds random embedding data; output shape (n_teams x n_dimensions)
- emb_random = 2 -> random sparse matrix in place of vecs['skill'], with values in (0, 1), of shape (n_teams x n_dimensions)
- emb_random = 3 -> random sparse matrix in place of vecs['skill'], with values in (0, 1), of shape (n_teams x n_skills)
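A sketch of how I build the random variants for options 1 and 3 (sizes and density below are placeholders, not the real dblp dimensions):

```python
# Sketch of the random variants (sizes and density are placeholders, not the real dblp shapes).
import numpy as np
import scipy.sparse as sp

n_teams, n_skills, n_dim = 1000, 300, 64

vecs_skill = sp.random(n_teams, n_skills, density=0.01, format='csr')  # stands in for vecs['skill']
vecs_skill.data[:] = 1.0                                                # binary team-by-skill matrix

# emb_random = 1: keep vecs['skill'], but use a random skill embedding before the dot product
emb_skill_random = np.random.rand(n_skills, n_dim)
team_inputs_1 = vecs_skill @ emb_skill_random        # shape: (n_teams, n_dimensions)

# emb_random = 3: replace vecs['skill'] itself with a random 0/1 sparse matrix of the same shape
rand_skill = sp.random(n_teams, n_skills, density=0.01, format='csr')
rand_skill.data[:] = 1.0
team_inputs_3 = rand_skill                            # shape: (n_teams, n_skills)
```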

I am focused mostly on emb_random = 1 and 3 because they introduce randomness into the gnn skill embedding and the skill matrix, respectively. In short, I am only replacing the skill part of the teamsvecs sparse matrix with randomness and then feeding it into the FNN or BNN. My confusion is this: while training, the FNN or BNN will learn to map these random skills to the actual experts in its own way, eventually adjusting to predict the correct experts. That means whatever we feed as skills, the model will compare the predicted experts against vecs['member'] and gradually adjust its predictions to eventually output the correct experts (even with the wrong sets of skills). Is my approach correct, or am I looking at it the wrong way?

Thanks!

hosseinfani commented 3 months ago

Hi @jamil2388 Thanks for the update. You're right, options 1 and 3 make sense.

Regarding your question: if an expert works on a skill across many teams, like Jamil on GNN, then when we shuffle the skills of Jamil's teams, the model will learn other skills for Jamil, like ML. Then during the test, a test team needing ML would get Jamil rather than a GNN team, so the test results should drop. In other words, when we randomize the skills of experts, the model ignores the specialty of experts in a few skills.

I'm in the lab today. we can talk more on this.

jamil2388 commented 3 months ago

@hosseinfani, I will present some observations on hyperparameter tuning for FNN across several comments here.

Effect of Learning Rate (LR)

FNN was run with LRs 0.1, 0.01, and 0.001, and with different types of losses and negative sampling (none, uniform, unigram, unigram_b, weighted, positive cross entropy).

My theoretical expectation is that, with a decent enough adjustment of the hyperparameters, I should see a better train vs. validation loss curve, which would imply that the model is actually learning. So I observed the patterns of the loss curves for these different LRs. LR 0.001 clearly shows the model learning at least something from the data, unlike the erratic behavior with LR 0.1 and 0.01. The following figure is for the dblp mt100.ts5 data with distribution t2151.s4289.m3373:

[Figure s1: train vs. validation loss curves; rows 1-3 correspond to LR 0.1, 0.01, and 0.001]

You can see in the figure that Row 3 (LR = 0.001) clearly has a more interpretable learning curve than the other two setups in Rows 1 and 2. My understanding of the ideal learning curve is guided by this article: https://rstudio-conf-2020.github.io/dl-keras-tf/notebooks/learning-curve-diagnostics.nb.html#:~:text=An%20optimal%20fit%20is%20one,zero%20in%20an%20ideal%20situation).

Each training run also uses early stopping, which stops training whenever the validation loss has not improved for patience = 5 epochs. Since the trainings were run for 50 epochs, early stopping around the halfway mark (25 epochs) or later would imply that the validation loss was still improving. Coincidentally, the average early-stopping points in these setups also show a logical pattern. Overall, I infer that LR 0.001 allowed the model to train longer and better, based on these observations and on the AUC-ROC scores listed in this document https://docs.google.com/spreadsheets/d/1AA5QCAVnKOjTAj2lNqHzO53mxQnM25zZlcBEprShOlM/edit?usp=sharing Below is the overall inference:

| Row | Learning Rate (LR) | Early Stopping (mostly around) | Inference |
|---|---|---|---|
| 1 | 0.1 | 5-8 epochs | No trace of learning |
| 2 | 0.01 | 5-12 epochs | No trace of learning |
| 3 | 0.001 | Above 10, up to 50 (no stopping) | Quite some trace of learning |
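For reference, the early-stopping rule behind these numbers is essentially the following schematic (the train/valid functions are stand-ins; only patience = 5 and the 50-epoch cap match the actual runs):

```python
# Schematic of the early stopping in these runs: stop when the validation loss has not
# improved for `patience` consecutive epochs (patience = 5, at most 50 epochs).
import math, random

def train_one_epoch():    # stand-in for the actual training step
    return random.random()

def evaluate_on_valid():  # stand-in for the actual validation pass
    return random.random()

patience, max_epochs = 5, 50
best_valid, bad_epochs = math.inf, 0

for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    valid_loss = evaluate_on_valid()
    if valid_loss < best_valid:
        best_valid, bad_epochs = valid_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # e.g. the LR 0.1 runs mostly stopped around epochs 5-8
```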

I will update the other findings successively.

jamil2388 commented 3 months ago

@hosseinfani

Comparison against Random and Metapath2Vec baselines :

This time, I ran FNN on the dblp mt120.ts3 data (distribution t34285.s18163.m5381) for 25 epochs (to quickly gather insights). FNN was run on:

  1. Sparse Matrices, [non-emb]
  2. Random Sparse Matrices (the mistaken one), [non-emb random]
  3. Random Sparse Matrices (the corrected one), [non-emb random updated]
  4. Metapath2Vec Dense Matrices (size - 32, 64, 128, epoch - 100, negative sampling - 5 etc.) [m2v.e100.dX]

The mistaken random matrices had the entire vecs['skill'] replaced by a random skill matrix. I corrected this by replacing only the train and validation splits of the skill matrix with random matrices of the appropriate size. The comparison of results came out interesting; again, we see a meaningful result only when the LR is 0.001.

For actual data (non-emb) vs. random data (non-emb random updated), we got exactly the expected result when the LR is 0.001. For the other LRs, the random AUC-ROC sometimes outperforms the original in the uniform and positive cross entropy setups! Also, the random test is corroborated by the consistently bad results (red values in the non-emb random updated column) compared to the random results in the non-emb random column and the other results.

The results are below: [Figure s2: AUC-ROC scores for the non-emb, non-emb random, non-emb random updated, and m2v setups across LRs and negative sampling types]

But one mysterious point: if the skill matrix is replaced entirely by the random matrix, why does it produce the best scores? (Considering the random nature of the matrix, it should not contribute to a good score overall.) You can see that non-emb random always gets a slightly better score than its counterpart in the non-emb column in the LR 0.001 section.

There are more inferences to be made from these scores. We can clearly see the concentration of good results (including the best) in the LR 0.001 section, mostly in the weighted section. I am currently ignoring the unigram_b negative sampling because it takes too much time per training instance and consistently produces bad results throughout the experiments. I am trying to optimize the metapath2vec model to check whether it crosses the baseline FNN scores (which, theoretically, should be the case). I will also add the other GNN methods accordingly to get more trends and scores.

hosseinfani commented 2 months ago

Hi @jamil2388 Thanks for the update. Nice that we found the issue finally :) Just a quick reminder about our research questions, which are directing our research:

RQ1. Which gnn is the best for our task using transfer learning?
RQ2. Which dimension is the best?
RQ3. Which classifier is the best (fnn vs. bnn vs. negative samplings)?

jamil2388 commented 2 months ago

@hosseinfani Thanks for responding. While running experiments, I honestly lost track of the research questions; this will keep me on track, thanks for the heads-up ^_^ I have collected some results which should at least show a trend. I will organize and post them soon.

jamil2388 commented 2 months ago

@hosseinfani, I was looking at the filtered data generation process (for mt120.ts3) and found something unusual. The following error occurs:

team.py:167: RuntimeWarning: invalid value encountered in cast
  data_[j] = team.get_one_hot(s2i, c2i, l2i, location_type)

Despite this warning, the filtered data that gets generated does not show any error elsewhere, but the sparse matrix containing team ids (vecs['id']) has a lot of zeros in it (meaning the same id appears for multiple teams!). I presume it occurs because of the use of location and location-based indices. Is there any workaround for this? The Team.Bucketing method takes such location-based arguments.

https://github.com/fani-lab/OpeNTF/blob/03dbaa242559f91687b2282ea58f17c277082fb1/src/cmn/team.py#L150

Both of these lines raise the error:
https://github.com/fani-lab/OpeNTF/blob/03dbaa242559f91687b2282ea58f17c277082fb1/src/cmn/team.py#L161
https://github.com/fani-lab/OpeNTF/blob/03dbaa242559f91687b2282ea58f17c277082fb1/src/cmn/team.py#L167

Is there any way for me to ignore the location-based arguments, or any modification you might suggest? I just need to generate some filtered data. Thanks!

jamil2388 commented 1 month ago

@hosseinfani

I added the file fbnn.py for my testing of the new bnn. As we discussed, I took the library from https://github.com/IntelLabs/bayesian-torch/tree/main

The code runs on the full datasets, but some of the results I have so far are lower than the previous version of bnn, so it needs some basic tuning and then some model-specific tuning. I am also quite unsure about the internals of this model. Could you please take a look at it? I need to make sure the model is logically correct and get some guidance on how to tune it.

It is under the opentf pipeline; I replaced only the train, validation, and test calculation portions with the ones instructed by the library. Otherwise, all the data handling parts are very similar to bnn.

The parameter settings at init: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L44
Train calculation: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L153
Valid calculation: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L165
Test calculation: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L240
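For the review, the pattern I followed is my reading of the library's README: convert the deterministic network with dnn_to_bnn and add the KL term from get_kl_loss to the data loss. A rough sketch (layer sizes, prior values, and the toy batch are placeholders, not the fbnn.py settings):

```python
# My reading of the bayesian-torch README pattern (not the exact fbnn.py code):
# convert a deterministic FNN in place, then add the KL term to the data loss.
import torch
from bayesian_torch.models.dnn_to_bnn import dnn_to_bnn, get_kl_loss

model = torch.nn.Sequential(                      # stand-in for our FNN architecture
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))

bnn_prior = {                                     # prior/posterior settings passed at init
    "prior_mu": 0.0, "prior_sigma": 1.0,
    "posterior_mu_init": 0.0, "posterior_rho_init": -3.0,
    "type": "Reparameterization", "moped_enable": False, "moped_delta": 0.5,
}
dnn_to_bnn(model, bnn_prior)                      # replaces the Linear layers with Bayesian ones

x = torch.randn(16, 128)                          # toy batch: 16 samples, 128 skill dims
y = (torch.rand(16, 32) > 0.9).float()            # toy multi-hot expert targets
criterion = torch.nn.BCEWithLogitsLoss()

out = model(x)
loss = criterion(out, y) + get_kl_loss(model) / x.size(0)  # data loss + scaled KL term
loss.backward()
```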

Thanks a lot!

jamil2388 commented 1 month ago

@hosseinfani Also, I am using the basic cross-entropy loss function for the loss in train and valid. With the existing workflow, can I just replace this part with the uniform sampling part? (I am currently only trying to focus on the uniform sampling technique.)

Cross entropy with 'none' negative sampling: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L159
The negative sampling counterpart in bnn: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/bnn.py#L168

Please note that I don't have the sample_elbo calculation that gives us the layer_loss in the bnn counterpart, where the layer_loss is aggregated along with the loss calculation. This part has been skipped in my fbnn for the time being.
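To make sure we mean the same thing by the uniform part, my reading of it is sketched below (a schematic of uniformly sampled negative experts per team, not the actual bnn.py code; the helper name is made up):

```python
# Schematic of uniformly sampled negatives (made-up helper, not the bnn.py implementation):
# per team, compute the loss over the positive experts plus ns uniformly drawn negatives.
import torch

def uniform_ns_loss(logits, targets, ns=5):
    """logits, targets: (batch, n_members); targets are multi-hot expert vectors."""
    bce = torch.nn.BCEWithLogitsLoss(reduction='none')
    losses = []
    for logit, target in zip(logits, targets):
        pos = target.nonzero(as_tuple=True)[0]         # the team's actual experts
        neg = torch.randint(0, target.numel(), (ns,))  # uniform random negatives (may overlap positives)
        idx = torch.cat([pos, neg])
        losses.append(bce(logit[idx], target[idx]).mean())
    return torch.stack(losses).mean()

loss = uniform_ns_loss(torch.randn(4, 100), (torch.rand(4, 100) > 0.95).float())
```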