jamil2388 opened this issue 1 year ago
I started a new project where I am trying to run a sample node-level classification task on the 'Cora' dataset. I created the model and got the forward pass working properly. I understand how GCNConv layers basically perform message passing over one hop per layer. But something is wrong with the backward pass: it is not updating the weights, so no learning is happening. I am trying to address this issue.
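For reference, here is a minimal sketch of the standard training loop on Cora (assuming the usual Planetoid setup with a 2-layer GCN, not the project code); a missing `loss.backward()`, `optimizer.step()`, or `optimizer.zero_grad()` is the usual reason the weights never update:

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root='data/Cora', name='Cora')
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))  # one hop of message passing
        return self.conv2(x, edge_index)       # second hop

model = GCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()                      # clear gradients from the previous step
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()                            # backward pass computes gradients
    optimizer.step()                           # this is what actually updates the weights
```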
Meanwhile, I also tried to understand how pyg stores the graph structure for different datasets. I learned that it has a unified structure for all types of graphs, namely torch_geometric.data.Data, and how the attributes x, y, and edge_index collectively store the graph information.
I can now create a simple homogeneous graph (a small hand-built example follows below). Next, I will build a bigger dataset for a sample model, and then aim to learn heterogeneous graph data generation.
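As an illustration of the x/y/edge_index convention, here is a minimal hand-built homogeneous graph (toy values, not project data):

```python
import torch
from torch_geometric.data import Data

x = torch.tensor([[1.0], [2.0], [3.0]])   # 3 nodes, 1 feature each
y = torch.tensor([0, 1, 0])               # one label per node
edge_index = torch.tensor([[0, 1, 1, 2],  # source nodes
                           [1, 0, 2, 1]]) # target nodes (both directions -> undirected)
data = Data(x=x, y=y, edge_index=edge_index)
print(data)  # Data(x=[3, 1], edge_index=[2, 4], y=[3])
```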
@hosseinfani
The sample GCN now trains properly and reaches the expected accuracy. Changes performed:
Next, I need to connect the purpose-specific functions into a proper pipeline.
Hi @jamil2388 thanks for the update.
can you post the link to the exact line of code? I'm not sure I understood the logic of that line (we shouldn't have that line)
Sorry @hosseinfani, I couldn't find a way to post the link to the exact line of code, but I will include its location: it is line 84 of the Publication class on the cikm22 branch.
@jamil2388
you mean this line? https://github.com/fani-lab/OpeNTF/blob/b9357eb8b89af43333ed15218fe20d3dfa77ba62/src/cmn/publication.py#L84
now I remember it :D I wanted to run the pipeline for the first nrow of the sparse matrix (kind of hidden feature for us, not general users :D)
btw, if you click next to a code line's number, you'll get options to create a link to that line.
Yes, exactly this line. I just needed to read the data without generating the sparse matrix initially. So, should I comment this line out, or maybe just do my work with this modification for the time being? `if 'nrow' in settings['data']['domain']['dblp'].keys() and len(teams) > settings['data']['domain']['dblp']['nrow']: break`
Mentioning @mahdis-saeedi in this thread to keep her posted on the updates in this GNN issue.
@jamil2388 no, the settings should have data > domain > {domain name} in param.py. but the existence of nrow is optional
@hosseinfani please correct me if I am wrong and let me know what I am missing https://github.com/fani-lab/OpeNTF/blob/b9357eb8b89af43333ed15218fe20d3dfa77ba62/src/cmn/team.py#L94C9-L94C9
I'll be in lab after 4pm. We can review it together. thanks
@jamil2388
Created a model to generate Transductive (Shallow) embeddings using torch_geometric.nn.Node2Vec, which only works for homogeneous graphs.
Tested it on the Cora dataset and on a simple 3-node graph to generate node embeddings of the preferred size
The model reference has been taken from https://pytorch-geometric.readthedocs.io/en/latest/tutorial/shallow_node_embeddings.html
Unfortunately, the generated embeddings have not yet been compared against any reference values or ground truth (which could determine their validity)
But according to the referenced documentation, Node2Vec generates node embeddings based on positive and negative random walks from the nodes; the loss is calculated from these positive and negative walks in each iteration (a minimal sketch follows the list below)
Need to discover exactly what happens when generating these embeddings
Need to check how good these embeddings are against a reference value
Need to use Metapath2Vec to fit the dblp toy data (HeteroData) and generate the embeddings
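Here is the minimal sketch mentioned above, following the referenced PyG tutorial (Cora is only a stand-in homogeneous graph, and the hyperparameter values are illustrative):

```python
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import Node2Vec

data = Planetoid(root='data/Cora', name='Cora')[0]
model = Node2Vec(data.edge_index, embedding_dim=64, walk_length=20, context_size=10,
                 walks_per_node=10, num_negative_samples=1, sparse=True)
loader = model.loader(batch_size=128, shuffle=True)  # yields (positive walks, negative walks)
optimizer = torch.optim.SparseAdam(list(model.parameters()), lr=0.01)

model.train()
for epoch in range(5):
    for pos_rw, neg_rw in loader:
        optimizer.zero_grad()
        loss = model.loss(pos_rw, neg_rw)  # skip-gram loss over positive/negative walks
        loss.backward()
        optimizer.step()

emb = model()  # final node embeddings, shape [num_nodes, 64]
```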
@jamil2388 thanks for the update. Can you direct me to the code?
@hosseinfani As I am working on my own fork, here is the link to the main function of gnn_emb.py that attempts to do the job: https://github.com/jamil2388/OpeNTF_Jamil/blob/7f81801760de926d1ade24b432c84d8483a5dd00/src/mdl/gnn_emb.py#L55
@jamil2388 I'll be in lab 12-2pm. Let's do a quick code review. Thanks.
@jamil2388 Thanks for the update. Please do the following:
Will talk to you on Wed for a quick code review.
@hosseinfani
@hosseinfani I have refactored the code. You can take a look at the following portions
Hi @jamil2388 I made a huge change to your code. It is still not complete though; I'll probably finish it by Monday. Please have a look while I'm finishing the code refactor.
You need to code more efficiently as we are going to work with large-scale graphs.
I'll talk to you soon.
@hosseinfani Thank you so much for the changes. I am looking into them in the meantime.
@hosseinfani Some updates regarding the changes that have been incorporated and the ones ready to deploy from my local copy to this repo:
Incorporated GCN for homogeneous graphs only. The code is adaptable to all graph types, but GCNConv layers need a special type of message passing (one that requires self-loops) which is only doable on our homogeneous graphs right now. Still looking for a workaround.
Based on the outcome of GCN, I can deploy GraphSAGE in this repo. The code is in my local copy and is very similar in structure to GCN; if GCN fits into the gnn pipeline, I can push GraphSAGE to the remote repo. GraphSAGE can run on any type of graph except directed graphs (see the sketch below).
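To make the GCN-vs-GraphSAGE point concrete, here is a rough sketch (illustrative names, not the repo's classes) of why GraphSAGE lifts to heterogeneous graphs via to_hetero, while plain GCNConv, which adds self-loops and normalizes over a single node set, breaks on bipartite edge types like (skill, to, team):

```python
import torch
from torch_geometric.nn import SAGEConv, to_hetero

class SAGE(torch.nn.Module):
    def __init__(self, hidden, out):
        super().__init__()
        self.conv1 = SAGEConv((-1, -1), hidden)  # (-1, -1): lazy sizes, works on bipartite inputs
        self.conv2 = SAGEConv((-1, -1), out)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

# hetero_data stands for one of our HeteroData graphs (e.g. the stm graph)
# model = to_hetero(SAGE(64, 32), hetero_data.metadata(), aggr='sum')
# out = model(hetero_data.x_dict, hetero_data.edge_index_dict)
```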
I have run GCN and GraphSAGE on 4 main pickle files, and the outcomes are as below:

Data | Model | Loss (epochs) | Time taken for total epochs |
---|---|---|---|
m.undir.none.data.pkl | GCN | 0.0292 (100) / 0.0014 (1000) | 45.8 minutes (1000) |
m.undir.mean.data.pkl | GCN | 0.0300 (100) / 0.016 (1000) | 46.11 minutes (1000) |
stm.undir.none.data.pkl | GraphSAGE | 0.0004 (100) | 2.2 hours (100) |
stm.undir.mean.data.pkl | GraphSAGE | 0.0008 (100) | 44.75 minutes (100) |
Here, I did not use any test phase; the losses are from the training phase. Also, because of the mini-batching issue on multiple edge types I mentioned, I had to feed the entire training data in one go (unbatched); only then did the model run successfully. I have the test loss calculated, but it definitely needs some correction.
Issues:
I will post some of my learnings in a later comment. Thanks!
Hi @jamil2388 Thanks for the update. Please integrate them into our pipeline asap. So, for now, we can run the pipeline for the homogeneous graphs for available gnn methods.
@jamil2388 I'm thinking that at the gnn phase, we give the entire graph for training, so no need for train/test splits.
However, later, when we create the graphs, we do the split at the graph generation phase.
@hosseinfani, I uploaded the gs_layer class, which has the definitions for the GraphSAGE implementation. In my local experiment, I kept separate structures for separate models, like this:
GCN (init_model, train, learn), GCN_Layer (the layers for the GCN model)
GS (init_model, train, learn), GS_Layer (the layers for the GraphSAGE model)
But right now, the GS and GCN classes have exactly the same implementation. I did not add the GS class file because the GCN class might still be undergoing some refactoring by you. I created separate layer class files because those classes need to inherit from torch.nn.Module.
One thing I feel is that the common parts of the GCN and GS classes (and maybe of other GNN models) can be moved into the existing GNN class (which is already designed to do the common work). Then, from the GNN class, we create specific instances of the models (gs_layer or gcn_layer) based on the parameters (a rough sketch follows below).
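A rough sketch of the refactor I have in mind (hypothetical names, not the current repo structure): the shared logic stays in the existing GNN class, and only the torch.nn.Module layer stacks differ per model, chosen by a parameter:

```python
import torch
from torch_geometric.nn import GCNConv, SAGEConv

class GcnLayer(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.conv = GCNConv(d_in, d_out)
    def forward(self, x, edge_index):
        return self.conv(x, edge_index)

class GsLayer(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.conv = SAGEConv(d_in, d_out)
    def forward(self, x, edge_index):
        return self.conv(x, edge_index)

class Gnn:
    layers = {'gcn': GcnLayer, 'gs': GsLayer}       # gat/gin could be registered the same way
    def init_model(self, name, d_in, d_out):
        self.model = Gnn.layers[name](d_in, d_out)  # shared code picks the model-specific layers
    # train(), learn(), ... remain here as the common pipeline
```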
@jamil2388 Let's have a meeting and do some pair programming. I'll be available this week in lab.
@hosseinfani that would be great.
@hosseinfani, I am adding the following feature overview of the models we currently have ready to run:
Model | Homogeneous | Heterogeneous | Undirected | Directed | Duplicated edges |
---|---|---|---|---|---|
Node2vec | Yes | No | Need to Confirm | Need to Confirm | Yes |
Metapath2vec | No | Yes | Need to Confirm | Need to Confirm | Yes |
GCN | Yes | No | Yes | No | Yes |
GraphSAGE | Yes | Yes | Yes | No | Yes |
GAT | Yes | Yes | Yes | No | Yes |
GIN | Yes | Yes | Yes | No | Yes |
Here, some properties of different GNN models are categorized, and the models that support heterogeneous graphs are listed: https://pytorch-geometric.readthedocs.io/en/latest/cheatsheet/gnn_cheatsheet.html
@hosseinfani I updated the repo with the GAT and GIN model classes (gat_layer and gin_layer); both work on homogeneous and heterogeneous graphs (updated in the table above). I also found that, with negative_sampling disabled, the models produce much better test loss values than the previous test losses with negative_sampling enabled (which were very bad).
I need some direction and help from you. As you mentioned regarding node2vec and metapath2vec previously, you were editing some portions of node2vec in the main pipeline. I was planning to refactor my gnn classes according to the structure you set up in the pipeline for node2vec or metapath2vec; if I had written the same code independently, it would likely have been quite error-prone. Right now, my gnn classes are waiting to be included in the pipeline (main.py), and I was also looking to modify gnn.py by pulling the generalized portions out of all the gnn models. A pair programming or discussion session would be great for me, if possible on your side. Thanks!
@jamil2388 Thank you for the update. I'm busy with finalizing my courses but we can meet early next week, Monday ....
@hosseinfani, the preprocessed folder now contains embeddings for dblp, imdb and uspt (GS, GCN, GAT, GIN). Except for GAT, all of these models were run on CUDA; GAT causes a CUDA out-of-memory error. I am still trying to figure that out (probably the computation cost blows up when we set the parameter heads = 8, the standard value taken from the paper). Other than that, the timings and losses for the training runs are in the "Emb" sheet of this file:
https://docs.google.com/spreadsheets/d/1pz86JQ0a8XeX0AeXt07ayOVE7cat3Qw0FapuqIrzRR0/edit?usp=sharing
Currently the results of test runs of OpeNTF with different sets of generated embeddings are logged in this google docs https://docs.google.com/spreadsheets/d/1AA5QCAVnKOjTAj2lNqHzO53mxQnM25zZlcBEprShOlM/edit?usp=sharing
Train/test split | Prediction file size | Evaluation status |
---|---|---|
0.85 | 22 GB | Never completes |
0.95 | 7.6 GB | Sometimes gets killed, sometimes holds on |
0.99 | 1.5 GB | Completes |
@hosseinfani, while trying to produce gnn->fnn results from the imdb mt5.ts2 dataset, I failed many times due to the loading time of the prediction files in the eval phase (mentioned in the earlier comment). Now that I have some outcomes for the mt75.ts3 dataset (with split_ratio 0.85) (you can check some roc_auc scores from the previously mentioned link https://docs.google.com/spreadsheets/d/1AA5QCAVnKOjTAj2lNqHzO53mxQnM25zZlcBEprShOlM/edit?usp=sharing), the pred file size has come down drastically to ~40 MB! But I am concerned about a few things:
Early stopping does not occur when I switch from none to 'unigram_b' negative sampling in fnn, which means the training runs for the entire 10 epochs and 3 folds. Note that I only changed the negative sampling setting in the param file. For your convenience, here is the link to the early stopping in fnn: https://github.com/fani-lab/OpeNTF/blob/579e089fd44f9f590673dbb07ff87c272fd9d087/src/mdl/fnn.py#L262
The image shows the last portion of the training on GIN 500 epochs dim 64 > FNN_unigram_b 10 epochs. If I am not mistaken, early stopping is not triggering before the counter reaches 5, unlike with the previous FNN none setting (a sketch of the pattern I mean follows below).
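This is not the repo's exact code, just a minimal sketch of the patience-based pattern I mean (the counter grows only while the validation loss fails to improve, and training stops once it reaches patience = 5; with unigram_b the counter apparently never gets there, so all epochs run):

```python
def fit(train_one_epoch, validate, epochs=10, patience=5):
    # train_one_epoch / validate are assumed callables supplied by the model
    best_valid, counter = float('inf'), 0
    for epoch in range(epochs):
        train_one_epoch()
        valid_loss = validate()
        if valid_loss < best_valid:
            best_valid, counter = valid_loss, 0  # any improvement resets the counter
        else:
            counter += 1
            if counter >= patience:
                break                            # early stop after 5 stagnant epochs
```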
You can see the trend here when the pipeline is running for 10 epochs and 3 folds for two GNN sm mean 64
And for GAT stm mean 32
The training sometimes takes almost 2 hours. Should I change any setting here, like the learning rate (currently 0.001), epochs, early stopping, folds, or anything else?
Thanks!
Notes about a finding in the GNN batching issue:
Problem: While generating mini-batches from a dataset with LinkNeighborLoader (specific to link prediction), a single mbatch (mini-batch) has several components. For example, from the train_data split with graph type stm (Skill - Team, Member - Team), a generated mbatch contains the following parts:
HeteroData(
member={
x=[3, 1],
n_id=[3],
},
team={
x=[9, 1],
n_id=[9],
},
skill={
x=[5, 1],
n_id=[5],
},
(skill, to, team)={
edge_index=[2, 3],
edge_attr=[3],
edge_label=[4],
edge_label_index=[2, 4],
e_id=[3],
input_id=[2],
},
(member, to, team)={
edge_index=[2, 3],
edge_attr=[3],
edge_label=[16],
edge_label_index=[2, 16],
e_id=[3],
},
## reverse edge stuffs
(team, rev_to, skill)={
edge_index=[2, 6],
edge_attr=[6],
e_id=[6],
},
(team, rev_to, member)={
edge_index=[2, 0],
edge_attr=[0],
e_id=[0],
}
)
We can see that n_id and edge_label_index get generated, and if negative sampling is enabled, there will be negative edges in edge_label_index with corresponding edge_labels of 0. The problem is that the node ids mentioned in edge_label_index should obviously be present in the n_id list of the individual node types; but unfortunately, if we try to map the ids in edge_label_index onto their respective node-type n_ids, they do not match.
Solution: As discussed in this Slack thread by the pyg team https://torchgeometricco.slack.com/archives/C01DN0B3B1N/p1701860291891099?thread_ts=1701354805.778269&cid=C01DN0B3B1N and in another small reference on "mapping n_id back" here https://github.com/pyg-team/pytorch_geometric/discussions/7797#discussioncomment-6549639
the node ids in the edge_label_index of the mbatches are locally generated indices pointing into the node type's n_id vector, unlike the global n_ids stored in the individual node types.
Example: In the mentioned mbatch we have:
Node type | n_id of the nodes | edge_label_index with local indices | edge_label_index with n_ids (after mapping) |
---|---|---|---|
skill nodes | 3, 5, 7, 9, 1 | 1, 0, 3, 2 | 5, 3, 9, 7 |
team nodes | 3, 7, 25, 2, 16, 17, 13, 10, 29 | 2, 0, 1, 0 | 25, 3, 7, 3 |
In order to work with the edge_label_index as in this example, I had to apply a mapping like the one below:
`mbatch['skill'].n_id[mbatch['skill','to','team'].edge_label_index[0]]`
`mbatch['team'].n_id[mbatch['skill','to','team'].edge_label_index[1]]`
which produces the edge_label_index of edge type skill-to-team with the actual n_ids of the skill and team nodes, as shown in the last column (row 0 of edge_label_index indexes the source/skill n_ids and row 1 the destination/team n_ids).
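Re-creating the mapping with plain tensors (the values are copied from the example mbatch above): row 0 of edge_label_index indexes the source (skill) n_id vector and row 1 indexes the destination (team) n_id vector.

```python
import torch

skill_n_id = torch.tensor([3, 5, 7, 9, 1])                   # global skill ids in this mbatch
team_n_id = torch.tensor([3, 7, 25, 2, 16, 17, 13, 10, 29])  # global team ids in this mbatch
edge_label_index = torch.tensor([[1, 0, 3, 2],               # local skill indices
                                 [2, 0, 1, 0]])              # local team indices

global_skill_ids = skill_n_id[edge_label_index[0]]  # tensor([5, 3, 9, 7])
global_team_ids = team_n_id[edge_label_index[1]]    # tensor([25, 3, 7, 3])
```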
@hosseinfani I am trying to confirm whether my approach for testing with random data is correct.
https://github.com/fani-lab/OpeNTF/blob/9f00fd021a9da12cce7ccd2f59df940b45361161/src/main.py#L183
Here, these conditions are only satisfied when I set emb_model (to any gnn model) and emb_random to a value from 0 to 3. Here, consider emb as the gnn embedding and emb_skill as the embedding of only the skills (a small sketch of options 1 and 3 follows below).
emb_random = 0 -> dot product of vecs['skill'] with emb_skill as-is (no randomness); output shape (n_teams, n_dimensions)
emb_random = 1 -> dot product of vecs['skill'] with emb_skill, where emb_skill holds random embedding data; output shape (n_teams, n_dimensions)
emb_random = 2 -> random sparse matrix vecs['skill'] with values in (0, 1), of shape (n_teams, n_dimensions)
emb_random = 3 -> random sparse matrix vecs['skill'] with values in (0, 1), of shape (n_teams, n_skills)
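A small sketch of how I generate the random inputs for options 1 and 3 (shapes and names here are illustrative, not the actual teamsvecs):

```python
import numpy as np
from scipy.sparse import csr_matrix

n_teams, n_skills, n_dims = 1000, 300, 64  # illustrative sizes

# option 1: keep vecs['skill'] but use a random skill embedding, then
# team features = vecs['skill'] @ emb_skill, with shape (n_teams, n_dims)
emb_skill = np.random.rand(n_skills, n_dims)

# option 3: replace vecs['skill'] itself with a random 0/1 sparse matrix
random_skill = csr_matrix(np.random.randint(0, 2, size=(n_teams, n_skills)))
```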
I am focused mostly on emb_random = 1 and 3 because they represent randomness for the gnn skill embedding and for the skill matrix, respectively. In short, I am only replacing the skill part of the teamsvecs sparse matrix with randomness and then feeding it into FNN or BNN. The confusion I am facing is: during training, FNN or BNN will learn to map these random skills to the actual experts in its own way, eventually adjusting to predict the correct experts. In other words, whatever we feed as skills, the model will compare the predicted experts against vecs['member'] and adjust its predictions with gradual learning to eventually predict the correct experts (from the wrong sets of skills). Is my approach correct, or am I looking at it the wrong way?
Thanks!
Hi @jamil2388 Thanks for the update. You're right, option 1 and 3 make sense.
Regarding your question: if an expert works on a skill for many teams, like Jamil on GNN, then when we shuffle the skills across Jamil's teams, the model would learn other skills for Jamil, like ML. Then during the test, a test team for ML would have Jamil, not GNN, so the test results should drop. In other words, when we randomize the skills for experts, the specialty of experts in a few skills will be ignored by the model.
I'm in the lab today. we can talk more on this.
@hosseinfani, I will present some observations about hyperparameter tuning on FNN in several comments here.
FNN was run with LR 0.1, 0.01 and 0.001 and with different types of losses and negative sampling (none, uniform, unigram, unigram_b, weighted, positive cross-entropy).
My theoretical expectation is that, with a decent adjustment of the hyperparameters, I should see a better train vs. validation loss curve, which would imply that the model is actually learning. So I observed the patterns of the loss curves for these different LRs. LR 0.001 clearly shows the model learning at least something from the data, unlike the erratic behavior with LR 0.1 and 0.01. The following figure is for the data dblp mt100.ts5 with distribution t2151.s4289.m3373:
You can see in the figure that Row 3 (LR = 0.001) has a clearly more interpretable learning curve than the other two setups in Rows 1 and 2. My understanding of the ideal learning curve is aided by this article: https://rstudio-conf-2020.github.io/dl-keras-tf/notebooks/learning-curve-diagnostics.nb.html#:~:text=An%20optimal%20fit%20is%20one,zero%20in%20an%20ideal%20situation).
Another point is that each training run uses early stopping, which halts training whenever the validation loss has not improved for patience = 5 epochs. Since the trainings were run for 50 epochs, early stopping around halfway (25 epochs) or later would imply that the validation loss was still improving. Coincidentally, the average early stopping points in these setups also show a logical pattern. Overall, I infer that LR 0.001 allowed the model to train longer and better, based on these observations and on the AUC-ROC scores listed in the document https://docs.google.com/spreadsheets/d/1AA5QCAVnKOjTAj2lNqHzO53mxQnM25zZlcBEprShOlM/edit?usp=sharing Below is the overall inference:
Row | Learning Rate (LR) | EarlyStopping (Mostly around) | Inference |
---|---|---|---|
1 | 0.1 | 5-8 Epochs | No trace of learning |
2 | 0.01 | 5-12 Epochs | No trace of learning |
3 | 0.001 | Above 10 - 50 (No stopping) | Quite some trace of learning |
I will update the other findings successively.
@hosseinfani
This time, I ran FNN on the dblp mt120.ts3 data (distribution t34285.s18163.m5381) for 25 epochs (to quickly gather insights). FNN was run on
Previously, by mistake, the random matrices replaced the entire vecs['skill'] with a random skill matrix. I corrected this by replacing only the train and valid splits of the skill matrix with a random one of the appropriate size. The comparison of the results came out interesting: again, we see a meaningful result only when the LR is 0.001.
For actual data (non-emb) vs. random data (non-emb random updated), we got 100% of the expected result when the LR is 0.001. For the other LRs, the random AUC-ROC sometimes outperforms the original one in the uniform and positive cross-entropy setups! Also, the random test is validated by consistently bad results (red values in the non-emb random updated column) compared to the random results of the non-emb random column and the other results.
The results are below:
But one mysterious point: if the skill matrix is replaced by the random matrix (entirely), why does it produce the best scores? (Given the random nature of the matrix, it should not contribute to a good score overall.) You can see that non-emb random always gets a slightly better score than its counterpart in the non-emb column in the LR 0.001 section.
There are more inferences to be made from these scores. We can clearly see the concentration of good results (including the best) in the LR 0.001 section, mostly in the weighted section. I am currently ignoring the unigram_b negative sampling because it takes too much time per training instance and consistently produces bad results throughout the experiments. I am trying to optimize the metapath2vec model to check whether it crosses the baseline FNN scores (which should theoretically be true). I will also add other GNN methods accordingly to get more trends and scores.
Hi @jamil2388 Thanks for the update. Nice that we found the issue finally :) Just a quick reminder about our research questions, which are directing our research:
RQ1. Which gnn is the best for our task using transfer learning?
RQ2. Which dimension is the best?
RQ3. Which classifier is the best (fnn vs. bnn vs. negative samplings)?
@hosseinfani Thanks for responding. While doing experiments, I honestly lost track of the research questions; this will keep me on track, thanks for the heads up ^_^ I collected some results that might at least give us a trend. I will organize and post them soon.
@hosseinfani, I was looking at the filtered data generation process (for mt120.ts3) and found something unusual. The following error occurs:
team.py:167: RuntimeWarning: invalid value encountered in cast
data_[j] = team.get_one_hot(s2i, c2i, l2i, location_type)
Despite this error, the filtered data gets generated and does not show any error elsewhere, but the sparse matrix containing team ids (vecs['id']) has a lot of zeros in it (meaning the same id appears for multiple teams!). I presume it occurs because of the use of location and location-based indices. Is there any workaround for this method? The Team.Bucketing method takes such location-based arguments.
Both of these lines raise the error: https://github.com/fani-lab/OpeNTF/blob/03dbaa242559f91687b2282ea58f17c277082fb1/src/cmn/team.py#L161 https://github.com/fani-lab/OpeNTF/blob/03dbaa242559f91687b2282ea58f17c277082fb1/src/cmn/team.py#L167
Is there any way for me to ignore the location-based arguments? Any modification you might suggest? I just need to generate some filtered data. Thanks!
@hosseinfani
I added the file fbnn.py for my testing to check the new bnn. As we discussed, I took this library from https://github.com/IntelLabs/bayesian-torch/tree/main
The code runs on the full datasets, but some of the results I have so far are lower than the previous version of bnn, so it needs some basic tuning and then some specific tuning. But I am quite unsure about the internals of this model. Could you please take a look at it? I need to make sure the model is logically correct and get some instruction on how to tune it.
It sits under the opentf pipeline; I replaced only the train, valid and test calculation portions with the ones suggested by the library. Otherwise, all the data handling parts are very similar to bnn.
The parameter settings at init: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L44
train calculation: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L153
valid calculation: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L165
test calculation: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L240
Thanks a lot!
@hosseinfani Also, I am using the basic cross-entropy loss for the loss part in train and valid. With the existing workflow, can I just replace that part with the uniform sampling part? (I am currently only trying to focus on the uniform sampling technique.)
cross-entropy with 'none' negative sampling: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L159
the negative sampling counterpart in bnn: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/bnn.py#L168
Please note that I don't have the sample_elbo calculation which gives us the layer_loss in the bnn counterpart; there, the layer_loss is aggregated along with the loss calculation. This part has been skipped in my fbnn for the time being (a sketch of the library's intended usage follows below).
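For reference, this is how I understand the library's intended usage from the bayesian-torch README (a sketch only; the prior values are the README's example settings, not tuned for our data, and this is not yet what fbnn.py does):

```python
import torch
from bayesian_torch.models.dnn_to_bnn import dnn_to_bnn, get_kl_loss

bnn_prior = {
    "prior_mu": 0.0,
    "prior_sigma": 1.0,
    "posterior_mu_init": 0.0,
    "posterior_rho_init": -3.0,
    "type": "Reparameterization",  # or "Flipout"
    "moped_enable": False,
}

# a toy deterministic network converted in place to a Bayesian one
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
dnn_to_bnn(model, bnn_prior)

criterion = torch.nn.CrossEntropyLoss()
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
out = model(x)
kl = get_kl_loss(model)                    # the KL/ELBO-style term my fbnn currently skips
loss = criterion(out, y) + kl / x.size(0)  # scaled by batch size, as in the README
loss.backward()
```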
Starting this issue to track the learning progress on GNNs.