fani-lab / OpeNTF

Neural machine learning methods for the Team Formation problem.

GNN for Graph based embedding generation #217

Open jamil2388 opened 8 months ago

jamil2388 commented 8 months ago

Starting this issue to track my learning progress on GNNs.

jamil2388 commented 8 months ago

I started a new project where I am trying to run a sample node-level classification task on the 'Cora' dataset. I created the model and proceeded with the forward pass properly; I understand how GCNConv layers perform one hop of message passing per layer. But something is wrong with the backward pass: the weights are not being updated, so no learning is happening. I am trying to address this issue.
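A minimal sketch of the kind of setup I am debugging (layer sizes and hyperparameters below are placeholders, not my exact code); if the zero_grad / backward / step sequence is broken, the weights never update, which matches the symptom:

```python
# Minimal sketch (not the project code): a 2-layer GCN on Cora with PyG.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root='data/Cora', name='Cora')
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_features, 16)  # first hop of message passing
        self.conv2 = GCNConv(16, dataset.num_classes)   # second hop

    def forward(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

model = GCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()   # without backward() + step(), no weights are ever updated
    optimizer.step()
```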

Meanwhile, I also tried to understand how PyG stores the graph structure for different datasets. It turns out there is a unified structure for all graph types, torch_geometric.data.Data, and I learned how the attributes x, y, and edge_index collectively store the graph information.
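As a small illustration of that unified structure (toy numbers, not a real dataset):

```python
# Toy torch_geometric.data.Data: 3 nodes, 2 undirected edges (stored as 4 directed ones).
import torch
from torch_geometric.data import Data

x = torch.tensor([[1.0], [2.0], [3.0]])        # node features: [num_nodes, num_node_features]
y = torch.tensor([0, 1, 0])                    # node labels
edge_index = torch.tensor([[0, 1, 1, 2],       # source nodes
                           [1, 0, 2, 1]])      # target nodes
data = Data(x=x, y=y, edge_index=edge_index)
print(data)  # Data(x=[3, 1], edge_index=[2, 4], y=[3])
```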

jamil2388 commented 8 months ago

I can now create simple homogeneous graph data. Next I will build a larger dataset for a sample model, and then aim to learn heterogeneous graph data generation.
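From the PyG docs, the heterogeneous counterpart looks roughly like this (toy node/edge types and sizes, just to note what I am aiming for):

```python
# Toy HeteroData with skill / member / team node types and two edge types.
import torch
from torch_geometric.data import HeteroData

data = HeteroData()
data['skill'].x = torch.randn(5, 1)     # 5 skill nodes
data['member'].x = torch.randn(3, 1)    # 3 member (expert) nodes
data['team'].x = torch.randn(2, 1)      # 2 team nodes

# which skills / members belong to which teams
data['skill', 'to', 'team'].edge_index = torch.tensor([[0, 1, 2], [0, 0, 1]])
data['member', 'to', 'team'].edge_index = torch.tensor([[0, 1, 2], [0, 1, 1]])
print(data)
```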

jamil2388 commented 8 months ago

@hosseinfani

jamil2388 commented 8 months ago

The sample GCN now runs with proper learning and the expected accuracy. Changes performed:

Next, I need to implement proper connections between these purpose-built functions.

jamil2388 commented 8 months ago
hosseinfani commented 8 months ago

Hi @jamil2388 thanks for the update.

Can you post a link to the exact line of code? I'm not sure I understand the logic of that line (we shouldn't have that line).

jamil2388 commented 8 months ago

Sorry @hosseinfani, I couldn't find a way to post a link to the exact line of code, but here is its location: it is line 84 of the Publication class on the cikm22 branch.

hosseinfani commented 8 months ago

@jamil2388

you mean this line? https://github.com/fani-lab/OpeNTF/blob/b9357eb8b89af43333ed15218fe20d3dfa77ba62/src/cmn/publication.py#L84

Now I remember it :D I wanted to run the pipeline for only the first nrow rows of the sparse matrix (kind of a hidden feature for us, not for general users :D).

By the way, if you click next to a code line on GitHub, you'll get the option to create a permalink to that exact line.

jamil2388 commented 8 months ago

Yes, exactly this line. I just needed to read the data without generating the sparse matrix initially. So, should I comment this line out, or just work with this modification for the time being? "if 'nrow' in settings['data']['domain']['dblp'].keys() and len(teams) > settings['data']['domain']['dblp']['nrow']: break"

jamil2388 commented 8 months ago

Mentioning @mahdis-saeedi in this thread to keep her posted on updates in this GNN line of work.

hosseinfani commented 8 months ago

@jamil2388 No, the settings should have data > domain > {domain name} in param.py, but the existence of nrow is optional.
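i.e., roughly like the sketch below (placeholder values, not the exact param.py content):

```python
# Sketch of the relevant part of the settings (placeholder values).
settings = {
    'data': {
        'domain': {
            'dblp': {
                'nrow': 1000,  # optional: cap on how many teams are read; omit to read everything
            },
        },
    },
}

# The quoted guard then simply never fires when 'nrow' is absent:
teams = []
for raw_team in range(5000):  # stand-in for iterating over the raw dblp records
    teams.append(raw_team)
    dblp = settings['data']['domain']['dblp']
    if 'nrow' in dblp and len(teams) > dblp['nrow']:
        break
```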

jamil2388 commented 8 months ago

@hosseinfani please correct me if I am wrong and let me know what I am missing https://github.com/fani-lab/OpeNTF/blob/b9357eb8b89af43333ed15218fe20d3dfa77ba62/src/cmn/team.py#L94C9-L94C9

hosseinfani commented 7 months ago

I'll be in lab after 4pm. We can review it together. thanks

@jamil2388

jamil2388 commented 7 months ago
jamil2388 commented 7 months ago

Troubleshooting

jamil2388 commented 7 months ago
hosseinfani commented 7 months ago

@jamil2388 thanks for the update. Can you direct me to the code?

jamil2388 commented 7 months ago

@hosseinfani As I am working on my own fork, I am including the link to the main function of gnn_emb.py that is attempting to do the job : https://github.com/jamil2388/OpeNTF_Jamil/blob/7f81801760de926d1ade24b432c84d8483a5dd00/src/mdl/gnn_emb.py#L55

hosseinfani commented 7 months ago

@jamil2388 I'll be in lab 12-2pm. Let's do a quick code review. Thanks.

jamil2388 commented 7 months ago
jamil2388 commented 7 months ago
hosseinfani commented 7 months ago

@jamil2388 Thanks for the update. Please do the following:

Will talk to you on Wed for a quick code review.

jamil2388 commented 7 months ago

@hosseinfani

jamil2388 commented 7 months ago

@hosseinfani I have refactored the code. You can take a look at the following portions

hosseinfani commented 7 months ago

Hi @jamil2388 I made a substantial change to your code. It's still not complete, though; I'll probably finish it by Monday. Please have a look while I'm finishing the refactor.

You need to code more efficiently as we are going to work with large-scale graphs.

I'll talk to you soon.

jamil2388 commented 7 months ago

@hosseinfani Thank you so much for the changes. I am looking into them in the meantime.

jamil2388 commented 6 months ago

@hosseinfani Some updates regarding the changes that have been incorporated, and also the ones ready to deploy from my local copy to this repo:

Here, I did not use any test phase; the losses are from the training phase. Also, because of the mini-batching issue over multiple edge types that I mentioned, I had to use the entire training data in one go (unbatched); only then did the model run successfully. I have the test loss calculated, but it definitely needs some correction.

Issues :

I will update some learnings of mine on a later post. Thanks!

hosseinfani commented 6 months ago

Hi @jamil2388 Thanks for the update. Please integrate them into our pipeline asap. So, for now, we can run the pipeline for the homogeneous graphs for available gnn methods.

hosseinfani commented 6 months ago

@jamil2388 I'm thinking that at the gnn phase, we give the entire graph for training, so there is no need for train/test splits.

However, later, when we create the graphs, we do the split at the graph generation phase.

jamil2388 commented 6 months ago

@hosseinfani, I uploaded the gs_layer class, which holds the GraphSAGE layer definitions. In my local experiment, I kept separate structures for separate models, like this: GCN (init_model, train, learn) and GCN_Layer (the layers for the GCN model);

GS (init_model, train, learn) and GS_Layer (the layers for the GraphSAGE model).

Right now, though, the GS and GCN classes have exactly the same implementation. I did not add a GS class file because the GCN class might still be undergoing refactoring by you. I created separate layer class files because those classes need to inherit from torch.nn.Module.

One thing I feel is that the common parts of the GCN and GS classes (and possibly of other GNN models) could be moved into the existing GNN class, which is already designed to hold the shared work. Then, from the GNN class, we would create specific instances of the models (gs_layer or gcn_layer) based on the parameters.
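A rough sketch of what I mean (class and parameter names are placeholders, and the loss here is a stand-in, not the real objective):

```python
# Sketch of the proposed factoring (placeholder names; the loss is a stand-in).
import torch
from torch_geometric.nn import GCNConv, SAGEConv

class GCNLayers(torch.nn.Module):  # model-specific layer stack (inherits torch.nn.Module)
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1, self.conv2 = GCNConv(in_dim, hidden_dim), GCNConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        return self.conv2(torch.relu(self.conv1(x, edge_index)), edge_index)

class GSLayers(torch.nn.Module):   # GraphSAGE counterpart
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1, self.conv2 = SAGEConv(in_dim, hidden_dim), SAGEConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        return self.conv2(torch.relu(self.conv1(x, edge_index)), edge_index)

class GNN:  # shared driver: holds the common init_model / train / learn logic
    def __init__(self, model_name, in_dim, hidden_dim, out_dim):
        layers = {'gcn': GCNLayers, 'gs': GSLayers}
        self.model = layers[model_name](in_dim, hidden_dim, out_dim)

    def learn(self, data, epochs=100, lr=0.01):
        opt = torch.optim.Adam(self.model.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            emb = self.model(data.x, data.edge_index)
            loss = emb.norm()  # placeholder objective; the real loss depends on the task
            loss.backward()
            opt.step()
        return emb.detach()
```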

hosseinfani commented 6 months ago

@jamil2388 Let's have a meeting and do some pair programming. I'll be available this week in lab.

jamil2388 commented 6 months ago

@hosseinfani that would be great.

jamil2388 commented 6 months ago

@hosseinfani, I am adding the following feature summary of the models we currently have ready to run:

| Model | Homogeneous | Heterogeneous | Undirected | Directed | Duplicated edges |
|---|---|---|---|---|---|
| Node2vec | Yes | No | Need to confirm | Need to confirm | Yes |
| Metapath2vec | No | Yes | Need to confirm | Need to confirm | Yes |
| GCN | Yes | No | Yes | No | Yes |
| GraphSAGE | Yes | Yes | Yes | No | Yes |
| GAT | Yes | Yes | Yes | No | Yes |
| GIN | Yes | Yes | Yes | No | Yes |
mahdis-saeedi commented 6 months ago

Here, some properties of different GNN models are categorized, and the models that support heterogeneous graphs are noted: https://pytorch-geometric.readthedocs.io/en/latest/cheatsheet/gnn_cheatsheet.html

jamil2388 commented 6 months ago

@hosseinfani I updated the repo with the GAT and GIN model classes (gat_layer and gin_layer); they work on both homogeneous and heterogeneous graphs, and the table above is updated accordingly. I also found that with negative_sampling disabled, the models produce much better test loss values than the previous (very bad) test losses with negative_sampling enabled.

I need some direction and help from you. As you mentioned previously regarding node2vec and metapath2vec, you were editing some portions of node2vec in the main pipeline. I was planning to refactor my gnn classes according to the structure you set up in the pipeline for node2vec or metapath2vec; if I wrote the same code independently, it would likely end up inconsistent and error-prone. Right now, my gnn classes are waiting to be included in the pipeline (main.py), and I was also planning to modify gnn.py to hold the generalized portions shared by all the gnn models. A pair programming or discussion session would be great for me, if possible on your side. Thanks!

hosseinfani commented 6 months ago

@jamil2388 Thank you for the update. I'm busy finalizing my courses, but we can meet early next week, Monday ....

jamil2388 commented 6 months ago

@hosseinfani, the preprocessed folder now contains embeddings for dblp, imdb, and uspt (GS, GCN, GAT, GIN). Except for GAT, all of these models were run on CUDA; GAT is causing a CUDA out-of-memory error, which I am still trying to figure out (probably the computation cost blows up when we set heads = 8, the standard value taken from the paper). Other than that, the timings and losses for the trainings are in the "Emb" sheet of this file:

https://docs.google.com/spreadsheets/d/1pz86JQ0a8XeX0AeXt07ayOVE7cat3Qw0FapuqIrzRR0/edit?usp=sharing

jamil2388 commented 5 months ago

Currently, the results of test runs of OpeNTF with different sets of generated embeddings are logged in this Google Sheets document: https://docs.google.com/spreadsheets/d/1AA5QCAVnKOjTAj2lNqHzO53mxQnM25zZlcBEprShOlM/edit?usp=sharing

jamil2388 commented 5 months ago
| Train/Test Split | Prediction File Size | Evaluation Status |
|---|---|---|
| 0.85 | 22 GB | Never completes |
| 0.95 | 7.6 GB | Sometimes gets killed, sometimes holds on |
| 0.99 | 1.5 GB | Completes |
jamil2388 commented 5 months ago

@hosseinfani, while trying to produce gnn->fnn results from the imdb mt5.ts2 dataset, I failed many times because of the loading times of the prediction files in the eval phase (mentioned in the earlier comment). Now that I have some outcomes for the mt75.ts3 dataset (with split_ratio 0.85) (you can check some roc_auc scores at the previously mentioned link https://docs.google.com/spreadsheets/d/1AA5QCAVnKOjTAj2lNqHzO53mxQnM25zZlcBEprShOlM/edit?usp=sharing), the prediction file size has come down drastically to ~40 MB! But I am concerned about a few things:

Thanks!

jamil2388 commented 5 months ago

Notes on a finding about the GNN batching issue:

Problem: while generating mini-batches from a dataset with the LinkNeighborLoader (meant specifically for link prediction), a single mbatch (mini-batch) has several components. For example, from the train_data split of graph type stm (Skill - Team, Member - Team), a generated mbatch contains the following parts:

 HeteroData(
  member={
    x=[3, 1],
    n_id=[3],
  },
  team={
    x=[9, 1],
    n_id=[9],
  },
  skill={
    x=[5, 1],
    n_id=[5],
  },
  (skill, to, team)={
    edge_index=[2, 3],
    edge_attr=[3],
    edge_label=[4],
    edge_label_index=[2, 4],
    e_id=[3],
    input_id=[2],
  },
  (member, to, team)={
    edge_index=[2, 3],
    edge_attr=[3],
    edge_label=[16],
    edge_label_index=[2, 16],
    e_id=[3],
  },
  # reverse edge types
  (team, rev_to, skill)={
    edge_index=[2, 6],
    edge_attr=[6],
    e_id=[6],
  },
  (team, rev_to, member)={
    edge_index=[2, 0],
    edge_attr=[0],
    e_id=[0],
  }
)

We can see that n_id and edge_label_index get generated, and if negative sampling is enabled, there will be negative edges in edge_label_index with corresponding edge_labels of 0. The problem is this: one would expect the node ids mentioned in edge_label_index to be present in the n_id list of the corresponding node types. But unfortunately, if we try to map the ids in edge_label_index to their respective node types' n_id lists, they do not match.

Solution: as discussed by the pyg team in this Slack thread https://torchgeometricco.slack.com/archives/C01DN0B3B1N/p1701860291891099?thread_ts=1701354805.778269&cid=C01DN0B3B1N and in another small reference about "mapping n_id back" here https://github.com/pyg-team/pytorch_geometric/discussions/7797#discussioncomment-6549639

The values in the edge_label_index of an mbatch are locally generated indices: they point to positions within the global n_id lists, unlike the global n_ids stored per node type.

Example: in the mbatch above we have:

| | n_id of the nodes | edge_label_index (local indices) | edge_label_index (mapped to n_ids) |
|---|---|---|---|
| skill nodes | 3, 5, 7, 9, 1 | 1, 0, 3, 2 | 5, 3, 9, 7 |
| team nodes | 3, 7, 25, 2, 16, 17, 13, 10, 29 | 2, 0, 1, 0 | 25, 3, 7, 3 |

In order to work with the edge_label_index values as in this example, I had to incorporate a mapping like the one below:

mbatch['skill'].n_id[mbatch['skill','to','team'].edge_label_index[0]]  # row 0: source (skill) side
mbatch['team'].n_id[mbatch['skill','to','team'].edge_label_index[1]]   # row 1: destination (team) side

This produces the edge_label_index of edge type skill_to_team with the actual n_ids of the skill and team nodes, as shown in the last column.
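Putting it together, the loop I ended up with looks roughly like this (the toy graph, loader arguments, and batch size are placeholders):

```python
# Toy end-to-end sketch: build a small stm-style HeteroData, create a LinkNeighborLoader,
# and map each mini-batch's local edge_label_index back to global node ids via n_id.
import torch
from torch_geometric.data import HeteroData
from torch_geometric.loader import LinkNeighborLoader

train_data = HeteroData()
train_data['skill'].x = torch.randn(5, 1)
train_data['team'].x = torch.randn(9, 1)
train_data['member'].x = torch.randn(3, 1)
train_data['skill', 'to', 'team'].edge_index = torch.tensor([[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]])
train_data['member', 'to', 'team'].edge_index = torch.tensor([[0, 1, 2], [0, 1, 2]])

loader = LinkNeighborLoader(
    train_data,
    num_neighbors=[2, 2],                 # neighbors sampled per hop
    edge_label_index=(('skill', 'to', 'team'),
                      train_data['skill', 'to', 'team'].edge_index),
    neg_sampling_ratio=1.0,               # negatives get edge_label 0
    batch_size=2,
)

for mbatch in loader:
    local = mbatch['skill', 'to', 'team'].edge_label_index  # local positions, not global ids
    global_skill = mbatch['skill'].n_id[local[0]]            # row 0 = source (skill) side
    global_team = mbatch['team'].n_id[local[1]]              # row 1 = destination (team) side
    # global_skill / global_team now hold the actual node ids, as in the last column above
```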

jamil2388 commented 3 months ago

@hosseinfani I am trying to confirm whether my approach to testing with random data is correct.

https://github.com/fani-lab/OpeNTF/blob/9f00fd021a9da12cce7ccd2f59df940b45361161/src/main.py#L183

Here, these conditions are only satisfied when I set emb_model (to any gnn model) and emb_random to a value from 0 to 3. Below, consider emb to be the gnn embedding and emb_skill to be the embedding of only the skills.

- emb_random = 0 -> dot product of vecs['skill'] with emb_skill as it is (no randomness); output shape (n_teams x n_dimensions)
- emb_random = 1 -> dot product of vecs['skill'] with emb_skill, where emb_skill holds random embedding data; output shape (n_teams x n_dimensions)
- emb_random = 2 -> random sparse matrix in place of vecs['skill'], with values in (0, 1), of shape (n_teams x n_dimensions)
- emb_random = 3 -> random sparse matrix in place of vecs['skill'], with values in (0, 1), of shape (n_teams x n_skills)
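A sketch of how I build the random variants for options 1 and 3 (sizes and density below are placeholders, not the real dblp dimensions):

```python
# Sketch of the random variants (sizes and density are placeholders, not the real dblp shapes).
import numpy as np
import scipy.sparse as sp

n_teams, n_skills, n_dim = 1000, 300, 64

vecs_skill = sp.random(n_teams, n_skills, density=0.01, format='csr')  # stands in for vecs['skill']
vecs_skill.data[:] = 1.0                                                # binary team-by-skill matrix

# emb_random = 1: keep vecs['skill'], but use a random skill embedding before the dot product
emb_skill_random = np.random.rand(n_skills, n_dim)
team_inputs_1 = vecs_skill @ emb_skill_random        # shape: (n_teams, n_dimensions)

# emb_random = 3: replace vecs['skill'] itself with a random 0/1 sparse matrix of the same shape
rand_skill = sp.random(n_teams, n_skills, density=0.01, format='csr')
rand_skill.data[:] = 1.0
team_inputs_3 = rand_skill                            # shape: (n_teams, n_skills)
```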

I am focused mostly on emb_random = 1 and 3 because they introduce randomness into the gnn skill embedding and the skill matrix, respectively. In short, I am only replacing the skill part of the teamsvecs sparse matrix with randomness and then feeding it into the FNN or BNN. My confusion is this: while training, the FNN or BNN will learn to map these random skills to the actual experts in its own way, eventually adjusting to predict the correct experts. That means whatever we feed as skills, the model will compare the predicted experts against vecs['member'] and gradually adjust its predictions to eventually output the correct experts (even with the wrong sets of skills). Is my approach correct, or am I looking at it the wrong way?

Thanks!

hosseinfani commented 3 months ago

Hi @jamil2388 Thanks for the update. You're right, options 1 and 3 make sense.

Regarding your question: if an expert works on a skill across many teams, like Jamil on GNN, then when we shuffle the skills of Jamil's teams, the model will learn other skills for Jamil, like ML. Then during the test, a test team needing ML would get Jamil rather than a GNN team, so the test results should drop. In other words, when we randomize the skills of experts, the model ignores the specialty of experts in a few skills.

I'm in the lab today. we can talk more on this.

jamil2388 commented 3 months ago

@hosseinfani, I will present some observations on hyperparameter tuning for FNN across several comments here.

Effect of Learning Rate (LR)

FNN was run with LRs 0.1, 0.01, and 0.001, and with different types of losses and negative sampling (none, uniform, unigram, unigram_b, weighted, positive cross entropy).

My theoretical expectation is that, with a decent enough adjustment of the hyperparameters, I should see a better train vs. validation loss curve, which would imply that the model is actually learning. So I observed the patterns of the loss curves for these different LRs. LR 0.001 clearly shows the model learning at least something from the data, unlike the erratic behavior with LR 0.1 and 0.01. The following figure is for the dblp mt100.ts5 data with distribution t2151.s4289.m3373:

[Figure s1: train vs. validation loss curves; rows 1-3 correspond to LR 0.1, 0.01, and 0.001]

You can see in the figure that Row 3 (LR = 0.001) clearly has a more interpretable learning curve than the other two setups in Rows 1 and 2. My understanding of the ideal learning curve is guided by this article: https://rstudio-conf-2020.github.io/dl-keras-tf/notebooks/learning-curve-diagnostics.nb.html#:~:text=An%20optimal%20fit%20is%20one,zero%20in%20an%20ideal%20situation).

Each training run also uses early stopping, which stops training whenever the validation loss has not improved for patience = 5 epochs. Since the trainings were run for 50 epochs, early stopping around the halfway mark (25 epochs) or later would imply that the validation loss was still improving. Coincidentally, the average early-stopping points in these setups also show a logical pattern. Overall, I infer that LR 0.001 allowed the model to train longer and better, based on these observations and on the AUC-ROC scores listed in this document https://docs.google.com/spreadsheets/d/1AA5QCAVnKOjTAj2lNqHzO53mxQnM25zZlcBEprShOlM/edit?usp=sharing Below is the overall inference:

| Row | Learning Rate (LR) | Early Stopping (mostly around) | Inference |
|---|---|---|---|
| 1 | 0.1 | 5-8 epochs | No trace of learning |
| 2 | 0.01 | 5-12 epochs | No trace of learning |
| 3 | 0.001 | Above 10, up to 50 (no stopping) | Quite some trace of learning |
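For reference, the early-stopping rule behind these numbers is essentially the following schematic (the train/valid functions are stand-ins; only patience = 5 and the 50-epoch cap match the actual runs):

```python
# Schematic of the early stopping in these runs: stop when the validation loss has not
# improved for `patience` consecutive epochs (patience = 5, at most 50 epochs).
import math, random

def train_one_epoch():    # stand-in for the actual training step
    return random.random()

def evaluate_on_valid():  # stand-in for the actual validation pass
    return random.random()

patience, max_epochs = 5, 50
best_valid, bad_epochs = math.inf, 0

for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    valid_loss = evaluate_on_valid()
    if valid_loss < best_valid:
        best_valid, bad_epochs = valid_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # e.g. the LR 0.1 runs mostly stopped around epochs 5-8
```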

I will update the other findings successively.

jamil2388 commented 3 months ago

@hosseinfani

Comparison against Random and Metapath2Vec baselines :

This time, I ran FNN on the dblp mt120.ts3 data (distribution t34285.s18163.m5381) for 25 epochs (to quickly gather insights). FNN was run on:

  1. Sparse Matrices, [non-emb]
  2. Random Sparse Matrices (the mistaken one), [non-emb random]
  3. Random Sparse Matrices (the corrected one), [non-emb random updated]
  4. Metapath2Vec Dense Matrices (size - 32, 64, 128, epoch - 100, negative sampling - 5 etc.) [m2v.e100.dX]

The mistaken random matrices had the entire vecs['skill'] replaced by a random skill matrix. I corrected this by replacing only the train and validation splits of the skill matrix with random matrices of the appropriate size. The comparison of results came out interesting; again, we see a meaningful result only when the LR is 0.001.

For actual data (non-emb) vs. random data (non-emb random updated), we got exactly the expected result when the LR is 0.001. For the other LRs, the random AUC-ROC sometimes outperforms the original in the uniform and positive cross entropy setups! Also, the random test is corroborated by the consistently bad results (red values in the non-emb random updated column) compared to the random results in the non-emb random column and the other results.

The results are below: [Figure s2: AUC-ROC scores for the non-emb, non-emb random, non-emb random updated, and m2v setups across LRs and negative sampling types]

But one mysterious point: if the skill matrix is replaced entirely by the random matrix, why does it produce the best scores? (Considering the random nature of the matrix, it should not contribute to a good score overall.) You can see that non-emb random always gets a slightly better score than its counterpart in the non-emb column in the LR 0.001 section.

There are more inferences to be made from these scores. We can clearly see the concentration of good results (including the best) in the LR 0.001 section, mostly in the weighted section. I am currently ignoring the unigram_b negative sampling because it takes too much time per training instance and consistently produces bad results throughout the experiments. I am trying to optimize the metapath2vec model to check whether it crosses the baseline FNN scores (which, theoretically, should be the case). I will also add the other GNN methods accordingly to get more trends and scores.

hosseinfani commented 2 months ago

Hi @jamil2388 Thanks for the update. Nice that we found the issue finally :) Just a quick reminder about our research questions, which are directing our research:

RQ1. Which gnn is the best for our task using transfer learning?
RQ2. Which dimension is the best?
RQ3. Which classifier is the best (fnn vs. bnn vs. negative samplings)?

jamil2388 commented 2 months ago

@hosseinfani Thanks for responding. While running experiments, I honestly lost track of the research questions; this will keep me on track, thanks for the heads-up ^_^ I have collected some results which should at least show a trend. I will organize and post them soon.

jamil2388 commented 2 months ago

@hosseinfani, I was looking at the filtered data generation process (for mt120.ts3) and found something unusual. The following error occurs:

team.py:167: RuntimeWarning: invalid value encountered in cast
  data_[j] = team.get_one_hot(s2i, c2i, l2i, location_type)

Despite this warning, the filtered data that gets generated does not show any error elsewhere, but the sparse matrix containing team ids (vecs['id']) has a lot of zeros in it (meaning the same id appears for multiple teams!). I presume it occurs because of the use of location and location-based indices. Is there any workaround for this? The Team.Bucketing method takes such location-based arguments.

https://github.com/fani-lab/OpeNTF/blob/03dbaa242559f91687b2282ea58f17c277082fb1/src/cmn/team.py#L150

Both of these lines raise the error:
https://github.com/fani-lab/OpeNTF/blob/03dbaa242559f91687b2282ea58f17c277082fb1/src/cmn/team.py#L161
https://github.com/fani-lab/OpeNTF/blob/03dbaa242559f91687b2282ea58f17c277082fb1/src/cmn/team.py#L167

Is there any way for me to ignore the location-based arguments, or any modification you might suggest? I just need to generate some filtered data. Thanks!

jamil2388 commented 1 month ago

@hosseinfani

I added the file fbnn.py for my testing of the new bnn. As we discussed, I took the library from https://github.com/IntelLabs/bayesian-torch/tree/main

The code runs on the full datasets, but some of the results I have so far are lower than the previous version of bnn, so it needs some basic tuning and then some model-specific tuning. I am also quite unsure about the internals of this model. Could you please take a look at it? I need to make sure the model is logically correct and get some guidance on how to tune it.

It is under the opentf pipeline; I replaced only the train, validation, and test calculation portions with the ones instructed by the library. Otherwise, all the data handling parts are very similar to bnn.

The parameter settings at init: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L44
Train calculation: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L153
Valid calculation: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L165
Test calculation: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L240
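For the review, the pattern I followed is my reading of the library's README: convert the deterministic network with dnn_to_bnn and add the KL term from get_kl_loss to the data loss. A rough sketch (layer sizes, prior values, and the toy batch are placeholders, not the fbnn.py settings):

```python
# My reading of the bayesian-torch README pattern (not the exact fbnn.py code):
# convert a deterministic FNN in place, then add the KL term to the data loss.
import torch
from bayesian_torch.models.dnn_to_bnn import dnn_to_bnn, get_kl_loss

model = torch.nn.Sequential(                      # stand-in for our FNN architecture
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))

bnn_prior = {                                     # prior/posterior settings passed at init
    "prior_mu": 0.0, "prior_sigma": 1.0,
    "posterior_mu_init": 0.0, "posterior_rho_init": -3.0,
    "type": "Reparameterization", "moped_enable": False, "moped_delta": 0.5,
}
dnn_to_bnn(model, bnn_prior)                      # replaces the Linear layers with Bayesian ones

x = torch.randn(16, 128)                          # toy batch: 16 samples, 128 skill dims
y = (torch.rand(16, 32) > 0.9).float()            # toy multi-hot expert targets
criterion = torch.nn.BCEWithLogitsLoss()

out = model(x)
loss = criterion(out, y) + get_kl_loss(model) / x.size(0)  # data loss + scaled KL term
loss.backward()
```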

Thanks a lot!

jamil2388 commented 1 month ago

@hosseinfani Also, I am using the basic cross-entropy loss function for the loss in train and valid. With the existing workflow, can I just replace this part with the uniform sampling part? (I am currently only trying to focus on the uniform sampling technique.)

Cross entropy with 'none' negative sampling: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/fbnn.py#L159
The negative sampling counterpart in bnn: https://github.com/fani-lab/OpeNTF/blob/14ada7794b24229b5431008e5463bd204250df20/src/mdl/bnn.py#L168

Please note that I don't have the sample_elbo calculation that gives us the layer_loss in the bnn counterpart, where the layer_loss is aggregated along with the loss calculation. This part has been skipped in my fbnn for the time being.
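To make sure we mean the same thing by the uniform part, my reading of it is sketched below (a schematic of uniformly sampled negative experts per team, not the actual bnn.py code; the helper name is made up):

```python
# Schematic of uniformly sampled negatives (made-up helper, not the bnn.py implementation):
# per team, compute the loss over the positive experts plus ns uniformly drawn negatives.
import torch

def uniform_ns_loss(logits, targets, ns=5):
    """logits, targets: (batch, n_members); targets are multi-hot expert vectors."""
    bce = torch.nn.BCEWithLogitsLoss(reduction='none')
    losses = []
    for logit, target in zip(logits, targets):
        pos = target.nonzero(as_tuple=True)[0]         # the team's actual experts
        neg = torch.randint(0, target.numel(), (ns,))  # uniform random negatives (may overlap positives)
        idx = torch.cat([pos, neg])
        losses.append(bce(logit[idx], target[idx]).mean())
    return torch.stack(losses).mean()

loss = uniform_ns_loss(torch.randn(4, 100), (torch.rand(4, 100) > 0.95).float())
```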