karan96 opened 1 year ago
To summarize what we discussed today, the most probable reason for the bug I am facing is that some node ids are greater than len(nodes). During mini-batch generation, the neighbor sampler takes the edge index as (source_id, target_id) pairs and expects every node id to be within the total number of nodes in the graph, which in our case is 32971 (all expert, skill, and location nodes combined). The number of edges in our dataset is 149283. So whenever the sampler picks a node id from an edge in the range (0, 149283), it fails because that id is not within the range (0, 32971). The problem lies in the way the edge_list is generated. As discussed, I will try to remap the ids of experts, skills, and locations into the range (0, 32971), regenerate the edge list, and then try embedding generation.
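The remapping step above can be sketched as follows. This is a minimal illustration, not OpeNTF's actual code: the function and variable names (`build_id_map`, `remap_edges`, `expert_ids`, etc.) are hypothetical, and it only shows the idea of assigning each node a contiguous id in [0, num_nodes) before rewriting the edge list.

```python
# Hypothetical sketch: remap raw expert/skill/location ids to a
# contiguous range [0, num_nodes) so the neighbor sampler never sees
# a node id outside the node count.

def build_id_map(expert_ids, skill_ids, loc_ids):
    """Assign each unique raw id a new id in [0, total_nodes)."""
    id_map = {}
    for raw_id in list(expert_ids) + list(skill_ids) + list(loc_ids):
        if raw_id not in id_map:
            id_map[raw_id] = len(id_map)
    return id_map

def remap_edges(edges, id_map):
    """Rewrite (source, target) pairs using the contiguous ids."""
    return [(id_map[s], id_map[t]) for s, t in edges]

# Raw ids may exceed the node count; remapped ids never do.
id_map = build_id_map(["e1", "e2"], ["s900"], ["loc5"])
edges = remap_edges([("e1", "s900"), ("e2", "loc5")], id_map)
assert max(max(e) for e in edges) < len(id_map)
```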
@karan96 thanks for the update. We found the problem: it was due to incorrect adjacency file generation and id assignment.
@hosseinfani I was able to create the correct edge list this time. I tried running it on our system but ran into memory issues, so I will now run it on sharcnet, which was under maintenance and has since recovered. I will update you once I run the code on sharcnet.
- The file embeddings.pth is a standard file for our implementation (https://github.com/fani-lab/OpeNTF/issues/197) and Radin's work.
- Lab system: met space issues while running on the whole dataset.
- Graham: could not execute the code as it ran into errors; tried to resolve them with help from the sharcnet support team but could not do so.
- Peer's system: I have run the code on Yogeshwar's system and will update on any status changes.
Dr. @hosseinfani, I was able to generate the embeddings on Yogeshwar's system. The resulting embedding matrix has size (42427, 128), where 42427 is the total number of nodes: Experts (13631) + Skills (28796) + Loc (71). The shape that our implemented NN expects is (no. of teams x no. of experts/skills/locations); for example, for experts the matrix should look like (165496, 13631), where 165496 is the total number of teams in the dataset.
My question is: how should I restructure the obtained embeddings of size (42427, 128) into expert embeddings of size (165496, 13631), and likewise for skills and locations, so I can make a run on our NN?
Kindly suggest.
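Before restructuring, the (num_nodes, 128) matrix can at least be split into per-type blocks. A minimal sketch, assuming node ids were assigned experts first, then skills, then locations (the counts below are illustrative placeholders, not the real dataset sizes):

```python
import numpy as np

# Hypothetical sketch: split a (num_nodes, d) node-embedding matrix
# into per-type blocks by node-type counts, assuming the ordering
# experts -> skills -> locations. Counts here are toy values.
n_experts, n_skills, n_locs, d = 4, 3, 2, 8
emb = np.random.rand(n_experts + n_skills + n_locs, d)

expert_emb = emb[:n_experts]                       # rows 0 .. n_experts-1
skill_emb = emb[n_experts:n_experts + n_skills]    # next n_skills rows
loc_emb = emb[n_experts + n_skills:]               # remaining rows

assert expert_emb.shape == (n_experts, d)
```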
@karan96
You have to create an embedding for each team by averaging the embeddings of that team's skills (the skills in that team). The input matrix would then be (#Teams X 128). In OpeNTF, if you add the _emb option to the name of a baseline, it tries to find the embeddings of teams' skills in the input of the nn:
https://github.com/fani-lab/OpeNTF/blob/main/data/preprocessed/uspt/toy.patent.tsv/skill.docs.pkl
You simply need to replace the embedding file at
https://github.com/fani-lab/OpeNTF/blob/45aa32b1e32edc906d926c7f841a4ec089f34d18/src/main.py#L117
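The averaging step above can be sketched like this. It is an illustration only, not OpeNTF's implementation: `skill_emb` and `teams_skills` are hypothetical names, with toy sizes standing in for the real (28796, 128) skill matrix and the 165496 teams.

```python
import numpy as np

# Sketch of building one vector per team by averaging the embeddings
# of that team's skills, yielding a (#teams, d) input matrix.
d = 128
skill_emb = np.random.rand(10, d)           # one row per skill (toy size)
teams_skills = [[0, 3, 5], [2, 2, 7], [1]]  # skill ids of each team (toy)

team_emb = np.stack([skill_emb[ids].mean(axis=0) for ids in teams_skills])
assert team_emb.shape == (len(teams_skills), d)
```

A team with a single skill simply gets that skill's embedding; multi-skill teams get the element-wise mean.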
Greetings Dr. @hosseinfani, as discussed I ran the Location + Skills graph embedding with the following hyperparameters. The run with graph embeddings for skills only is currently in progress, and I will update you once it is done.
'bnn': {
    'l': [100],
    'lr': 0.1,
    'b': 4096,
    'e': 20,
    'nns': 3,
    'ns': 'uniform',
    's': 1
},
'emb': {'d': 100, 'e': 100, 'dm': 1, 'w': 1},
'nfolds': 5,
'train_test_split': 0.85
Here are the results:
| metric | mean |
| -- | -- |
| P_2 | 0.005494 |
| P_5 | 0.004698 |
| P_10 | 0.003923 |
| recall_2 | 0.002409 |
| recall_5 | 0.0051 |
| recall_10 | 0.008415 |
| ndcg_cut_2 | 0.005628 |
| ndcg_cut_5 | 0.005662 |
| ndcg_cut_10 | 0.006979 |
| map_cut_2 | 0.001976 |
| map_cut_5 | 0.00292 |
| map_cut_10 | 0.003596 |
| aucroc | 0.601312 |
Issue Page to Track Progress on GNN Implementation on USPT(A).