fani-lab / OpeNTF

Neural machine learning methods for the Team Formation problem.

GNN Implementation on USPT(A) #196

Open karan96 opened 1 year ago

karan96 commented 1 year ago

Issue page to track progress on the GNN implementation on USPT(A).

karan96 commented 1 year ago

To summarize what we discussed today, the most probable reason for the bug I am facing is that some node ids are greater than len(nodes). During mini-batch generation, the neighbor sampler takes the edge index as (source_id, target_id) pairs and expects every node id to fall within the total number of nodes in the graph, which in our case is 32971 (all expert, skill, and location nodes combined). Our dataset has 149283 edges, so whenever the sampler draws a node id that was written as a raw id in the range (0, 149283), it fails because that id is outside the range of 32971. The problem therefore lies in how the edge_list is generated. As discussed, I will remap the ids of experts, skills, and locations into the range (0, 32971), regenerate the edge list, and then try embedding generation, as sketched below.
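A minimal sketch of the remapping I have in mind (names are hypothetical, not the actual OpeNTF code):

```python
import torch

def remap(expert_ids, skill_ids, loc_ids, edges):
    """Remap raw expert/skill/location ids into one contiguous range
    [0, num_nodes) so every endpoint in edge_index is a valid node id
    for the neighbor sampler. edges: list of (source, target) raw ids."""
    new_id = {}
    for raw in list(expert_ids) + list(skill_ids) + list(loc_ids):
        new_id.setdefault(raw, len(new_id))  # assign ids 0, 1, 2, ...
    edge_index = torch.tensor([[new_id[s] for s, t in edges],
                               [new_id[t] for s, t in edges]], dtype=torch.long)
    assert edge_index.max().item() < len(new_id)  # now within node-id range
    return edge_index, new_id
```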

hosseinfani commented 1 year ago

@karan96 thanks for the update. We found the problem: it was due to incorrect adjacency file generation and id assignment.

karan96 commented 1 year ago

@hosseinfani I was able to create the correct edge list this time. I tried running it on our system but ran into memory issues, so I will now run it on SHARCNET, which was under maintenance and has since recovered. I will update you once I run the code there.

karan96 commented 1 year ago

The file embeddings.pth is a standard file for our implementation (https://github.com/fani-lab/OpeNTF/issues/197) and Radin's work. Status so far:

  1. Lab system: ran into disk-space issues when running on the whole dataset.
  2. Graham: could not execute the code as it ran into errors; I tried to resolve them with the SHARCNET support team but could not.
  3. Peer's system: I have run the code on Yogeshwar's system and will update you on any status changes.

karan96 commented 1 year ago

Dr. @hosseinfani, I was able to generate the embeddings on Yogeshwar's system. The resulting embedding matrix has shape (42427, 128), where 42427 counts all nodes: experts (13631) + skills (28796) + locations (71). The shape our implemented NN expects is (no. of teams × no. of experts/skills/locations); for example, for experts the input should look like (165496, 13631), where 165496 is the total number of teams in the dataset.

My question is: how should I restructure the obtained (42427, 128) embeddings into expert embeddings of shape (165496, 13631), and likewise for skills and locations, so that I can run our NN?

Kindly suggest.

hosseinfani commented 1 year ago

@karan96 You have to create one embedding per team by averaging the embeddings of the skills in that team. Then, the input matrix would be (#Teams × 128). In OpeNTF, if you add the _emb option to the name of a baseline, it tries to find the embeddings of teams' skills in the input of the nn:

https://github.com/fani-lab/OpeNTF/blob/main/data/preprocessed/uspt/toy.patent.tsv/skill.docs.pkl

You simply need to replace the embedding file at

https://github.com/fani-lab/OpeNTF/blob/45aa32b1e32edc906d926c7f841a4ec089f34d18/src/main.py#L117
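For reference, a minimal sketch of that averaging step (assuming `node_emb` is the learned node-embedding tensor and `team_skills` maps each team to the remapped ids of its skills; both names are hypothetical):

```python
import torch

def team_input_matrix(node_emb, team_skills):
    """node_emb: (num_nodes, 128) tensor of GNN embeddings.
    team_skills: list of per-team lists of skill node ids.
    Returns the (#teams, 128) input matrix: each row is the mean
    embedding of the skills required by that team."""
    return torch.stack([node_emb[torch.tensor(ids)].mean(dim=0)
                        for ids in team_skills])
```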

karan96 commented 1 year ago

Greetings Dr. @hosseinfani, as discussed, I ran the location + skills graph embedding with the following hyperparameters. The skills-only graph embedding run is currently in progress; I will update you once it is done.

```python
'bnn': {'l': [100],
        'lr': 0.1,
        'b': 4096,
        'e': 20,
        'nns': 3, 'ns': 'uniform',
        's': 1
        },
'emb': {'d': 100, 'e': 100, 'dm': 1, 'w': 1},
'nfolds': 5,
'train_test_split': 0.85
```

Here are the results:


| metric | mean |
| -- | -- |
| P_2 | 0.005494 |
| P_5 | 0.004698 |
| P_10 | 0.003923 |
| recall_2 | 0.002409 |
| recall_5 | 0.0051 |
| recall_10 | 0.008415 |
| ndcg_cut_2 | 0.005628 |
| ndcg_cut_5 | 0.005662 |
| ndcg_cut_10 | 0.006979 |
| map_cut_2 | 0.001976 |
| map_cut_5 | 0.00292 |
| map_cut_10 | 0.003596 |
| aucroc | 0.601312 |

hosseinfani commented 1 year ago

@karan96 thank you. Now we need the results without location, with the same run settings.

karan96 commented 1 year ago

Greetings @hosseinfani, here are the results of the runs we did this weekend.

GNN embeddings with loc and skill:


| metric | mean |
| -- | -- |
| P_2 | 0.011424 |
| P_5 | 0.010253 |
| P_10 | 0.008126 |
| recall_2 | 0.005478 |
| recall_5 | 0.01175 |
| recall_10 | 0.018565 |
| ndcg_cut_2 | 0.011544 |
| ndcg_cut_5 | 0.012323 |
| ndcg_cut_10 | 0.015117 |
| map_cut_2 | 0.004262 |
| map_cut_5 | 0.006315 |
| map_cut_10 | 0.007495 |
| aucroc | 0.796098 |

GNN embeddings with only skills:


| metric | mean |
| -- | -- |
| P_2 | 0.011919 |
| P_5 | 0.010248 |
| P_10 | 0.008383 |
| recall_2 | 0.005681 |
| recall_5 | 0.011572 |
| recall_10 | 0.018667 |
| ndcg_cut_2 | 0.011973 |
| ndcg_cut_5 | 0.012347 |
| ndcg_cut_10 | 0.01529 |
| map_cut_2 | 0.004255 |
| map_cut_5 | 0.006221 |
| map_cut_10 | 0.007627 |
| aucroc | 0.797203 |

The results of these two methods are very close to each other. How should we interpret this?

hosseinfani commented 1 year ago

@karan96 we discussed the results in the lab. We need to compare them with our other baselines and analyze everything together.

karan96 commented 1 year ago

@hosseinfani Based on our discussion, I verified the code that generates the adjacency matrix; it works as expected, so we are good there. I then experimented with different settings for walk_length and num_walks_per_node, and it turns out these two parameters control how much of the graph the random walks cover, which is why we were not getting embeddings for all the nodes. During my experiments, coverage went from about 40,000 node embeddings to 50k, then 60k, and finally 83310 with walk_length = 64 and num_walks_per_node = 40. Our total number of nodes is 83380, so this is as far as we can go with these settings. Kindly suggest how I should fill in the remaining 70 node embeddings; I was planning to replicate the last few embeddings to reach the full 83380 and then train the model with that. Please comment your views. One alternative I could try is sketched below.
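One alternative to replicating the last rows (a sketch, assuming the walk-based model returns a dict from node id to vector; names hypothetical): give each uncovered node the mean of all learned embeddings instead of a copy of an arbitrary row.

```python
import numpy as np

def fill_missing(emb, num_nodes=83380, dim=128):
    """emb: dict {node_id: np.ndarray of shape (dim,)} for the 83310
    covered nodes. Uncovered nodes fall back to the mean embedding."""
    mean_vec = np.stack(list(emb.values())).mean(axis=0)
    matrix = np.empty((num_nodes, dim), dtype=np.float32)
    for node_id in range(num_nodes):
        matrix[node_id] = emb.get(node_id, mean_vec)
    return matrix
```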

hosseinfani commented 1 year ago

@karan96

karan96 commented 1 year ago

@hosseinfani I did the check; there is no bug and the code works as expected. In fact, the code is almost the same as in the stellargraph readme and Radin's work. I also applied a few suggestions from the internet:

To ensure that embeddings are generated for all nodes, you can try the following steps:

hosseinfani commented 1 year ago

I'm saying: how do you know which node is missing an embedding? What is the mapping between the node id and the embedding row id?
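For instance, if the pipeline follows the stellargraph/gensim node2vec recipe mentioned above, a check along these lines (a sketch; names hypothetical) would show exactly which node ids lack embeddings, since gensim's wv.index_to_key is the row-id-to-node-id mapping:

```python
def missing_nodes(w2v_model, num_nodes):
    """w2v_model: a trained gensim Word2Vec where node ids were fed in as
    string tokens; row i of w2v_model.wv.vectors belongs to the node
    w2v_model.wv.index_to_key[i]. Returns the uncovered node ids."""
    covered = {int(token) for token in w2v_model.wv.index_to_key}
    return sorted(set(range(num_nodes)) - covered)
```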

karan96 commented 1 year ago

Greetings @hosseinfani,

Here are the execution results of the 4 GNN models:


| metric | BNN_EMB_GNN_LOC_META | BNN_EMB_GNN_META | BNN_EMB_GNN_LOC | BNN_EMB_GNN |
| -- | -- | -- | -- | -- |
| P_2 | 0.005458207 | 0.008181269 | 0.007121853 | 0.010723061 |
| P_5 | 0.00432145 | 0.007413494 | 0.006809265 | 0.009609668 |
| P_10 | 0.003795368 | 0.005989124 | 0.005849748 | 0.007568983 |
| recall_2 | 0.002423175 | 0.003984825 | 0.002956208 | 0.005282661 |
| recall_5 | 0.004677651 | 0.008583199 | 0.007491775 | 0.011071126 |
| recall_10 | 0.008143815 | 0.013528369 | 0.012680149 | 0.017002206 |
| ndcg_cut_2 | 0.005663308 | 0.008191296 | 0.007240356 | 0.010867088 |
| ndcg_cut_5 | 0.005358375 | 0.008873825 | 0.007864861 | 0.011583322 |
| ndcg_cut_10 | 0.006796761 | 0.01098033 | 0.009982635 | 0.014090142 |
| map_cut_2 | 0.00203048 | 0.003070156 | 0.002457458 | 0.004061078 |
| map_cut_5 | 0.002806325 | 0.004521555 | 0.003916646 | 0.005879638 |
| map_cut_10 | 0.003530212 | 0.005489386 | 0.004970012 | 0.007096578 |
| aucroc | 0.571481506 | 0.734964306 | 0.628691797 | 0.759530091 |

Points to note:

  1. GNN without meta-paths gives overall better results than the models whose embeddings include meta-paths.
  2. The GNN embedding without loc (skills only) is the best-performing of the four models we trained.

Best, Karan

karan96 commented 1 year ago

@hosseinfani, this is what the data currently looks like; let me know if we want to improve the graph in any way. Below is the small piece of code I used to generate this plot. The file contains expert-to-location records, so it was easy to build the graph from it. The data is clearly skewed; I will work on synthetically adding data to other locations to make it more symmetric.

```python
import matplotlib.pyplot as plt

# count experts per location and plot the (skewed) distribution
counts = df.groupby(['loc']).size().reset_index(name='counts')
plt.bar(counts['loc'], counts['counts'], width=0.6)
```

Skewness of Experts in Locations

karan96 commented 1 year ago

@hosseinfani This is what the data looks like after adding locations synthetically while keeping the number of teams the same. I will proceed to train the model on this data; one possible balancing scheme is sketched below for illustration.

Skewness of Experts in Locations
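For illustration, one hypothetical way the balancing could be done (a sketch, not the exact code used): reassign expert locations uniformly at random, which flattens the per-location counts while leaving the teams themselves untouched.

```python
import random

def balance_locations(expert_loc, locations, seed=0):
    """expert_loc: dict expert_id -> location; locations: all location labels.
    Returns a copy where each expert gets a uniformly random location, so
    per-location counts even out while team membership stays the same."""
    rng = random.Random(seed)
    return {expert: rng.choice(locations) for expert in expert_loc}
```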

karan96 commented 1 year ago

@hosseinfani I have started training the next 4 models on the updated, synthetically generated location data.

karan96 commented 1 year ago

@hosseinfani Here are the results of the two meta-path models, one with loc and one without. The other two models are currently running; I will update you with their results soon.


| metric | BNN_EMB_GNN_LOC_META | BNN_EMB_GNN_META |
| -- | -- | -- |
| P_2 | 0.010513595 | 0.004712991 |
| P_5 | 0.009029607 | 0.004197382 |
| P_10 | 0.007216918 | 0.003598792 |
| recall_2 | 0.004977132 | 0.00230482 |
| recall_5 | 0.010437657 | 0.004999704 |
| recall_10 | 0.016568125 | 0.00847207 |
| ndcg_cut_2 | 0.010502656 | 0.004667413 |
| ndcg_cut_5 | 0.010879248 | 0.00502097 |
| ndcg_cut_10 | 0.013424437 | 0.00656301 |
| map_cut_2 | 0.00381165 | 0.001721471 |
| map_cut_5 | 0.005511156 | 0.002520231 |
| map_cut_10 | 0.006587802 | 0.003084576 |
| aucroc | 0.769427051 | 0.684048404 |

hosseinfani commented 1 year ago

@karan96 Thank you. So our hypothesis that the highly skewed distribution of loc over teams contributes negatively to the results seems to hold. Please update me when you get the results without meta-paths. Also, rerun the distribution figure at other granularities: city and province.

karan96 commented 1 year ago

@hosseinfani The location data gets worse when we include province: there are more nan values in the index than actual locations. I randomly checked some records and found that the data itself has blank values for the state field. The i2l index shows around 123 locations now, compared to 70 earlier, and 70 of these 123 contain nan, meaning no state is present. Code used to check this: sum(1 for k in l2i.keys() if 'nan' in k) returns 70. This indicates that the USPT dataset is missing the state value for the majority of records, making further experimentation at this granularity redundant. I'll proceed with the city granularity and see if it can be used for further experimentation. Kindly comment your views.

karan96 commented 1 year ago

@hosseinfani I pushed all GNN implementations to the USPT branch. Let me know if you face any issues while running the code and I will fix them. https://github.com/fani-lab/OpeNTF/commit/390ead3713b14eef1e0a85f0079de7e9531d7c9a

karan96 commented 1 year ago

@hosseinfani This is how the locations are spread across the dataset when cities are included.

image

karan96 commented 1 year ago

@hosseinfani And this is the distribution after the synthetic data is added: image

karan96 commented 1 year ago

@hosseinfani Here are the rest of the results from this experimentation:


| metric | BNN_EMB_GNN_LOC | BNN_EMB_GNN |
| -- | -- | -- |
| P_2 | 0.012862034 | 0.013224572 |
| P_5 | 0.01056999 | 0.010590937 |
| P_10 | 0.008132931 | 0.008261027 |
| recall_2 | 0.006192327 | 0.006612731 |
| recall_5 | 0.011897092 | 0.012068724 |
| recall_10 | 0.017740504 | 0.018309857 |
| ndcg_cut_2 | 0.013223923 | 0.014014893 |
| ndcg_cut_5 | 0.013226362 | 0.013720371 |
| ndcg_cut_10 | 0.015547957 | 0.016272465 |
| map_cut_2 | 0.005030282 | 0.005593921 |
| map_cut_5 | 0.006981463 | 0.007414413 |
| map_cut_10 | 0.008176114 | 0.008641789 |
| aucroc | 0.792385037 | 0.794842819 |