Issue in Implementation of Naive Approach for Location

karan96 commented 2 years ago

Greetings @hosseinfani ,

I am trying to create one-hot encoded vectors of locations to use in my naive approach with an NN, and I am stuck at this step: how do I create a one-hot encoded array given the structure of l2m below? Skills are connected to one patent, but a location might be shared between members of a team. This creates ambiguity, so the existing logic cannot be reused for one-hot encoding locations and members. Any help would be appreciated.

This is m2l len = 24 {'fl:j_ln:lynch-38_Jay_Lynch': ('Arvada', 'CO', 'US'), ' fl:a_ln:braunberger-1_Alfred S._Braunberger': ('Sequim', 'WA', 'US'), ' fl:b_ln:braunberger-2_Beau M._Braunberger': ('Upland', 'CA', 'US'), 'fl:a_ln:sessa-1_Anthony J._Sessa': ('Quogue', 'NY', 'US'), 'fl:j_ln:albertson-4_Jacob_Albertson': ('Newton', 'MA', 'US'), 'fl:k_ln:arnold-2_Kenneth C. R. C._Arnold': ('Ellicott City', 'MD', 'US'), 'fl:m_ln:paolini-3_Michael A._Paolini': ('Austin', 'TX', 'US'), 'fl:s_ln:goldman-15_Steven D._Goldman': ('County of Chesterfield', 'MO', 'US'), 'fl:r_ln:williams-203_Richard C._Williams': ('Saratoga', 'CA', 'US'), 'fl:m_ln:culbert-1_Michael_Culbert': ('Monte Sereno', 'CA', 'US'), 'fl:k_ln:cox-31_Keith_Cox': ('Sunnyvale', 'CA', 'US'), 'fl:j_ln:de cesare-2_Josh P._de Cesare': ('Campbell', 'CA', 'US'), 'fl:d_ln:radcliffe-4_David_Radcliffe': ('Hood River', 'OR', 'US'), 'fl:d_ln:huang-8_Daisie Iris_Huang': ('Oakland`', 'CA', 'US'), 'fl:d_ln:falkenburg-1_Dave Robbins_Falkenburg': ('San Jose Almadeu', 'CA', 'US'), 'fl:b_ln:howard-16_Brian D._Howard': ('Portola Valley', 'CA', 'US'), 'fl:g_ln:freeman-9_Gary A._Freeman': ('Waltham', 'MA', 'US'), 'fl:j_ln:brewer-17_James E._Brewer': ('Lino Lakes', 'MN', 'US'), 'fl:b_ln:gade-3_Bhargavaram B._Gade': ('Irving', 'TX', 'US'), 'fl:m_ln:petach-2_Matthew Nicholas_Petach': ('San Jose', 'CA', 'US'), 'fl:s_ln:prathaban-3_Selvaraj Rameshwara_Prathaban': ('Coimbastore', nan, 'IN'), 'fl:j_ln:albus-3_James S._Albus': ('Kensington', 'MD', 'US'), '4o6g2rskzeyvnezyqwpxamwdz_Joannes G._van den Hanenberg': ('PT Eindhoven', nan, 'NL'), 'nn9x8usd2khsn0yk36c3g0vzm_Frederikus J._de Munnik': ('PT Eindhoven', nan, 'NL')}

This is l2m len = 23 {('Arvada', 'CO', 'US'): ['fl:j_ln:lynch-38_Jay_Lynch'], ('Sequim', 'WA', 'US'): ['fl:a_ln:braunberger-1_Alfred S._Braunberger'], ('Upland', 'CA', 'US'): ['fl:b_ln:braunberger-2_Beau M._Braunberger'], ('Quogue', 'NY', 'US'): ['fl:a_ln:sessa-1_Anthony J._Sessa'], ('Newton', 'MA', 'US'): ['fl:j_ln:albertson-4_Jacob_Albertson'], ('Ellicott City', 'MD', 'US'): ['fl:k_ln:arnold-2_Kenneth C. R. C._Arnold'], ('Austin', 'TX', 'US'): ['fl:m_ln:paolini-3_Michael A._Paolini'], ('County of Chesterfield', 'MO', 'US'): ['fl:s_ln:goldman-15_Steven D._Goldman'], ('Saratoga', 'CA', 'US'): ['fl:r_ln:williams-203_Richard C._Williams'], ('Monte Sereno', 'CA', 'US'): ['fl:m_ln:culbert-1_Michael_Culbert'], ('Sunnyvale', 'CA', 'US'): ['fl:k_ln:cox-31_Keith_Cox'], ('Campbell', 'CA', 'US'): ['fl:j_ln:de cesare-2_Josh P._de Cesare'], ('Hood River', 'OR', 'US'): ['fl:d_ln:radcliffe-4_David_Radcliffe'], ('Oakland`', 'CA', 'US'): ['fl:d_ln:huang-8_Daisie Iris_Huang'], ('San Jose Almadeu', 'CA', 'US'): ['fl:d_ln:falkenburg-1_Dave Robbins_Falkenburg'], ('Portola Valley', 'CA', 'US'): ['fl:b_ln:howard-16_Brian D._Howard'], ('Waltham', 'MA', 'US'): ['fl:g_ln:freeman-9_Gary A._Freeman'], ('Lino Lakes', 'MN', 'US'): ['fl:j_ln:brewer-17_James E._Brewer'], ('Irving', 'TX', 'US'): ['fl:b_ln:gade-3_Bhargavaram B._Gade'], ('San Jose', 'CA', 'US'): ['fl:m_ln:petach-2_Matthew Nicholas_Petach'], ('Coimbastore', nan, 'IN'): ['fl:s_ln:prathaban-3_Selvaraj Rameshwara_Prathaban'], ('Kensington', 'MD', 'US'): ['fl:j_ln:albus-3_James S._Albus'], ('PT Eindhoven', nan, 'NL'): None}

hosseinfani commented 2 years ago

Hi @karan96, given a team (patent), you can have a vector of size |all unique locations| and then set the elements for the members' locations to 1. So, it's not a one-hot vector but an occurrence vector. For example,

p1 = (<s1,s2>,<m1-l3, m2-l5, m3-l9>) ==> input [skills:[0,1,1,0,0,...], locations:[0,0,0,1,0,1,0,0,..,1,0,0,..]] to output [members:[0,1,1,1,0,0,...]]

I started the indexes from 0.
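
A minimal sketch of the occurrence vector described above, in plain Python. The vocabulary sizes and the `*2index` dictionaries are hypothetical toy values chosen only to reproduce the p1 example; they are not taken from the project's code. Note that two members sharing a location simply set the same element to 1 once.

```python
def occurrence_vector(items, item2index):
    """Return a 0/1 list with a 1 at the index of every item that occurs."""
    vec = [0] * len(item2index)
    for item in items:
        vec[item2index[item]] = 1  # setting the same index twice is harmless
    return vec


# Hypothetical toy vocabularies; indexes start from 0 as in the example.
skill2index = {f"s{i}": i for i in range(5)}
location2index = {f"l{i}": i for i in range(10)}
member2index = {f"m{i}": i for i in range(6)}

# p1 = (<s1,s2>, <m1-l3, m2-l5, m3-l9>)
skills = ["s1", "s2"]
member_locations = {"m1": "l3", "m2": "l5", "m3": "l9"}

skill_part = occurrence_vector(skills, skill2index)
location_part = occurrence_vector(member_locations.values(), location2index)

x = skill_part + location_part  # input: skills followed by locations
y = occurrence_vector(member_locations.keys(), member2index)  # output: members
```

Here `x` is the concatenation of the skill and location occurrence vectors and `y` is the member occurrence vector for the same team.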

karan96 commented 2 years ago

Hi @hosseinfani, I tried implementing the idea you gave above and came up with the following occurrence vector:

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Is this what we are looking for? The code I am using to generate this is as follows: -

              for ixl, val in enumerate(list(indexes['i2l'].keys())):
                  if val in teams[values].members_details:
                      x[ix, ixl] = 1
                  elif np.nan in val:
                      x[ix, ixl] = 1

The above code generates the array I mentioned but ignores the records that have nan in them. The nan comes from the raw dataset, where the state of a particular member is unknown. For example, the last row of the array is all 0s, whereas it should have two 1s towards the end. The value this code is ignoring is [('PT Eindhoven', nan, 'NL'), ('PT Eindhoven', nan, 'NL')]. In line number 3 of the above code I am trying to find val (a location) in a team's locations; since we cannot match nan with nan, I believe that is why the last row is all 0s. My questions are: 1. Is this the format we are looking for in our occurrence vector? 2. How do I handle records that have nan in them?

Here each row represents one team and each column represents the respective team's location.

I hope I was able to explain my doubt.
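
One possible way to handle the nan records, sketched under assumptions: replace nan with a placeholder string before building the location index, so that lookups no longer depend on nan comparisons (nan never compares equal to nan, and two tuples holding different nan objects may compare unequal). The `MISSING` value and the `normalize_location` helper are hypothetical names used only for illustration; the location tuples are taken from the thread.

```python
import math

MISSING = "unknown"  # hypothetical placeholder for an unknown state


def normalize_location(location):
    """Replace any nan part of a (city, state, country) tuple with MISSING."""
    return tuple(
        MISSING if isinstance(part, float) and math.isnan(part) else part
        for part in location
    )


all_locations = [
    ("San Jose", "CA", "US"),
    ("PT Eindhoven", float("nan"), "NL"),
]
team_locations = [
    ("PT Eindhoven", float("nan"), "NL"),
    ("PT Eindhoven", float("nan"), "NL"),
]

# Build the index on normalized tuples so lookups never involve nan.
location2index = {}
for loc in all_locations:
    location2index.setdefault(normalize_location(loc), len(location2index))

row = [0] * len(location2index)
for loc in team_locations:
    row[location2index[normalize_location(loc)]] = 1
```

With the normalized keys, the team whose members are both in PT Eindhoven gets a 1 in that location's column instead of an all-zero row.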

karan96 commented 2 years ago

@hosseinfani Hi Dr. Fani, what if I label encode each location? For example, right now we have each member's location in the form of a tuple, such as ('Campbell', 'CA', 'US'), and we could turn this into Campbell_CA_US. This would help remove the ambiguity we are facing, since we need to account for members of the same team sharing a location. If we do this, each location gets a unique label, and if a location is duplicated across members we do not have to encode it again. Once the label encoding is done, we can reuse the existing logic for one-hot encoding the skills w.r.t. members. Kindly let me know your views on this.
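
The label-encoding idea above can be sketched as follows. The `location_label` helper and the sorted index are illustrative assumptions, not the project's code; the three members are taken from the m2l dump earlier in the thread. Since `str()` of a nan is the string 'nan', the labels also avoid the nan comparison problem.

```python
def location_label(location):
    """Join a (city, state, country) tuple into a single string label."""
    return "_".join(str(part) for part in location)


m2l = {
    "fl:j_ln:de cesare-2_Josh P._de Cesare": ("Campbell", "CA", "US"),
    "4o6g2rskzeyvnezyqwpxamwdz_Joannes G._van den Hanenberg": ("PT Eindhoven", float("nan"), "NL"),
    "nn9x8usd2khsn0yk36c3g0vzm_Frederikus J._de Munnik": ("PT Eindhoven", float("nan"), "NL"),
}

# One label per member; members in the same place share the same label.
member_label = {member: location_label(loc) for member, loc in m2l.items()}

# One index per unique label; sorting only makes the order deterministic.
label2index = {
    label: i for i, label in enumerate(sorted(set(member_label.values())))
}
```

Both PT Eindhoven members map to the same label and therefore to the same column of the location vector.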

hosseinfani commented 2 years ago

@karan96 Hi Karan, tomorrow we'll review and discuss your code in the lab. I'm thinking of different granularities: city, state, and country. We can experiment on them individually. For now, let's complete the state level.
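
A rough sketch of how the three granularities could be derived from the existing (city, state, country) tuples. The `at_granularity` helper is hypothetical, and keeping the country next to the state (so that equal state codes in different countries stay distinct) is an assumption, not something decided in this thread.

```python
def at_granularity(location, level):
    """Project a (city, state, country) tuple onto the requested level."""
    city, state, country = location
    if level == "city":
        return (city, state, country)
    if level == "state":
        return (state, country)
    if level == "country":
        return (country,)
    raise ValueError(f"unknown level: {level}")


locations = [
    ("San Jose", "CA", "US"),
    ("Sunnyvale", "CA", "US"),
    ("Austin", "TX", "US"),
]

# Number of unique location columns at each granularity.
unique_counts = {
    level: len({at_granularity(loc, level) for loc in locations})
    for level in ("city", "state", "country")
}
```

The coarser the level, the fewer columns the location part of the input vector has.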

karan96 commented 2 years ago

Greetings Dr. @hosseinfani, I have one question. I am generating x (for locations) and y (for members) as lil matrices. When I stack them horizontally, the resulting matrix is also a lil matrix. Do I need to perform bucketing here? Is it even required? As I understand it, the bucketing was done to ensure the dense array does not take up much space: as soon as we hit the bucket size, the bucket is assigned to the lil matrix, which is a sparse matrix. Since in my case no dense matrix is involved and the team id, location, and member assignments are all done using lil matrices, theoretically the size should not be a problem, right? I will test it in the lab tomorrow as well.

hosseinfani commented 2 years ago

@karan96 As I explained before, the final matrix is sparse. Filling the final matrix row by row is time consuming when it is sparse. So, we create a small dense matrix as the bucket. Filling the rows of the bucket is fast because it's dense.

The bucket is not there to save memory but to reduce the time needed to fill the final sparse matrix.
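
The bucketing idea described above can be sketched as follows, assuming NumPy and SciPy are available. The `fill_with_buckets` function, its arguments, and the toy rows are hypothetical and only illustrate the pattern of filling a small dense bucket and copying it into the sparse matrix in one assignment; this is not the project's actual implementation.

```python
import numpy as np
from scipy.sparse import lil_matrix


def fill_with_buckets(rows, n_cols, bucket_size):
    """rows[i] lists the column indexes that should be 1 in row i."""
    final = lil_matrix((len(rows), n_cols), dtype=np.int8)
    for start in range(0, len(rows), bucket_size):
        chunk = rows[start:start + bucket_size]
        # Small dense bucket: cheap to fill element by element.
        bucket = np.zeros((len(chunk), n_cols), dtype=np.int8)
        for i, cols in enumerate(chunk):
            for col in cols:
                bucket[i, col] = 1
        # One block assignment into the sparse matrix per bucket.
        final[start:start + len(chunk), :] = bucket
    return final


rows = [[0], [1, 2], [4]]
filled = fill_with_buckets(rows, n_cols=5, bucket_size=2)
```

The bucket size trades memory for speed: the dense bucket only ever holds `bucket_size` rows, while the sparse matrix is touched once per bucket rather than once per element.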

karan96 commented 2 years ago

@hosseinfani This is the result of the entire code run on toy dataset. This means that the code is running. But yesterday the code was stuck at creating pickle file for vecs. I'll see what's the reason today.

karan96 commented 2 years ago

@hosseinfani Hello Dr. Fani,

I have tried multiple runs of the FNN model with no negative sampling for our naive implementation on the Compute Canada cluster, but every time the code gets stuck at this step. I am not sure why this is happening. I verified the settings with Arman and nothing is different; he says his code is running perfectly fine. I am not requesting a lot of resources or many GPU nodes.

hosseinfani commented 2 years ago

@VaghehDashti You've seen this before and know the reason, I believe. Please help Karan.

@karan96
You don't need a GPU for this step. I assume you know what the step is in our pipeline!

VaghehDashti commented 2 years ago

@karan96 Hi, can you share the .out file generated from running the code? (You need to convert it to .txt before uploading.) There are no errors shown here! As far as I remember, this is the start of the sparse matrix generation, and this is the log for the multiple processes. My best guess is that it is throwing an out-of-memory error. I'd suggest using something like this when creating the job:

              sbatch --account=def-hfani --mem=96000MB --time=4320 cc.sh

I don't know why, but at least for me some of the job arguments on Compute Canada must be passed through the shell; they were not read from the .sh file!

VaghehDashti commented 2 years ago

Also, you are running the pipeline on the uspt dataset, right? I recently tried to run the pipeline on uspt with the following command, and it threw an out-of-memory error that you can see in the attached log (slurm-40436007.txt):

              sbatch --account=def-hfani --mem=64000MB --time=2880 cc.sh

I'm now running it with 96GB to see what happens.

karan96 commented 2 years ago

@VaghehDashti Hi, I have attached the file below: slurm-63776570.txt. The error that you might see at the end of this file is because the job hit its 23-hour time limit while stuck at the creation of teamsvecs.pkl. I will try again with the parameters you suggested above.

karan96 commented 2 years ago

Update: The execution on sharcnet did not complete. I have tested multiple runs of the code in serial. The latest run, which is currently running, is for 7 days. I'll continue to update here.

karan96 commented 2 years ago

@hosseinfani I was able to resolve the multiprocessing error we were getting on sharcnet. I resolved it using:

              from multiprocessing import get_context
              with get_context("spawn").Pool() as p:

Currently the code is at 133000/164225 instances. I will update once it completes.
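
For context, the spawn-based pool mentioned above can be sketched as a complete function. The `load_chunk` and `load_all` names and the toy per-chunk work are hypothetical stand-ins, not the pipeline's code, and the thread does not establish why the spawn start method resolved the error.

```python
from multiprocessing import get_context


def load_chunk(chunk):
    """Stand-in for the per-chunk work; here it just measures each string."""
    return [len(item) for item in chunk]


def load_all(chunks, processes=2):
    """Process the chunks in a pool whose workers use the spawn start method."""
    # "spawn" starts every worker in a fresh interpreter instead of forking
    # the parent process, so workers do not inherit the parent's memory.
    with get_context("spawn").Pool(processes=processes) as pool:
        return pool.map(load_chunk, chunks)
```

When this is run as a script, `load_all` should be called under an `if __name__ == "__main__":` guard, because spawn re-imports the main module in every worker, and the worker function must be defined at module level so it can be pickled.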

karan96 commented 2 years ago

@hosseinfani Greetings,

I was able to make progress on running the code that creates the teamsvecs.pkl file, but it still does not complete and now gets stuck at:


Loading 29500/164225 instances by <SpawnProcess name='SpawnPoolWorker-42' parent=185944 started daemon>! 25533.12321782112
Loading 30000/164225 instances by <SpawnProcess name='SpawnPoolWorker-42' parent=185944 started daemon>! 25945.008811950684
Loading 30500/164225 instances by <SpawnProcess name='SpawnPoolWorker-42' parent=185944 started daemon>! 26357.911100387573
Loading 31000/164225 instances by <SpawnProcess name='SpawnPoolWorker-42' parent=185944 started daemon>! 26777.38530611992
Loading 31500/164225 instances by <SpawnProcess name='SpawnPoolWorker-42' parent=185944 started daemon>! 27195.9788980484
Loading 32000/164225 instances by <SpawnProcess name='SpawnPoolWorker-42' parent=185944 started daemon>! 27616.99017381668
Loading 32500/164225 instances by <SpawnProcess name='SpawnPoolWorker-42' parent=185944 started daemon>! 28031.741077899933

I tried multiple things like changing the code a bit, increased time, and increased memory but nothing seemed to help and now I require your assistance in this. Kindly let me know a suitable time in lab in which you can connect. Thanks.

hosseinfani commented 2 years ago

@karan96 I'm in the lab today.

fani-lab / OpeNTF

Issue in Implementation of Naive Approach for Location #157