fani-lab / OpeNTF

Neural machine learning methods for Team Formation problem.

OpenNMT for Team Formation on Sparse Matrix #79

Open hosseinfani opened 2 years ago

hosseinfani commented 2 years ago

@lilizhu1 @xcyyygithub @Pax636 Now that you have successfully run OpenNMT on its sample en-de translation, we want you to run it on our sparse matrix. So, we are given a sparse matrix in which each row is a team, like:

[id][s0,s1,s2,s3,s4,s5,s6,s7] [m1,m2,m3,m4,m5]
[5 ][0 , 0, 0, 1, 0, 0, 1, 0] [1 , 1, 0, 0, 1] #we have a team with id=5, where the skills are {s3,s6} and the members are {m1,m2,m5}

We want to train a translator between skills and members. This translation assumes that the source language speaks in skill indexes and the target language in member indexes. For the team above, the source sentence would be the skill indexes {3, 6} and the target sentence the member indexes {1, 2, 5}.

So, the sparse matrix is already available for the dblp dataset (teams of researchers) in ./data/preprocessed/dblp/toy.dblp.v12.json/teamsvecs.pkl

You have to open the file and load the sparse matrix like in: https://github.com/fani-lab/neural_team_formation/blob/97b9fc88f3bee2dc454ed22b51b561b81a3a0881/src/cmn/team.py#L91

Then, write Python code to generate the pairs of (skill => member) as a dataset for the opennmt library.
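For example, something along these lines (a minimal sketch, assuming teamsvecs.pkl stores a dict of scipy sparse matrices keyed by 'skill' and 'member' with one row per team; the output filenames are placeholders):

import pickle

# Load the sparse matrices (assumed structure: {'skill': <sparse>, 'member': <sparse>}).
with open('./data/preprocessed/dblp/toy.dblp.v12.json/teamsvecs.pkl', 'rb') as f:
    teamsvecs = pickle.load(f)

skills, members = teamsvecs['skill'], teamsvecs['member']
with open('src.dataset.txt', 'w') as src, open('tgt.dataset.txt', 'w') as tgt:
    for i in range(skills.shape[0]):
        # nonzero()[1] gives the column indexes of the 1s in row i
        src.write(' '.join(str(j) for j in skills[i].nonzero()[1]) + '\n')
        tgt.write(' '.join(str(j) for j in members[i].nonzero()[1]) + '\n')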

Please let me know if you have any questions here.

willwang636 commented 2 years ago

How can I properly read the pickle in text form? I'm a bit confused about how to manipulate the result of pickle.load(). If I simply use print(np.matrix(teams)) from numpy, it gives a blank result.

Here is the code I've been practicing with; I put it in the same folder as the pickled data and import it. https://github.com/Pax636/4990_IMDB/blob/main/readpkl.py

Thank you,

hosseinfani commented 2 years ago

Hi, it seems the problem is with the file, or the code that created the pickle file is outdated. I will send you an updated pickle file.

hosseinfani commented 2 years ago

@Pax636 I checked the files; they're OK. Make sure you're using the files in the following path:

https://github.com/fani-lab/neural_team_formation/tree/main/data/preprocessed/dblp/toy.dblp.v12.json

willwang636 commented 2 years ago

Hi @hosseinfani, do you know what might cause the error 'ModuleNotFoundError: No module named 'cmn'' when trying to call pickle.load()?

Thanks,

hosseinfani commented 2 years ago

Hi @Pax636, teams.pkl contains the team objects serialized into binary; the class definitions of those objects are in cmn.

You don't need to open that file. You need the sparse matrix in teamsvecs.pkl and probably indexes.pkl.

hosseinfani commented 2 years ago

How to split a range of row numbers into three parts randomly:

https://www.malicksarr.com/split-train-test-validation-python/ https://github.com/fani-lab/neural_team_formation/blob/daf2a93753175bbb8b5d2d83e4ae6c4f6655a822/src/main.py#L13
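For instance, a sketch applying scikit-learn's train_test_split twice to get a 70/15/15 split of the row ids (n_teams is a placeholder for the number of teams):

from sklearn.model_selection import train_test_split

n_teams = 20  # placeholder: number of rows (teams) in the sparse matrix
row_ids = list(range(n_teams))
train_idx, rest_idx = train_test_split(row_ids, train_size=0.7, random_state=7, shuffle=True)
valid_idx, test_idx = train_test_split(rest_idx, train_size=0.5, random_state=7, shuffle=True)  # 15% / 15%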

willwang636 commented 2 years ago

Our code generates the pairs of (skill => member) as a dataset from ./data/preprocessed/dblp/toy.dblp.v12.json/teamsvecs.pkl

The code is here https://github.com/Pax636/4990_IMDB/blob/main/rpk.py

The results are src-test.txt and tgt-test.txt (see screenshot).

We haven't tested the code on the large dataset, and we don't know where to find the complete dataset. Would you mind pointing us to it?

Now we are working on splitting the src-test.txt and tgt-test.txt.

willwang636 commented 2 years ago

We successfully split the file, but then got stuck on adding content to a specific row in the newly created src.train.txt file. The problem occurs at line 129 of https://github.com/Pax636/4990_IMDB/blob/main/split.py

A result of splitting src-test.txt is shown in the attached screenshot.

hosseinfani commented 2 years ago

@Pax636 Let's rename src-test.txt and tgt-test.txt to src.dataset.txt and tgt.dataset.txt. Then split them into three subsets:

1) src.train.txt and tgt.train.txt
2) src.valid.txt and tgt.valid.txt (I think opennmt needs this, right?)
3) src.test.txt and tgt.test.txt

About the last picture, why did you split into test1 and test2? Also, the second line of logs shows the index of the numbers, right? If so, you don't need them.

About splitting the actual content based on these numbers, you can read the actual src.dataset.txt and tgt.dataset.txt into a pandas dataframe and pass the numbers of each split as indexes to iloc. For example:

import pandas as pd

train_split_idx = [1, 5, 15, 19]  # row ids of the train split

# one team per line; header=None keeps the first line as data
df_src = pd.read_csv('src.dataset.txt', header=None)
df_train = df_src.iloc[train_split_idx]  # select only the train rows
df_train.to_csv('src.train.txt', header=False, index=False)
willwang636 commented 2 years ago

About test1 and test2: I thought we needed to split src-test.txt into a structure like the one at https://www.malicksarr.com/split-train-test-validation-python/, which also mentions that there are three parts (train, test, and validation sets).

Regarding the second line of logs, if you mean the lines in the screenshot: they are rows of skill indexes from src-test.txt, and I put them all in an array.

What I was doing was putting the skill indexes into one array and the ids into another array, then splitting them together with train_70X, test_30X, train_70Y, test_30Y = train_test_split(id, skill, train_size=0.7, random_state=7, shuffle=True) to remember the row indexes, and putting them back into 3 separate text files later.

I will try to use a pandas DataFrame since it can read data directly from a text file like src-test.txt.

Thanks,

hosseinfani commented 2 years ago

@Pax636 You did it right, but the main file is the dataset file.

About the second part of your message, you don't need to do that. Yes, try pandas.DataFrame and pass the randomly split row ids to it; it selects only those rows. Here is an example:
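Something along these lines (a sketch; the row ids per split are placeholders standing in for the randomly generated splits):

import pandas as pd

splits = {'train': [1, 5, 15, 19], 'valid': [3, 7], 'test': [0, 2]}  # placeholder row ids per split
df_src = pd.read_csv('src.dataset.txt', header=None)  # one team per line
for name, idx in splits.items():
    df_src.iloc[idx].to_csv(f'src.{name}.txt', header=False, index=False)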

xcyyygithub commented 2 years ago

We have finished our code for splitting src.dataset.txt; here is the code: https://github.com/Pax636/4990_IMDB/blob/main/split.py

First of all, we split the dataset into three parts: train, test, and validation sets.

Then, based on your suggestion of pandas, we collect the ids and the skill indexes for the full dataset.

After that, we generate arrays to store the indexes of the row locations for each part.

In the end, we generate three files for the three parts.

willwang636 commented 2 years ago

Now the code is able to generate src.dataset.txt, src.train.txt, src.valid.txt, src.test.txt, tgt.dataset.txt, tgt.train.txt, tgt.valid.txt, tgt.test.txt, and three index text files: index0_70.txt, index1_15.txt, index2_15.txt.

Here is the code: https://github.com/Pax636/4990_IMDB/blob/main/split.py

A sample result for index0_70.txt, index1_15.txt, index2_15.txt, src.train.txt, src.valid.txt, src.test.txt, and src.dataset.txt is shown in the attached screenshot.

Similar result for the tgt files.

hosseinfani commented 2 years ago

@Pax636 You nailed it. Thank you very much. This is what I wanted. So, please go ahead and run it on the entire dataset for dblp.

hosseinfani commented 2 years ago

@Pax636 Oh, one more thing. I forgot to mention this in our meeting. In order to differentiate skill numbers from member numbers, add the 's' character to the skill numbers in src.*.txt and 'm' to the member numbers in tgt.*.txt, for example as sketched below.
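For example, when writing each line (a sketch; the id lists are placeholders for a team's nonzero column indexes):

skill_ids = [3, 6]       # placeholder: nonzero column indexes of a team's skill vector
member_ids = [1, 2, 5]   # placeholder: nonzero column indexes of a team's member vector
src_line = ' '.join(f's{j}' for j in skill_ids)   # "s3 s6"
tgt_line = ' '.join(f'm{j}' for j in member_ids)  # "m1 m2 m5"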

Make sense?

willwang636 commented 2 years ago

Is this what you expect? The results for the tgt.*.txt files are similar, but with 'm' instead of 's'.

We will run it on the entire dataset later.

lilizhu1 commented 2 years ago

We succeeded in creating these txt files, but we found some empty lines in src.dataset.txt that stop the indexes from pointing to the correct lines. For example, the first blank line appears in src.dataset.txt at line 3439, so the indexes work correctly up to that point (e.g., index 3435 points to line 3435+1 = 3436). But after line 3439, every index has to be shifted by +2 to point to the right line in src.dataset.txt. The second empty line appears at line 49957 of src.dataset.txt, so after that every index has to be shifted by +3, and so on. We checked, and index 3439 corresponds to the 3438th row in the database, which is indeed empty.

After counting, there are 85 empty lines in total in src.dataset.txt and none in tgt.dataset.txt. This means some teams have members but an empty skill set. Why? How should we deal with such empty rows? Is it a defect in the database?

hosseinfani commented 2 years ago

@lilizhu1 @Pax636 @xcyyygithub Ideally, we should not have a team with empty skills! Until we find the root of the problem, can you please also create the files for the other filtered datasets (the ones with .filtered. in the filename) and see whether we have a similar issue? Thank you.

@VaghehDashti @karan96 Can you please check what the problem is?

lilizhu1 commented 2 years ago

We created the files for all three databases and concluded that all three have some empty lines in src.dataset.txt and no empty lines in tgt.dataset.txt.

hosseinfani commented 2 years ago

@lilizhu1 @Pax636 @xcyyygithub I opened a bug issue and @VaghehDashti will work on it.

We can assume these lines are noise in the dataset (an empty sentence in the source language is being translated). Please do not stop: while we're trying to solve the bug, go ahead and train the opennmt model on the files you generated from the main dblp dataset (./dblp.v12.json/teamsvecs.pkl). Thank you.

willwang636 commented 2 years ago

The model trains without error, but translation gives irrelevant results in which everything is replaced with <unk>.

The result of the translation in pred_26675.txt is shown in the attached screenshot.

The vocab seems fine.

hosseinfani commented 2 years ago

Are the vocab files generated by the library?

willwang636 commented 2 years ago

Yes, by using this

onmt_build_vocab -config toy_en_de.yaml -n_sample 10000

with 10000 replaced by the total number of rows in src.train.txt.

hosseinfani commented 2 years ago

Weird then! At first I thought the lib uses a regex that filters out numbers.

So, the only thing I can guess is that the lib uses pre-trained word embeddings. When it cannot find an embedding for a word, it replaces it with <unk>.

We have the embeddings of our skills and members; I will share them with you. But you have to figure out how the lib injects the embeddings, and then replace them with our own embeddings.

For now, let's stop here and focus on your exams. We'll continue afterward.

hosseinfani commented 2 years ago

@lilizhu1 @Pax636 @xcyyygithub Here is the summary and steps of work to do:

As I said, I think the problem is that the opennmt library uses pre-trained vectors (also known as embeddings) for English/German words. Obviously, our artificial words for skills ('s' followed by an id, like s20) and for members ('m' followed by an id, like m30) are not English/German words.

So, we have to generate our own vectors for the skill and member words. I already did that for you using an algorithm called Doc2Vec, which builds on another algorithm, Word2Vec. The library I used for this is called Gensim.

Accordingly, these are the steps to do:

  1. Find the place in opennmt that tries to fetch the vectors for input words
  2. Change the opennmt code such that it fetches the vectors from our vectors, not theirs
  3. Train and test opennmt again and see the prediction file.

Where are our vectors? I already shared a rar file that includes many files. Among them, there are four files whose names start with joint.emb.d100.*

You can use these lines of code to load the file:

>>> import gensim
>>> vectors = gensim.models.Doc2Vec.load('joint.emb.d100.w1.dm1.mdl')
>>> vectors['s1']
array([-0.21104698,  0.44721982,  1.649013  , -0.26125607, -1.2575299 ,  1.0593919 ,  1.4415383 , -0.4703626 , -0.30133364,  0.75287   ,        ... ],       dtype=float32)
>>> vectors['m1']
array([-0.08647262,  0.11245123, -0.02685132,  0.0871749 , -0.07743559,       -0.04030139,  0.03272053, -0.07359471, 0.07803687, -0.10604394,        ...],       dtype=float32)

As you see, each vector has 100 real (float32) values.

Also, you don't need to load all four files. You only need to load the main file; the gensim library loads the others automatically.

Hope this solves our problem. Meanwhile, let me know of any issues here.

willwang636 commented 2 years ago

We think we found the file that we need to edit: OpenNMT-py/onmt/modules/embeddings.py, at line 265.

One thing I'm not sure about: the files starting with joint.emb.d100 that you shared, are they the so-called pretrained embeddings or just embeddings?

What we are going to do next is to change the method at line 265 completely, since it reads the GloVe format (which seems to be a txt format); we need to change it to read the content of the model you shared with us. For now, we are not sure if this is the right path to take.

hosseinfani commented 2 years ago

You mean this line I believe: https://github.com/OpenNMT/OpenNMT-py/blob/908881444a814f7ab05c77c14316a2f05d135960/onmt/modules/embeddings.py#L265

Embeddings are vectors that we either have to create ourselves or that are already available. When they are already available, we refer to them as pre-trained embeddings. So, in our case, you already have the pre-trained embeddings.

If the actual training of opennmt calls this function (I believe so, but you have to check), then yes, we're on the right track. Hopefully, this solves our problem. We have to see the results though.

willwang636 commented 2 years ago

When I was trying to use vectors = gensim.models.Doc2Vec.load('joint.emb.d100.w1.dm1.mdl') to load the model, it gave me AttributeError: 'Doc2Vec' object has no attribute 'dv'.

It might be caused by using a different gensim version. In that case, could you tell me which version of gensim you are using for loading?

hosseinfani commented 2 years ago

Oh, gensim==3.8.3 from here: https://github.com/fani-lab/neural_team_formation/blob/4c1d0193704f261be53968d62050f023e277be45/requirements.txt#L11

willwang636 commented 2 years ago

Sorry, the only package I couldn't install is this pytrec_eval==0.5

It gives the error message shown in the attached log.

hosseinfani commented 2 years ago

don't try to install all packages in the file. Only install gensim with that version.

willwang636 commented 2 years ago

Yes, I installed them separately, and gensim is installed too, but it doesn't work when I try to install pytrec_eval.

xcyyygithub commented 2 years ago

Just like Zixun mentioned, when we install pytrec_eval==0.5 we always get "ERROR: Command errored out with exit status 1: ... ...". All of us have this problem; we don't know whether it is incompatible with Windows or whether there is some other reason. Also, we are confused by the command model[]: could you tell us how to use model['s1'] or model['m1']? We are stuck on this command and cannot continue. Thank you, professor. Have a good weekend.

hosseinfani commented 2 years ago

@xcyyygithub As I said above, you don't need that library; forget about it. Oh, my bad, I meant vectors['m1'] or vectors['s1']. The variable the file is loaded into is called vectors; you can name it whatever you like.

willwang636 commented 2 years ago

The joint.emb.d100.w1.dm1.mdl has been converted to a word2vec .txt format, which is similar to the GloVe format that OpenNMT supports except for an extra first line that has to be skipped. The .txt file is needed by the method at line 265, here: https://github.com/OpenNMT/OpenNMT-py/blob/908881444a814f7ab05c77c14316a2f05d135960/onmt/modules/embeddings.py#L265.

The joint.emb.d100.w1.dm1.mdl is converted by using vectors.save_word2vec_format("model.txt", binary=False) from gensim.

Here is an example of GloVe's .txt format used in OpenNMT.

Here is the word2vec .txt file converted from joint.emb.d100.w1.dm1.mdl.

It seems the only difference is an extra first line in the word2vec format, which was also discussed on the OpenNMT issues page: https://github.com/OpenNMT/OpenNMT-py/pull/580

I failed to directly use the doc2vec model you provided: OpenNMT uses a dictionary structure (emb = dict()) to extract and store the data from GloVe's .txt format, and I could not find a way to convert a doc2vec model into that dict() structure, so I came up with the idea of converting the .mdl to .txt as well. I have not found another way around it yet.
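For reference, a sketch of the conversion and header-stripping steps (assuming gensim==3.8.3; the output filenames are placeholders):

import gensim

# Load the pre-trained Doc2Vec model and dump its word vectors in word2vec text format.
vectors = gensim.models.Doc2Vec.load('joint.emb.d100.w1.dm1.mdl')
vectors.wv.save_word2vec_format('model.w2v.txt', binary=False)

# The word2vec text format starts with a "vocab_size vector_size" header line
# that the GloVe-style format expected by OpenNMT does not have, so drop it.
with open('model.w2v.txt') as fin, open('model.glove.txt', 'w') as fout:
    next(fin)
    for line in fin:
        fout.write(line)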

My computer is running the training right now, I will give a report tomorrow.

Update: it still gives <unk> for an unknown reason.

hosseinfani commented 2 years ago

@Pax636 Thank you for the update. Can you please share your code and the steps you took to train opennmt? I have to debug opennmt and see what's going wrong :(

Also, you can try debugging it too. Hopefully, we can find the problem.

thanks.

willwang636 commented 2 years ago

I think something must have gone wrong during the training process. I did some tests with a small amount of data, and it doesn't give <unk> anymore.

After I got the <unk> in the results, I decreased src.train.txt and tgt.train.txt to 70 lines instead of the previous 3,414,168 lines, kept another 15 lines in src.valid.txt and tgt.valid.txt, and kept the remaining 15 lines in src.test.txt and tgt.test.txt. All those files were split by the code we wrote before.

Here is the result after translating with the model trained on the small files above. Somehow it gives blank output but no <unk>.

The steps to have the above result in the image.

  1. Manually select and copy the first 100 lines from src.train.txt and tgt.train.txt. (I'm uploading the 6 .txt files I already split from the dataset, including these two, here on Teams.) Then create src.dataset.txt and tgt.dataset.txt and paste them in accordingly.

  2. Run the split_test.py code here https://github.com/Pax636/4990_IMDB/blob/main/split_test.py by importing the code and running its method. It should be placed in the same folder as src.dataset.txt and tgt.dataset.txt.

  3. Then you will have the 6 txt files needed for training on OpenNMT; link them in the config file toy_en_de.yaml. Here is what I wrote; you should change the paths. https://github.com/Pax636/4990_IMDB/blob/main/toy_en_de.yaml

  4. The model.txt in the config file is created by two lines of code; you might create a python file and include these two: embs = gensim.models.Doc2Vec.load('joint.emb.d100.w1.dm1.mdl') and embs.save_word2vec_format("model.txt", binary=False)

my file is here, https://github.com/Pax636/4990_IMDB/blob/main/load.py

The model.txt is a 5.9 GB txt file, it takes a while to create, and I'm also uploading it to Teams.

  5. Run the commands for training OpenNMT. I followed the steps here with a little change, https://github.com/OpenNMT/OpenNMT-py :

     onmt_build_vocab -config toy_en_de.yaml -n_sample 100
     onmt_train -config toy_en_de.yaml
     onmt_translate -model toy-ende/run/model_step_5.pt -src toy-ende/src.test.txt -output toy-ende/pred_100.txt -gpu 0 -verbose

Then you should get the same result as in my image. Sorry we didn't solve it; let me know if anything comes up.

hosseinfani commented 2 years ago

@Pax636 I really like your detailed logs/reports. Thank you! I'm busy with some deadlines. Please give me some time. I will try to reproduce your result and see what the problem may be.

thangk commented 3 months ago

For the past two days I have been running nmt with toy.dblp.v12.json, and each time the predictions come out as <unk>. Upon searching for answers, I suspect it could be due to the small sample size of the toy file.

The fold2 results and the predictions are shown in the attached screenshots.

Unfortunately, I cannot view Pax636's code linked to their GitHub repo.

@hosseinfani , is this the current status for you also when running with the toy data?

hosseinfani commented 3 months ago

@thangk have a look at https://github.com/fani-lab/OpeNTF/issues/79#issuecomment-1018634296

@jamil2388 any idea about this

thangk commented 3 months ago

@thangk have a look at #79 (comment)

@jamil2388 any idea about this

Thanks.

I already shared a rar file that includes many files. Among the files, there are four files that start with joint.emb.d100.*

I can't find this rar file you're referring to anywhere in this repo. Is it somewhere else?

jamil2388 commented 3 months ago

I already talked to Kap about the summary of the process. We need to generate the word2vec embedding files for the toy data first, then feed those files to the model to let it know about the pre-trained embeddings. When I tried to work on this, I did not use the files mentioned as a rar file here. I recommended that he generate the word2vec embeddings (or any other embeddings) himself to be clear about the approach.


thangk commented 3 months ago

@hosseinfani

After successfully installing the required dependencies using conda, I was able to run the main.py in team2vec and re-ran the nmt model. This time, it produced sensible results.

hosseinfani commented 3 months ago

@thangk good! two things:

So, now, it's time to run on entire datasets.

thangk commented 3 months ago

So, now, it's time to run on entire datasets.

I ran it with the entire dataset of dblp last night but I had errors. I'll look into it today.

hosseinfani commented 3 months ago

@thangk The error is related to creating the sparse matrix in parallel.

While we look into fixing the issue, note that the sparse matrices for the entire datasets are available in the OpeNTF channel; you can use them in the meantime.

thangk commented 3 months ago

I've downloaded teamsvecs.pkl from the OpeNTF channel and used it with the main main.py, and I think it's trying to load a lot of the data it's working on into memory.

hosseinfani commented 3 months ago

@thangk So, I added you to our server channel; put it on the server.

thangk commented 3 months ago

I've transferred the project files onto the matrix server and am now attempting to run on the full dataset, but first I'm testing it out on the toy dataset.