hosseinfani opened 3 years ago
How can I properly read the pickle in text form? I'm a bit confused about how to manipulate the result of pickle.load(). If I simply use print(np.matrix(teams)) from numpy, it gives a blank result.
Here is the code I've been practicing with; I put it in the same folder as the pickled data and import it: https://github.com/Pax636/4990_IMDB/blob/main/readpkl.py
Thank you,
Hi, it seems the problem is either with the file itself or that the code that created the pickle file is old. I will send you an updated pickle file.
@Pax636 I checked the files; they're ok. Make sure you're using the files in the following path:
https://github.com/fani-lab/neural_team_formation/tree/main/data/preprocessed/dblp/toy.dblp.v12.json
Hi, @hosseinfani
Do you know what might cause a 'ModuleNotFoundError: No module named 'cmn'' when trying to call pickle.load()?
Thanks,
Hi @Pax636
teams.pkl contains objects that are serialized into binary. The class definitions of those objects are in the cmn module, which is why unpickling it fails without that module on the path.
You don't need to open this file. You need the sparse matrix teamsvecs.pkl and probably indexes.pkl.
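For reference, a minimal sketch of loading that file, assuming teamsvecs.pkl is a pickled dict of scipy sparse matrices (the exact keys, e.g. 'skill' and 'member', are assumptions and depend on how the file was built):

import pickle

# no 'cmn' classes are needed for this file, only the pickle module
with open('teamsvecs.pkl', 'rb') as f:
    teamsvecs = pickle.load(f)

print(teamsvecs.keys())                                     # inspect the available matrices
print(teamsvecs['skill'].shape, teamsvecs['member'].shape)  # assumed keys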
How to split a range of row numbers into three parts randomly:
https://www.malicksarr.com/split-train-test-validation-python/
https://github.com/fani-lab/neural_team_formation/blob/daf2a93753175bbb8b5d2d83e4ae6c4f6655a822/src/main.py#L13
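As a rough illustration of the idea in those links (a sketch, assuming scikit-learn is available), the row numbers can be split 70/15/15 with two calls to train_test_split:

from sklearn.model_selection import train_test_split

n_rows = 10000  # e.g. the number of teams/lines in the dataset
row_ids = list(range(n_rows))

# 70% train, then split the remaining 30% evenly into valid and test
train_ids, rest_ids = train_test_split(row_ids, train_size=0.7, random_state=7, shuffle=True)
valid_ids, test_ids = train_test_split(rest_ids, train_size=0.5, random_state=7, shuffle=True)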
Our code generates the pairs of (skill => member) as a dataset from ./data/preprocessed/dblp/toy.dblp.v12.json/teamsvecs.pkl
The code is here: https://github.com/Pax636/4990_IMDB/blob/main/rpk.py
The results are src-test.txt and tgt-test.txt:
We haven't tested the code on the large dataset and don't know where to find the complete dataset. Would you mind pointing us to it?
Now we are working on splitting src-test.txt and tgt-test.txt.
We succeeded in splitting the file, but got stuck on adding content to a specific row in the newly created src.train.txt file. The problem occurs at line 129 of https://github.com/Pax636/4990_IMDB/blob/main/split.py
A result of splitting src-test.txt:
@Pax636
Let's call src-test.txt and tgt-test.txt => src.dataset.txt and tgt.dataset.txt
Then split them into three subsets:
1) src.train.txt and tgt.train.txt
2) src.valid.txt and tgt.valid.txt (I think opennmt needs this, right?)
3) src.test.txt and tgt.test.txt
About the last picture, why did you split into test1 and test2? Also, the second line of logs shows the index of the numbers, right? If so, you don't need them.
About splitting the actual content based on these numbers, you can read the actual src.dataset.txt and tgt.dataset.txt into a pandas DataFrame and pass the numbers in each split as an index to iloc. For example:
import pandas as pd
train_split_idx = [1, 5, 15, 19]  # example row numbers for the train split
df_src = pd.read_csv('src.dataset.txt', header=None)  # header=None: the first line is data, not column names
df_train = df_src.iloc[train_split_idx]  # select only the rows in this split
df_train.to_csv('src.train.txt', index=False, header=False)
About test1 and test2: I thought we needed to split src-test.txt into a structure like the one at https://www.malicksarr.com/split-train-test-validation-python/, which also mentions that there are three parts (train, test, and validation sets).
If by the second line of logs you mean these lines: they are rows of indexes of skills in src-test.txt, which I put into an array.
What I was doing was putting the indexes of skills into one array and the ids into another array, then splitting them together with train_70X, test_30X, train_70Y, test_30Y = train_test_split(id, skill, train_size=0.7, random_state=7, shuffle=True) to remember the row indexes, and putting them back into 3 separate text files later.
I will try to use a pandas DataFrame since it can read data directly from a text file, src-test.txt.
Thanks,
@Pax636 You did it right, but the main file is the dataset file:
About the second part of your message, you don't need to do that. Yes, try pandas.DataFrame and pass the randomly split row ids to it; it selects only those rows. Here is an example:
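A minimal sketch of what such an example could look like, assuming the dataset files have one team per line and that the randomly split row ids (e.g. train_ids) come from train_test_split as above:

import pandas as pd

train_ids = [1, 5, 15, 19]  # hypothetical row ids for the train split
df_src = pd.read_csv('src.dataset.txt', header=None)  # one team (its skill ids) per line
df_tgt = pd.read_csv('tgt.dataset.txt', header=None)  # the matching members per line

# iloc picks exactly the listed rows, keeping src and tgt aligned
df_src.iloc[train_ids].to_csv('src.train.txt', index=False, header=False)
df_tgt.iloc[train_ids].to_csv('tgt.train.txt', index=False, header=False)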
We have finished our code for splitting src.dataset.txt; here is the code: https://github.com/Pax636/4990_IMDB/blob/main/split.py
The three files look like these:
First of all, we need to split the dataset into three parts: train, test, and validation sets.
Then, based on your suggestion of pandas, we collect the ids and the indexes of skills of the full dataset.
After that, we generate arrays that store the indexes of their locations for each part.
In the end, we generate three files for the three parts.
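A hedged sketch of those last steps (an illustration, not the actual split.py; it assumes df_src/df_tgt and the split id arrays from the earlier sketches, and uses the index file names listed below):

splits = {'train': train_ids, 'valid': valid_ids, 'test': test_ids}
index_files = {'train': 'index0_70.txt', 'valid': 'index1_15.txt', 'test': 'index2_15.txt'}

for name, ids in splits.items():
    # 1) store the row indexes of this part
    with open(index_files[name], 'w') as f:
        f.write('\n'.join(map(str, ids)))
    # 2) select the same rows from source and target and write the part files
    df_src.iloc[ids].to_csv(f'src.{name}.txt', index=False, header=False)
    df_tgt.iloc[ids].to_csv(f'tgt.{name}.txt', index=False, header=False)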
Now the code is able to generate src.dataset.txt, src.train.txt, src.valid.txt, src.test.txt, tgt.dataset.txt, tgt.train.txt, tgt.valid.txt, tgt.test.txt, and three index text files: index0_70.txt, index1_15.txt, index2_15.txt.
Here is the code: https://github.com/Pax636/4990_IMDB/blob/main/split.py
A sample result for index0_70.txt, index1_15.txt, index2_15.txt, src.train.txt, src.valid.txt, src.test.txt, and src.dataset.txt:
Similar result for the tgt files.
@Pax636 You nailed it. Thank you very much. This is what I wanted. So, please go ahead and run it on the entire dataset for dblp.
@Pax636 Oh, one more thing. I forgot to mention this in our meeting. In order to distinguish skill numbers from member numbers, add the 's' character to the skill numbers in the src.*.txt files and 'm' to the member numbers in the tgt.*.txt files.
Make sense?
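A tiny illustration of that prefixing step, assuming the files contain space-separated ids, one team per line (the file names here are just examples):

def prefix_file(path_in, path_out, prefix):
    # e.g. '12 40 7' -> 's12 s40 s7'
    with open(path_in) as fin, open(path_out, 'w') as fout:
        for line in fin:
            fout.write(' '.join(prefix + tok for tok in line.split()) + '\n')

prefix_file('src.train.txt', 'src.train.prefixed.txt', 's')
prefix_file('tgt.train.txt', 'tgt.train.prefixed.txt', 'm')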
If this is what you expect: there is a similar result for the tgt.*.txt files, but with 'm' instead of 's'. We will run it on the entire dataset later.
We succeeded in creating these txt files, but we found some empty lines in src.dataset.txt that cause the indexes to stop pointing to the correct lines. For example, the first blank line appears in src.dataset.txt at line 3439, so the indexes work correctly only up to line 3439 (3435+1=3436). After line 3439, all indexes have to be shifted by +2 to point to the correct data in src.dataset.txt. The second empty line appears at line 49957 of src.dataset.txt, so after that all indexes have to be shifted by +3, and so on. After checking, the database row corresponding to index 3439 (the 3438th row) is indeed empty.
After counting, there are 85 empty lines in total in src.dataset.txt and no empty lines in tgt.dataset.txt. This means there are teams whose skill side is empty even though the member side is not. Why? How should we deal with such empty rows? Is it a defect of the dataset?
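(For reference, a small sketch that counts and locates such blank lines:)

with open('src.dataset.txt') as f:
    empty = [i for i, line in enumerate(f, start=1) if not line.strip()]
print(len(empty), empty[:5])  # number of empty lines and the first few line numbers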
@lilizhu1 @Pax636 @xcyyygithub Ideally, we should not have a team with empty skills! Until we find the root of the problem, can you please also create the files for the other, filtered datasets (the ones that have .filtered. in the filename) and see whether we have a similar issue? Thank you.
@VaghehDashti @karan96 Can you please check what the problem is?
We created files for all three databases and concluded that all three databases have some empty lines in src.dataset.txt and no empty lines in tgt.dataset.txt.
@lilizhu1 @Pax636 @xcyyygithub I opened a bug issue and @VaghehDashti will work on it.
We can assume these lines are noise in the dataset (an empty sentence in the source language getting translated). Please do not stop: while we're trying to solve the bug, go ahead and train the opennmt model on your generated files for the main dblp dataset (./dblp.v12.json/teamsvecs.pkl). Thank you.
The model can be trained without error, but translation gives an irrelevant result in which everything is replaced with <unk>.
The result of translation in pred_26675.txt:
The vocab seems fine,
Were the vocab files generated by the library?
Yes, by using this
onmt_build_vocab -config toy_en_de.yaml -n_sample 10000
with 10000 replaced by the total number of rows in src.train.txt.
Weird then! At first I thought that the lib uses a regex that filters numbers.
So, the only thing I can guess is that the lib uses word embeddings for the words; when it cannot find an embedding for a word, it replaces it with <unk>.
We have the embeddings of our skills and members. I will share them with you, but you have to figure out how the lib injects the embeddings. Then you have to replace them with our own embeddings.
For now, let's stop it and you focus on your exams. We'll continue after.
@lilizhu1 @Pax636 @xcyyygithub Here is the summary and steps of work to do:
As I said, I think the problem is that the opennmt library uses pre-trained vectors (also known as embeddings) for English/German words. Obviously, our artificial words for skills ('s' followed by an id, like s20) and for members ('m' followed by an id, like m30) are not in the English/German language.
So, we have to generate our own vectors for the skill and member words. I already did that for you using an algorithm called Doc2Vec, which builds on another algorithm, Word2Vec. The library I used for this is called Gensim.
Accordingly, these are the steps to do:
Where are our vectors? I already shared a rar file that includes many files. Among them, there are four files that start with joint.emb.d100.*
you can use this line of code to load the file:
>>> import gensim
>>> vectors = gensim.models.Doc2Vec.load('joint.emb.d100.w1.dm1.mdl')
>>> vectors['s1']
array([-0.21104698, 0.44721982, 1.649013 , -0.26125607, -1.2575299 , 1.0593919 , 1.4415383 , -0.4703626 , -0.30133364, .75287 , ... ], dtype=float32)
>>> vectors['m1']
array([-0.08647262, 0.11245123, -0.02685132, 0.0871749 , -0.07743559, -0.04030139, 0.03272053, -0.07359471, 0.07803687, -0.10604394, ...], dtype=float32)
As you see, each vector has 100 real values, roughly in the [-1, +1] range.
Also, you don't need to load all four files. You only need to load the main file and the gensim library loads others automatically.
Hope this solves our problem. Meanwhile, let me know of any issues here.
We think we found the file that I need to edit; it is OpenNMT-py/onmt/modules/embeddings.py at line 265.
One thing I'm not sure about: the files starting with joint.emb.d100 that you shared, are these the so-called pretrained embeddings or just embeddings?
What we are going to do next is to change the method at line 265 completely, since it reads the GloVe format (which seems to be a txt format); we need to change it to read the content of the model you shared with us. For now, we are not sure if this is the right path to take.
You mean this line I believe: https://github.com/OpenNMT/OpenNMT-py/blob/908881444a814f7ab05c77c14316a2f05d135960/onmt/modules/embeddings.py#L265
Embeddings are vectors that we either have to create or that are already available. When they're already available, we refer to them as pre-trained embeddings. So, in our case, you already have the pretrained embeddings.
If during the actual training opennmt calls this function (I believe so, but you have to check), then yes, we're on the right track. Hopefully this solves our problem; we have to see the results though.
When I try to use vectors = gensim.models.Doc2Vec.load('joint.emb.d100.w1.dm1.mdl') to load the model, it gives me AttributeError: 'Doc2Vec' object has no attribute 'dv'.
It might be caused by using a different gensim version; in that case, could you tell me which version of gensim you are using to load it?
Oh, gensim==3.8.3 from here: https://github.com/fani-lab/neural_team_formation/blob/4c1d0193704f261be53968d62050f023e277be45/requirements.txt#L11
Sorry, the only package I couldn't install is this pytrec_eval==0.5
It gives this error message,
Don't try to install all the packages in the file. Only install gensim with that version.
Yes, I installed them separately, and gensim is installed, but it doesn't work when I try to install pytrec_eval.
Just like Zixun mentioned, when we install pytrec_eval==0.5 we always get ERROR: Command errored out with exit status 1: ... ... All of us have this problem. We don't know whether it is incompatible with Windows or there is some other reason. Also, we are confused about the model[] command; could you tell us how to use model['s1'] or model['m1']? We were stuck on this and could not continue. Thank you, professor. Have a good weekend.
@xcyyygithub As I said above, you don't need that library. Forget about it. Oh, my bad, I meant vectors['m1'] or vectors['s1']. The variable the file is loaded into is called vectors; you can name it whatever you want.
The joint.emb.d100.w1.dm1.mdl has been converted to the .txt word2vec format, which is similar to the GloVe format that OpenNMT also supports, except for an extra first line. The .txt file is needed by the method at line 265, here: https://github.com/OpenNMT/OpenNMT-py/blob/908881444a814f7ab05c77c14316a2f05d135960/onmt/modules/embeddings.py#L265.
The joint.emb.d100.w1.dm1.mdl is converted by using vectors.save_word2vec_format("model.txt", binary=False) from gensim.
Here is an example of GloVe's .txt format used in OpenNMT.
Here is the word2vec .txt file converted from joint.emb.d100.w1.dm1.mdl.
It seems the only difference is the extra first line in the word2vec format, which was also discussed on the OpenNMT issues page: https://github.com/OpenNMT/OpenNMT-py/pull/580
We failed to use the doc2vec model you provided directly. OpenNMT uses a dictionary structure (emb = dict()) to extract and store the data from GloVe's .txt format. I couldn't find a way to convert a doc2vec model to that dict() structure, so I came up with the idea of converting the .mdl to .txt instead. I haven't found another way around it yet.
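A hedged sketch of that conversion, assuming gensim 3.8.3 and that the GloVe-style file is simply the word2vec .txt file without its first header line:

import gensim

# load the shared Doc2Vec model and dump its word vectors in word2vec text format
model = gensim.models.Doc2Vec.load('joint.emb.d100.w1.dm1.mdl')
model.wv.save_word2vec_format('model.w2v.txt', binary=False)

# drop the first line ('<vocab_size> <dimensions>') to mimic GloVe's plain format
with open('model.w2v.txt') as fin, open('model.glove.txt', 'w') as fout:
    next(fin)  # skip the word2vec header line
    fout.writelines(fin)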
My computer is running the training right now, I will give a report tomorrow.
Update: It still gives <unk> for an unknown reason.
@Pax636 Thank you for the update. Can you please share your code and the steps you take to train opennmt? I have to debug opennmt and see what's going wrong :(
Also, you can do it too. Hopefully, we can find the problem.
Thanks.
I think something must have gone wrong during the training process. I did some tests with a small amount of data and it doesn't give <unk> anymore.
After I got the <unk> in the result, I decreased src.train.txt and tgt.train.txt to 70 lines instead of the previous 3414168 lines, kept 15 lines in src.valid.txt and tgt.valid.txt, and kept the remaining 15 lines in src.test.txt and tgt.test.txt. All these files were split by the code we wrote before.
Here is the result after translating with the model trained on the small files above. Somehow it gives blanks but no <unk>.
The steps to get the above result in the image:
Manually select and copy the first 100 lines from src.train.txt and tgt.train.txt (I'm uploading the 6 .txt files I already split from the dataset, including these two, here on Teams). Then create src.dataset.txt and tgt.dataset.txt and paste the lines into them accordingly.
Run the split_test.py code here https://github.com/Pax636/4990_IMDB/blob/main/split_test.py by importing it and running its method. It should be placed in the same folder as src.dataset.txt and tgt.dataset.txt.
Then you will have the 6 txt files needed for training with OpenNMT; link them in the config file toy_en_de.yaml. Here is what I wrote; you should change the paths: https://github.com/Pax636/4990_IMDB/blob/main/toy_en_de.yaml
The model.txt in the config file is created by two lines of code; you might create a python file and include these two: embs = gensim.models.Doc2Vec.load('joint.emb.d100.w1.dm1.mdl') and embs.save_word2vec_format("model.txt", binary=False)
My file is here: https://github.com/Pax636/4990_IMDB/blob/main/load.py
The model.txt is a 5.9 GB txt file; it takes a while to create, and I'm also uploading it to Teams.
Then you should get the same result as in my image. Sorry we didn't solve it; let me know if anything comes up.
@Pax636 I really like your detailed logs/reports. Thank you! I'm busy with some deadlines. Please give me some time. I will try to reproduce your result and see what the problem may be.
For the past two days I ran nmt with toy.dblp.v12.json, and each time the predictions come out:
fold2 results:
predictions are:
Unfortunately, I cannot view Pax636's code linked to their GitHub repo.
@hosseinfani, is this also the current status for you when running with the toy data?
@thangk have a look at https://github.com/fani-lab/OpeNTF/issues/79#issuecomment-1018634296
@jamil2388 any idea about this
Thanks.
Regarding "I already share a rar file that includes many files. Among the files, there are four files that start with joint.emb.d100.*": I can't find this rar file you're referring to anywhere in this repo. Is it somewhere else?
I already talked to Kap about the summary of the process. We need to generate the word2vec embedding files for the toy data first, then feed those files to the model to let it know about the pretrained embeddings. When I tried to work on that, I did not use the files that are mentioned as a rar file here. I recommended that he generate the word2vec embeddings, or any other embeddings, by himself to be clear about the approach.
@hosseinfani
After successfully installing the required deps using conda, I was able to run main.py in team2vec and re-run the nmt model. This time, it produced sensible results.
@thangk good! two things:
So, now, it's time to run on entire datasets.
I ran it with the entire dataset of dblp last night but I had errors. I'll look into it today.
@thangk The error is related to creating the sparse matrix in parallel.
While we look into fixing the issue, the sparse matrices for the entire datasets are available in the OpeNTF channel. You can use them in the meantime.
I've downloaded teamsvecs.pkl from the OpeNTF channel and used it with the main main.py, and I think it's trying to load most of the data it works on into memory.
@thangk so, I added you to our server channel. Put it on the server.
I've transferred the project files onto the matrix server and am now attempting to run on the full dataset, but first I'm testing it out on the toy dataset.
@lilizhu1 @xcyyygithub @Pax636 Now that you have successfully run opennmt on its sample en-de translation, we want you to run it on our sparse matrix. So, given a sparse matrix, each row of which is a team like:
We want to train a translator between skills and members. This translation assumes that the source language talks in skill indexes and the target language talks in member indexes.
The sparse matrix is already available for the dblp dataset (teams of researchers) in ./data/preprocessed/dblp/toy.dblp.v12.json/teamsvecs.pkl
You have to open the file and load the sparse matrix like in: https://github.com/fani-lab/neural_team_formation/blob/97b9fc88f3bee2dc454ed22b51b561b81a3a0881/src/cmn/team.py#L91
Then, write python code to generate the pairs of (skill => member) as a dataset for opennmt library.
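A rough sketch of what that pair generation could look like, assuming teamsvecs.pkl holds a dict with 'skill' and 'member' scipy sparse matrices (assumed keys; adapt to the actual structure) and writing one team per line:

import pickle

with open('./data/preprocessed/dblp/toy.dblp.v12.json/teamsvecs.pkl', 'rb') as f:
    vecs = pickle.load(f)

skills, members = vecs['skill'], vecs['member']  # assumed keys

with open('src-test.txt', 'w') as src, open('tgt-test.txt', 'w') as tgt:
    for i in range(skills.shape[0]):
        # the nonzero column indexes of row i are the skill/member ids of team i
        src.write(' '.join(map(str, skills[i].nonzero()[1])) + '\n')
        tgt.write(' '.join(map(str, members[i].nonzero()[1])) + '\n')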
Please let me know if you have any questions here.