MS20190155 / Measuring-Corporate-Culture-Using-Machine-Learning

Code Repository for MS20190155

Division by Zero Error #6

Closed · HuzaifahSaleem closed this issue 2 years ago

HuzaifahSaleem commented 2 years ago

Hey Authors, I have been trying to run create_dicty.py on a text file that I converted from a CSV file using pandas. I am getting a division by zero error on the word2vec.n_similarity function call, which says at least one of the 3 lists is empty. But I have checked: no list is empty, not the model, the seed words, or the expanded words.

Please help
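
For context, here is a minimal snippet that reproduces the same error message (a toy sketch, not our actual pipeline):

```python
from gensim.models import Word2Vec

# Tiny toy corpus so the model trains instantly.
sentences = [["culture", "matters"], ["teamwork", "wins"]]
model = Word2Vec(sentences, size=10, min_count=1)

# n_similarity compares the mean vectors of two word lists;
# passing an empty list raises gensim's ZeroDivisionError.
print(model.wv.n_similarity(["culture"], ["teamwork"]))  # works
print(model.wv.n_similarity([], ["teamwork"]))           # ZeroDivisionError
```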

maifeng commented 2 years ago

Can you share the version of gensim? (`gensim.__version__`)

When you run create_dicty.py, are you using the example data and seed words or your own?

HuzaifahSaleem commented 2 years ago

gensim 3.8.3, and our own data. Do you want to see a sample of it? It's Glassdoor review text.

HuzaifahSaleem commented 2 years ago

The seed words are the same, since we are pretty much trying to achieve the same goal.

HuzaifahSaleem commented 2 years ago

[screenshot of the error traceback]

HuzaifahSaleem commented 2 years ago

After putting in some print statements, we reached the root cause of the error: in culture_dictionary.py, in the deduplicate_keywords function, dimension_seed_words comes back as an empty list:

```python
dimension_seed_words = [
    word for word in seed_words[dimension] if word in word2vec_model.wv.vocab
]
```

This line returns an empty list. If we remove the `word in word2vec_model.wv.vocab` condition, it does return a list, but then it complains again that the words are not in the vocabulary.
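
A quick check along these lines (a sketch using the seed_words dict and word2vec_model from the function above, with hypothetical contents) shows which seed words are missing from the vocabulary:

```python
# Sketch of the print statements we added: for each dimension,
# list the seed words that are absent from the trained vocabulary.
for dimension, words in seed_words.items():
    missing = [w for w in words if w not in word2vec_model.wv.vocab]
    print(dimension, "-> missing:", missing)
```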

maifeng commented 2 years ago

Yes, that means none of the seed words are in the word2vec vocabulary. By default, a word needs to appear at least 5 times in the corpus to be included. If you have a small corpus, you can add the argument min_count=1 to the train_w2v_model() function in clean_and_train.py. Also, make sure that the seed words are in lowercase.
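
For reference, a minimal sketch of how min_count affects the vocabulary when training gensim 3.x directly (toy sentences, not the repo's actual training pipeline):

```python
from gensim.models import Word2Vec

# Toy corpus: each inner list is one tokenized, lowercased sentence.
sentences = [["integrity", "matters"], ["we", "value", "teamwork"]]

# With the default min_count=5, every word here would be dropped,
# since each appears only once; min_count=1 keeps them all.
model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)

# In gensim 3.x, the vocabulary is exposed as model.wv.vocab.
print("teamwork" in model.wv.vocab)  # True
```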

HuzaifahSaleem commented 2 years ago

Hey, thanks a lot for the help, it worked like magic. Can you please suggest some tips on how to improve its accuracy? I am assuming that increasing the min_count value also improves the accuracy? What else can we do? I think we cannot replace word2vec with BERT or something, since the code is filled with word2vec functions.

maifeng commented 2 years ago

It is always helpful to have a larger corpus. Words that appear only a few times may not have good representations after training. The code does not train BERT; it produces static word vectors.
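
One quick way to spot poorly represented words (a sketch, assuming gensim 3.x, which stores corpus counts on the vocab entries):

```python
# Print the corpus frequency of each word of interest; words with
# very low counts are likely to have noisy vectors.
for word in ["integrity", "teamwork"]:
    if word in word2vec_model.wv.vocab:
        print(word, word2vec_model.wv.vocab[word].count)
```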

HuzaifahSaleem commented 2 years ago

Hey @maifeng, how are you doing? I was wondering if there is any way to reduce the time for parsing. I am running 10,000 text blocks through parse_parallel and it takes almost 45 minutes. Is there any way to make it faster, other than just using more CPU cores?

HuzaifahSaleem commented 2 years ago

Hey @maifeng, are you there?