Closed HuzaifahSaleem closed 2 years ago
Can you share the version of gensim (gensim.__version__)?
When you run create_dicty.py, are you using the example data and seed words or your own?
gensim 3.8.3, our own data. Do you want to see a sample of it? It's Glassdoor review text. The seed words are the same, since we are pretty much trying to achieve the same goal.
After adding some print statements, we found the root cause of the error: in culture_dictionary.py, in the deduplicate_keywords function, dimension_seed_words ends up as an empty list:

dimension_seed_words = [word for word in seed_words[dimension] if word in word2vec_model.wv.vocab]

This line returns an empty list. If we remove the wv.vocab condition, it does return a list, but then the model still complains that the words are not in the vocabulary.
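For anyone hitting the same thing, here is a minimal sketch of that filter with a mock vocabulary standing in for word2vec_model.wv.vocab (the names and values are illustrative, not from the repo). It shows the two common ways seed words get dropped: wrong case and absence from the trained vocabulary:

```python
# Mock vocabulary standing in for word2vec_model.wv.vocab, which in
# gensim 3.x is a dict mapping each trained word to a Vocab object.
mock_vocab = {"culture": 0, "pay": 1}

# Hypothetical seed dictionary for one dimension (illustrative names).
seed_words = {"talent": ["Culture", "pay", "integrity"]}
dimension = "talent"

# Words absent from the vocab are silently dropped, and the membership
# check is case-sensitive: "Culture" != "culture". "integrity" never
# appeared often enough to be trained, so it is dropped too.
dimension_seed_words = [w for w in seed_words[dimension] if w in mock_vocab]
print(dimension_seed_words)  # ['pay']
```

If every seed word fails one of those two checks, the comprehension returns an empty list exactly as described above.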
Yes, it means none of the seed words are in the word2vec vocabulary. By default, a word needs to appear at least 5 times to be included. If you have a small corpus, you can add the argument min_count=1 to the train_w2v_model() function in clean_and_train.py. Also, make sure that the seed words are in lowercase.
Hey, thanks a lot for the help, it worked like magic. Can you suggest some tips on how to improve its accuracy? I am assuming that increasing the min_count value also improves accuracy? What else can we do? I think we cannot replace word2vec with BERT or something, since the code is built around word2vec functions.
It is always helpful to have a larger corpus. Words that appear only a few times may not have good representations after training. The code does not train BERT; it produces static word vectors.
Hey @maifeng, how are you doing? I was wondering if there is any way to reduce the parsing time. I am running 10,000 text blocks through parse_parallel and it takes almost 45 minutes. Is there any way to make it more optimal, other than just using more CPU cores?
Hey @maifeng, are you there?
Hey authors, I have been trying to run create_dicty.py on a text file converted from a CSV file using pandas. I am getting a division-by-zero error on the word2vec.n_similarity call, and it says at least one of the 3 lists is empty. But I have checked: no list is empty. The model, the seed words, and the expanded words are all populated.
Please help
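One possible cause, sketched below with illustrative names (this helper is not from the repo): the raw lists can be non-empty while the vocabulary-filtered subsets that actually reach n_similarity are empty, which triggers the "at least one of the passed lists is empty" ZeroDivisionError. Printing the filtered lists just before the call usually reveals this:

```python
# Hedged debugging sketch: check what survives the vocabulary filter
# before n_similarity is called.
def in_vocab_subset(words, vocab):
    """Return only the words actually present in the trained vocabulary."""
    return [w for w in words if w in vocab]

vocab = {"culture", "pay", "team"}       # stand-in for model.wv
seed_words = ["Integrity", "honesty"]    # non-empty as a raw list, but...
expanded_words = ["team", "culture"]

print(in_vocab_subset(seed_words, vocab))      # [] -> n_similarity would fail
print(in_vocab_subset(expanded_words, vocab))  # ['team', 'culture']
```

If one of the filtered lists prints as empty, the earlier fixes in this thread (min_count=1 and lowercasing the seed words) are the likely remedy.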