derekgreene / topic-model-tutorial

Tutorial on topic models in Python with scikit-learn

(NMF) Calculating coherence generates different outputs each time the method is called #1

Open Sannidhi-17 opened 3 years ago

Sannidhi-17 commented 3 years ago

I am trying to calculate the coherence value for each topic, but every time I run my code it produces different values.

It would be a great help if anyone could answer this.

Thank you in advance.

def build_w2c(self, raw_documents):
    docgen = TokenGenerator(raw_documents, self.stop_words)
    new_list = []
    for each in docgen.documents:
        new_list.append(each.split(" "))
    # print(new_list)
    # Build the word2vec model
    self.w2v_model = gensim.models.Word2Vec(size=500, min_count=0.0005, sg=1)
    self.w2v_model.build_vocab(sentences=new_list)
    return self.w2v_model

def get_descriptor(self, all_terms, H, topic_index, top):
    # reverse sort the values to sort the indices
    top_indices = np.argsort(H[topic_index, :])[::-1]
    # now get the terms corresponding to the top-ranked indices
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append(all_terms[term_index])
    return top_terms

def get_coherence(self, k, terms, H):
    k_values = []
    term_rankings = []
    coherences = []
    dict = {}
    for topic_index in range(1, k):
        print(topic_index)
        descriptor = self.get_descriptor(terms , H, topic_index, 10)
        term_rankings.append(descriptor)
        # Now calculate the coherence based on our Word2vec model
        # coherence = self.calculate_coherence(term_rankings)
        coherences.append(self.calculate_coherence(term_rankings))
        print("K=%02d: Coherence=%.4f" % (topic_index, coherences[-1]))
        k_values.append(topic_index)
        dict[topic_index] = coherences[-1]
    max_key = max(dict, key=dict.get)
    return k_values, coherences, max_key

def calculate_coherence(self, term_rankings):
    overall_coherence = 0.0
    for topic_index in range(len(term_rankings)):
        # check each pair of terms
        pair_scores = []
        for pair in combinations(term_rankings[topic_index], 2):
            pair_scores.append(self.w2v_model.similarity(pair[0], pair[1]))
        # get the mean for all pairs in this topic
        topic_score = sum(pair_scores) / len(pair_scores)
        overall_coherence += topic_score
    # get the mean score across all topics
    return overall_coherence / len(term_rankings)

This is the code I am using in my project.

Required output: the coherence should be the same each time I run my code.

If you can help me resolve this, it would be a great help. Thank you so much.

TimoFlesch commented 3 years ago

Hi,

the author of this tutorial might have a better understanding of this than I do, but it looks like gensim's word2vec model does not produce deterministic outputs by default. That is, each time you run the model, the learned w2v representation will be slightly different. According to this Stack Overflow thread https://stackoverflow.com/questions/34831551/ensure-the-gensim-generate-the-same-word2vec-model-for-different-runs-on-the-sam there are ways to enforce deterministic behaviour for reproducible results.
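
A rough sketch of what that could look like, assuming a gensim version whose Word2Vec constructor accepts `sentences`, `seed` and `workers`: fix the seed, use a single worker thread, and start Python with the PYTHONHASHSEED environment variable set to a fixed integer. The helper names below are my own and not part of the tutorial. As far as I can tell, the posted build_w2c also only calls build_vocab and never trains the model, so the similarities would come from the randomly initialised vectors; passing the sentences to the constructor builds the vocabulary and trains in one step.

```python
import os


def deterministic_w2v_params(seed=42):
    """Keyword arguments intended to make Word2Vec training repeatable.

    Hypothetical helper: the values are assumptions, not the tutorial's settings.
    """
    return {
        "seed": seed,     # fixed seed for the model's random number generator
        "workers": 1,     # single thread, avoids ordering jitter between threads
        "sg": 1,          # skip-gram, as in the posted code
        "min_count": 1,   # keep every word (assumed intent of min_count=0.0005)
    }


def train_w2v(tokenized_docs, seed=42):
    """Build the vocabulary and train on a list of token lists."""
    # Imported lazily so the other helpers work without gensim installed.
    from gensim.models import Word2Vec

    # Passing `sentences` to the constructor builds the vocabulary and trains.
    return Word2Vec(sentences=tokenized_docs, **deterministic_w2v_params(seed))


def hash_seed_is_fixed():
    """Return True if PYTHONHASHSEED was set to a fixed integer at start-up."""
    return os.environ.get("PYTHONHASHSEED", "").isdigit()
```

The script would then be launched with something like `PYTHONHASHSEED=0 python my_script.py` (the script name is a placeholder), since the hash seed has to be set before the interpreter starts.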

My guess would be that training the model for more epochs might also help; perhaps it just hasn't converged yet.
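
One way to check that the variation really comes from the embedding and not from the coherence code is to run the coherence calculation on fixed vectors. The sketch below is my own simplification and not the tutorial's code: it takes the similarity function as an argument and returns one score per topic, using made-up toy vectors. It also looks like the posted get_coherence loop starts at index 1 (so topic 0 is skipped) and passes the growing term_rankings list, which gives a running mean over the topics seen so far rather than a per-topic value.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_coherence(terms, similarity):
    """Mean pairwise similarity over all term pairs of a single topic."""
    pairs = list(combinations(terms, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def coherence_per_topic(term_rankings, similarity):
    """One coherence score per topic, in the order the topics are given."""
    return [topic_coherence(terms, similarity) for terms in term_rankings]


# Toy embedding (made up for illustration) standing in for a trained model.
toy_vectors = {"cat": (1.0, 0.0), "dog": (1.0, 0.0), "car": (0.0, 1.0)}


def toy_similarity(a, b):
    return cosine(toy_vectors[a], toy_vectors[b])


scores = coherence_per_topic([["cat", "dog"], ["cat", "dog", "car"]], toy_similarity)
```

With a trained gensim model, `model.wv.similarity` could be passed in place of `toy_similarity`. If the scores are stable with fixed vectors but change between runs with the model, the embedding is the source of the variation.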

Best, Timo