bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

how to compute coherence score for a trained LDA model ? #73

Closed lemuria-wchen closed 3 years ago

bab2min commented 4 years ago

Currently, tomotopy does not provide any function for computing topic coherence, so you can either use gensim's CoherenceModel or compute the score manually. I plan to add features similar to gensim's CoherenceModel in the next update, so please use one of these options until then.
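
For the "compute the score manually" option, here is a minimal sketch of one common measure, the UMass coherence of Mimno et al. (2011), which sums log((D(w_m, w_l) + 1) / D(w_l)) over word pairs, where D counts the documents containing the given words. The function name `umass_coherence` and the toy documents are made up for illustration and are not part of tomotopy or gensim. Note also that gensim's `u_mass` aggregates pair scores differently, so absolute values will not match.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, documents):
    """UMass coherence of one topic (sketch, not a library function).

    topic_words: top words of the topic, most probable first.
    documents:   iterable of tokenized documents (lists of strings).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    # combinations() yields index pairs (l, m) with l < m,
    # so w_l is always ranked above w_m in the topic.
    for l, m in combinations(range(len(topic_words)), 2):
        w_l, w_m = topic_words[l], topic_words[m]
        d_l = doc_freq(w_l)
        if d_l == 0:
            # Assumption of this sketch: skip words absent from the corpus.
            continue
        score += math.log((doc_freq(w_m, w_l) + 1) / d_l)
    return score


docs = [
    ["cat", "dog"],
    ["cat", "dog", "fish"],
    ["cat"],
    ["fish", "bird"],
]

coherent = umass_coherence(["cat", "dog"], docs)     # log(3 / 3) = 0.0
incoherent = umass_coherence(["cat", "bird"], docs)  # log(1 / 3), about -1.0986
```

Scores closer to zero mean the topic's top words co-occur more often. With a tomotopy model, the tokenized documents and top words could be extracted the same way as in the gensim-based snippet further down this thread.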

bab2min commented 4 years ago

To do: http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

ZechyW commented 4 years ago

Thank you for looking into this, @bab2min!

@FDU-SDS: In the meantime, here is a small snippet for getting the coherence scores from a tomotopy model via gensim, if it helps:

import collections

import gensim

def get_coherence(
        model, coherence=None, topn=None, window_size=None, processes=None
    ):
    """
    Calculates the coherence score for a given tomotopy model via Gensim's
    `CoherenceModel` pipeline.

    Parameters
    ----------
    model: tomotopy.LDAModel
        The Tomotopy model to get coherence scores for.

    coherence: str, optional
    topn: int, optional
    window_size: int, optional
    processes: int, optional
        All of these parameters are passed directly to 
        `gensim.models.coherencemodel.CoherenceModel`, and the Gensim defaults will 
        apply if they are omitted.

    Returns
    -------
    float
        The coherence score for the model.
    """

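    # Gensim's CoherenceModel uses topn=20 by default, so extract that many
    # words per topic when no explicit value is given.
    top_n = 20 if topn is None else topn

    # Top words of each topic, as plain strings.
    topics = []
    for k in range(model.k):
        word_probs = model.get_topic_words(k, top_n)
        topics.append([word for word, prob in word_probs])

    # Rebuild the tokenized texts and a bag-of-words corpus from the model.
    texts = []
    corpus = []
    for doc in model.docs:
        words = [model.vocabs[token_id] for token_id in doc.words]
        texts.append(words)
        freqs = list(collections.Counter(doc.words).items())
        corpus.append(freqs)

    id2word = dict(enumerate(model.vocabs))
    dictionary = gensim.corpora.dictionary.Dictionary.from_corpus(
        corpus, id2word
    )

    # Forward only the options that were actually given, so that Gensim's own
    # defaults apply to the rest (passing None explicitly would override them).
    options = {
        "window_size": window_size,
        "coherence": coherence,
        "topn": topn,
        "processes": processes,
    }
    options = {key: value for key, value in options.items() if value is not None}

    cm = gensim.models.coherencemodel.CoherenceModel(
        topics=topics,
        texts=texts,
        corpus=corpus,
        dictionary=dictionary,
        **options,
    )

    return cm.get_coherence()

bab2min commented 3 years ago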

As of version 0.10.0, tomotopy includes the module tomotopy.coherence. Please see the example: https://github.com/bab2min/tomotopy/blob/main/examples/coherence.py