adjidieng / ETM

Topic Modeling in Embedding Spaces
MIT License
549 stars, 128 forks

Topic coherence calculation #13

Open JerrryNie opened 4 years ago

JerrryNie commented 4 years ago

Hi, I've read your code and have a question about the functions "get_document_frequency" and "get_topic_coherence" in your utils.py.

def get_document_frequency(data, wi, wj=None):
    if wj is None:
        D_wi = 0
        for l in range(len(data)):
            doc = data[l].squeeze(0)
            if len(doc) == 1: 
                continue
            else:
                doc = doc.squeeze()
            if wi in doc:
                D_wi += 1
        return D_wi
    D_wj = 0
    D_wi_wj = 0
    for l in range(len(data)):
        doc = data[l].squeeze(0)
        if len(doc) == 1: 
            doc = [doc.squeeze()]
        else:
            doc = doc.squeeze()
        if wj in doc:
            D_wj += 1
            if wi in doc:
                D_wi_wj += 1
    return D_wj, D_wi_wj

def get_topic_coherence(beta, data, vocab):
    D = len(data) ## number of docs...data is list of documents
    print('D: ', D)
    TC = []
    num_topics = len(beta)
    for k in range(num_topics):
        print('k: {}/{}'.format(k, num_topics))
        top_10 = list(beta[k].argsort()[-11:][::-1])
        top_words = [vocab[a] for a in top_10]
        TC_k = 0
        counter = 0
        for i, word in enumerate(top_10):
            # get D(w_i)
            D_wi = get_document_frequency(data, word)
            j = i + 1
            tmp = 0
            while j < len(top_10) and j > i:
                # get D(w_j) and D(w_i, w_j)
                D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])
                # get f(w_i, w_j)
                if D_wi_wj == 0:
                    f_wi_wj = -1
                else:
                    f_wi_wj = -1 + ( np.log(D_wi) + np.log(D_wj)  - 2.0 * np.log(D) ) / ( np.log(D_wi_wj) - np.log(D) )
                # update tmp: 
                tmp += f_wi_wj
                j += 1
                counter += 1
            # update TC_k
            TC_k += tmp 
        TC.append(TC_k)
    print('counter: ', counter)
    print('num topics: ', len(TC))
    TC = np.mean(TC) / counter
    print('Topic coherence is: {}'.format(TC))

In "get_topic_coherence", "D_wj" comes from the two-word call "D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])", while "D_wi" comes from the one-word call "get_document_frequency(data, word)". Since "D_wi" and "D_wj" are both single-word document frequencies, I would expect them to be computed the same way, for example with "D_wj = get_document_frequency(data, top_10[j])". The two paths differ in how they treat one-word documents. In the one-word path, a document with "len(doc) == 1" is skipped:

for l in range(len(data)):
    doc = data[l].squeeze(0)
    if len(doc) == 1:
        continue

However, the call "D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])" runs the second half of "get_document_frequency", where a one-word document is not skipped but wrapped in a list:

if len(doc) == 1:
    doc = [doc.squeeze()]

That one-word document then reaches the membership test, so it can still increment "D_wj":

if wj in doc:
    D_wj += 1

So one-word documents are excluded from "D_wi" but can be counted in "D_wj". I don't think this is a consistent way to compute "D_wj", and I suspect the calculation has a bug. Thanks!
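
To make the asymmetry concrete, here is a minimal sketch that mirrors the two counting paths on a toy corpus. It assumes each document is a NumPy array of shape (1, n_tokens); the helper names and the toy data are hypothetical and not part of utils.py.

```python
import numpy as np

def doc_freq_one_word_path(data, wi):
    """Mirrors the wj=None branch: one-token documents are skipped."""
    count = 0
    for doc in data:
        doc = doc.squeeze(0)
        if len(doc) == 1:
            continue
        if wi in doc:
            count += 1
    return count

def doc_freq_two_word_path(data, wj):
    """Mirrors how D_wj is counted in the two-word branch: one-token
    documents are kept (a length-1 array stands in for the list wrapping)."""
    count = 0
    for doc in data:
        doc = doc.squeeze(0)
        if wj in doc:
            count += 1
    return count

# Toy corpus (assumption): word ids stored as arrays of shape (1, n_tokens).
data = [np.array([[3, 7]]), np.array([[7]]), np.array([[3]])]

d_one = doc_freq_one_word_path(data, 7)  # skips the one-token document [7]
d_two = doc_freq_two_word_path(data, 7)  # counts the one-token document [7]
print(d_one, d_two)
```

On this toy input the same word gets two different document frequencies depending on which path counts it, which is the inconsistency described above.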

mona-timmermann commented 3 years ago

Do you know why the paper says they divide by 45 to compute topic coherence?

[Screenshot: Screen Shot 2020-12-10 at 16 38 23]

ahoho commented 3 years ago

@mona-timmermann It's taking the mean. With n = 10 top words per topic there are n*(n-1)/2 = 45 word pairs, so the sum has 45 terms.
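
A quick way to check the pair count, as a sketch: enumerate the unordered pairs (i, j) with i < j, which is the same set the nested loop in "get_topic_coherence" walks over. The function and variable names here are illustrative only.

```python
from itertools import combinations

def num_pairs(n):
    """Number of unordered pairs (i, j) with i < j among n items."""
    return len(list(combinations(range(n), 2)))

pairs_top10 = num_pairs(10)  # pairs among 10 top words
pairs_top11 = num_pairs(11)  # pairs among 11 top words
print(pairs_top10, pairs_top11)
```

One hedged observation: the slice "[-11:]" in the code quoted at the top of this thread selects 11 indices when the vocabulary has at least 11 words, which would give 55 pairs rather than 45.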