bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Is term weighting implemented correctly? #9

Closed — Dobatymo closed this issue 4 years ago

Dobatymo commented 5 years ago

Term weighting is implemented here by simply multiplying the counts by the weights before and after sampling.

https://github.com/bab2min/tomotopy/blob/0c6cb8081dbd2851e0a6c6768fe4732493846b43/src/TopicModel/LDAModel.hpp#L125-L127

However, I don't think you can do that. It is fine for doc.numByTopic, because there the document dependency is still kept. But for both ld counts, the results differ from the implementation described in the paper "Term Weighting Schemes for Latent Dirichlet Allocation" (Eq. 6).

In the paper, the original counts are multiplied by the weights during sampling. In code this should look like the following (using numpy syntax and your variable names), where termweights is a NumberOfDocuments x VocabSize array and ld.numByTopicWord is a NumberOfTopics x VocabSize array (holding counts, not weights):

np.sum(termweights[docid, :] * ld.numByTopicWord[tid, :], axis=-1)  # weighted count of topic tid

I did some testing with my pure Python implementation, and there this expression yields a different result from ld.numByTopic[tid] as maintained by the weight updates in addWordTo.

Note that this should only matter for weighting schemes where the same tokens can have different weights in different documents (like PMI).
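
For example, with a PMI-style scheme the weight of a vocabulary item is computed per document, so the same item can end up with different weights in different documents. Here is a rough sketch with a toy PMI formula (my own illustration, not tomotopy's code):

import numpy as np

def pmi_weights(docs, vocab_size):
    # docs: list of documents, each a list of vocabulary ids
    # returns a NumberOfDocuments x VocabSize array of document-dependent weights
    total = sum(len(d) for d in docs)
    p_term = np.zeros(vocab_size)
    for d in docs:
        for v in d:
            p_term[v] += 1.0 / total
    weights = np.zeros((len(docs), vocab_size))
    for i, d in enumerate(docs):
        for v in set(d):
            p_term_given_doc = d.count(v) / len(d)
            # negative / undefined values clamped to 0 (terms absent from a document keep weight 0)
            weights[i, v] = max(0.0, np.log(p_term_given_doc / p_term[v]))
    return weights

docs = [[0, 6, 3, 9, 1, 5], [9, 8, 8, 7, 1, 0], [2, 4, 4, 5, 0, 1]]
w = pmi_weights(docs, vocab_size=10)
print(w[0, 8], w[1, 8])  # vid=8 occurs twice in doc 1 and never in doc 0, so its weight differs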

bab2min commented 5 years ago

Thank you for your review. Actually, the addWordTo function only increases or decreases the count of a single word in a particular document. inc can only be -1 or 1, and the weight can differ depending on the document doc. So that function can be written in Python like this:

ld.numByTopic[tid] += inc * termweights[docid, word_id]
ld.numByTopicWord[tid, vid] += inc * termweights[docid, word_id]

Since this function is called for different docids, the total sum over ld.numByTopic and ld.numByTopicWord looks like

sum(countByTopicDocWord[tid, docid, vocab_of[docid, word_id]] * termweight[docid, word_id] for docid in all_docs for word_id in words_of[docid])

where countByTopicDocWord is a Topic x Doc x Vocab array (holding counts) and vocab_of is a Doc x Words array (holding the vocabulary id of each word). So I think the weight summation of the topic and topic-word arrays is correct with respect to the paper.
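
As a rough self-check (toy data and topic assignments chosen only for illustration, not the actual implementation), replaying the per-word weighted updates gives the same grand total as summing directly over all documents:

import numpy as np

K, V = 2, 10
docs = [[0, 6, 3, 9, 1, 5], [9, 8, 8, 7, 1, 0]]        # vocabulary ids per document
topics = [[1, 1, 0, 1, 1, 0], [0, 1, 0, 1, 0, 1]]      # a fixed topic assignment for each word
termweights = np.random.rand(len(docs), V)             # per-document weights (e.g. PMI)

# incremental updates, as addWordTo does
numByTopic = np.zeros(K)
for docid, words in enumerate(docs):
    for word_id, tid in zip(words, topics[docid]):
        numByTopic[tid] += termweights[docid, word_id]

# direct weighted sum over all documents and words
total = np.array([
    sum(termweights[docid, word_id]
        for docid, words in enumerate(docs)
        for word_id, tid in zip(words, topics[docid]) if tid == k)
    for k in range(K)])

assert np.allclose(numByTopic, total)  # the grand totals agree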

Dobatymo commented 5 years ago

Yes, I think the totals should match. But during sampling you are not dealing with counts totaled over all words in all docs. Let me give an example.

def addWordToINC(docId, pid, tid, vid):
    # ld, docs, np (numpy), K (number of topics) and V (vocab size) are assumed to exist globally

    # paper term before W*beta in eq. 6 (https://www.aclweb.org/anthology/N10-1070.pdf)
    ld.numByTopicWordCOUNTS[tid, vid] += 1
    Nk = np.sum(docs[docId].wordWeights[None, :] * ld.numByTopicWordCOUNTS, axis=-1) # [None,:] needed for correct broadcasting; summed over the vocabulary, a K-size array remains
    # or the above in plain Python, summing over the whole vocabulary
    Nk = [sum(docs[docId].wordWeights[word_id] * ld.numByTopicWordCOUNTS[k, word_id]
              for word_id in range(V))
          for k in range(K)] # K-size array

    # addWordTo INC
    ld.numByTopic[tid] += docs[docId].wordWeights[vid] # ld.numByTopic is a K-size array; vid is the vocabulary id of the added word

    # now both weight sums should be the same AT ALL TIMES
    assert np.allclose(Nk, ld.numByTopic) # elementwise comparison

    # but they are not after the first document, if the weights depend on the document

I implemented both ways, yours and the paper's as I understand them, and compared them each time a count is added.

Running that with these documents as input gives me output like this:

    # input K=2, vocab size=10
    docs = [
        [0, 6, 3, 9, 1, 5],
        [9, 8, 8, 7, 1, 0],
        [2, 4, 4, 5, 0, 1],
    ]
docId=0, vid=0, tid=1
ld.numByTopicWordCOUNTS:
 [[0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0]]
paper sum [0.         0.48119529]
ld.numByTopic [0.        0.4811953]
--------
docId=0, vid=6, tid=1
ld.numByTopicWordCOUNTS:
 [[0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 0]]
paper sum [0.         1.21683363]
ld.numByTopic [0.        1.2168336]
--------
docId=0, vid=3, tid=0
ld.numByTopicWordCOUNTS:
 [[0 0 0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 0]]
paper sum [0.38005087 1.21683363]
ld.numByTopic [0.38005087 1.2168336 ]
--------
docId=0, vid=9, tid=1
ld.numByTopicWordCOUNTS:
 [[0 0 0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 1]]
paper sum [0.38005087 1.98387423]
ld.numByTopic [0.38005087 1.9838742 ]
--------
docId=0, vid=1, tid=1
ld.numByTopicWordCOUNTS:
 [[0 0 0 1 0 0 0 0 0 0]
 [1 1 0 0 0 0 1 0 0 1]]
paper sum [0.38005087 2.35522387]
ld.numByTopic [0.38005087 2.355224  ]
--------
docId=0, vid=5, tid=0
ld.numByTopicWordCOUNTS:
 [[0 0 0 1 0 1 0 0 0 0]
 [1 1 0 0 0 0 1 0 0 1]]
paper sum [0.91771158 2.35522387]
ld.numByTopic [0.91771156 2.355224  ]
--------
docId=1, vid=9, tid=0
ld.numByTopicWordCOUNTS:
 [[0 0 0 1 0 1 0 0 0 1]
 [1 1 0 0 0 0 1 0 0 1]]
paper sum [1.85772039 2.65714406]
ld.numByTopic [1.2884812 2.355224 ]

You can see that in the last output the docId changes to 1, and now the two arrays are no longer equal. If you do the same analysis for doc.numByTopic, however, you will find that it works.

I cannot really give a good explanation of why it doesn't work; I can only see that the two approaches give different results for the beta part of the sampling equation.

Of course it is possible that I misunderstand either your implementation or the paper, but I have spent some time now on understanding both, and I think the calculations differ. The problem is that calculating this sum at every step increases computation time by a nontrivial amount compared to just adding/subtracting the weights.

But even if your implementation really is different, maybe it still makes sense from a modelling perspective.

bab2min commented 5 years ago

Thank you for kindly attaching an example.

But the last part (docId=1, vid=9, tid=0) of the calculation you shared above looks weird. In the previous step, paper sum is [0.9177, 2.355], and a new word (docId=1, vid=9) has been assigned to tid=0. So I would expect only the first entry of paper sum to increase. But in your paper sum, both the first and the second entry have changed. Why did the second entry change from 2.355 to 2.657?

Did the wordWeights of doc change during sampling?

Dobatymo commented 5 years ago

It makes sense: the wordWeights of doc didn't change, but doc changed, so the associated weights can be completely different. You can see that the wordWeights in the paper's equation depend on docId.

bab2min commented 5 years ago

Oh, I see where the difference is. In the paper, the weight of a word, m(x), is defined as m(x, d) in PMI weighting. Here is Eq. (6):

P(z_i = k \mid z_{-i}, x) \propto \frac{\sum_{w \neq i,\, z_w = k,\, x_w = x_i} m(x_w) + \beta}{\sum_{w \neq i,\, z_w = k} m(x_w) + W\beta} \cdot \Big( \sum_{w \neq i,\, z_w = k,\, d_w = d_i} m(x_w) + \alpha \Big)

In the PMI weighting scheme, Eq. (6) is converted to the following:

P(z_i = k \mid z_{-i}, x) \propto \frac{\sum_{w \neq i,\, z_w = k,\, x_w = x_i} m(x_w, d_w) + \beta}{\sum_{w \neq i,\, z_w = k} m(x_w, d_w) + W\beta} \cdot \Big( \sum_{w \neq i,\, z_w = k,\, d_w = d_i} m(x_w, d_w) + \alpha \Big)

where d_i is the document of word i, and d_w is the document of word w.

Actually the paper only says m(x_i, d), but I think there is an omission there, and the notation d in Eq. (7) should depend on the word x_i. Therefore, when adding up the weights for each topic, it is appropriate to compute the weight of every word with respect to its own document, rather than with respect to the document currently being sampled.

Otherwise, the same word in the same document would get different weights depending on which document is currently being sampled. In that case we would also encounter undefined weights for some words. Consider the word vid=2 in the example you showed: it does not appear at all in the first and second documents, so the PMI of vid=2 is not defined there (we cannot take log(0)). I think that makes no sense.
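
To make the point of contention concrete, here is a rough sketch of the two readings (illustrative names and values only, taken from neither implementation):

# tokens currently assigned to topic k, each remembered together with its document
# (docid, vid) pairs; purely illustrative values
tokens_in_topic_k = [(0, 3), (0, 5), (1, 9)]

def topic_weight_own_document(termweights):
    # my reading: every word keeps the weight from its own document
    return sum(termweights[d, v] for d, v in tokens_in_topic_k)

def topic_weight_sampling_document(termweights, current_doc):
    # the other reading: every word is weighted by the document currently being sampled
    return sum(termweights[current_doc, v] for _, v in tokens_in_topic_k)

The first quantity stays fixed for a given set of assignments, while the second changes whenever the sampler moves to a different document, which is exactly the jump visible at docId=1 in the output above.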

Dobatymo commented 5 years ago

I understand what you mean. The paper is a bit ambiguous here.

But in my understanding d_w doesn't really make sense. A word can be associated with more than one document, so which one is it during sampling? Also, they write that the PMI is clamped to zero, so there wouldn't be any negative values and presumably no undefined ones.

In my opinion, the equation in the paper reads in part: in the context of the document we are currently sampling, we use the weight of the current word in this document, and the counts of topic k summed with the weights that all possible words have in this document.

I will try to contact the authors, maybe they can clarify.

atwilso commented 4 years ago

Hi, I'm one of the authors. It's been a while since we wrote that paper but I have the code for our reference implementation that I can consult. It's open source so I can post snippets whenever useful.

I can tackle a couple of these questions right away. First, we avoided undefined term weights by setting them to 0, on the theory that a term that does not appear in a document contributes nothing to its information content. Second, all of our arithmetic on the arrays used in the Gibbs sampler (theta = term|topic, phi = topic|document) was done purely with term weights instead of counts.

I don't yet understand the rest of your question about PMI. When considering a token (word) of a particular type within a document, we use that type's score with respect to that document. Can you help me get from there to the point of contention?

bab2min commented 4 years ago

Oh, hi! Thank you for visiting this place. I am implementing term weighting based on your paper, and there are some confusing parts in implementing it. I think that when the sum of word weights per topic is calculated, the weight of each word is based on its own document. And Dobatymo thinks the weight of each word is based not on its own document, but on the document currently being sampled. Both seem convincing. It would be very helpful if you could let me know which you intended, or share the piece of code you used in the experiments.

atwilso commented 4 years ago

Very much belated, but hopefully still relevant --

You're right. The paper doesn't adequately address that case. When I wrote the code, I assumed we were using a single weight for each term in the vocabulary. PMI was a last-minute addition.

Let's see here. The trick is to make sure that when we remove a term from a topic, we remove exactly the weight that was added. To do that we need to remember what document the specific instance of that term came from. Fortunately, the Z array in the Gibbs sampler does that for us, so if we've also got a (term x document) -> weight table, we can keep it consistent.

The table of term weights is probably sparse enough that it's worth storing as a hash table of some sort instead of instantiating the entire array.
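
A rough sketch of that bookkeeping (hypothetical names, not our actual code); the sparse table also naturally gives weight 0 for terms that never occur in a document:

from collections import defaultdict
import numpy as np

K, V = 2, 10
weight = defaultdict(float)     # (term, document) -> weight; absent terms default to 0.0
weight[(9, 0)] = 0.77           # toy values; in practice these would be PMI scores
weight[(9, 1)] = 0.30

numByTopic = np.zeros(K)        # weighted "counts" per topic
numByTopicWord = np.zeros((K, V))

def assign(term, doc, topic):
    w = weight[(term, doc)]
    numByTopic[topic] += w
    numByTopicWord[topic, term] += w

def unassign(term, doc, topic):
    # since the Z array tells us which document this token instance came from,
    # we can remove exactly the weight that was added for it
    w = weight[(term, doc)]
    numByTopic[topic] -= w
    numByTopicWord[topic, term] -= w

assign(9, 0, 1)
unassign(9, 0, 1)               # restores both arrays exactly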

bab2min commented 4 years ago

Thank you for your kind reply.

Now everything is clear. Actually, my current implementation stores the (term, document) -> weight table not as a hash table but as an array. Nevertheless, the behaviour is expected to be the same as in your implementation.

I look forward to your continued nice research. Thank you again.

And thanks to @Dobatymo for leading a productive discussion.

Dobatymo commented 4 years ago

@atwilso Thanks for your explanations! @bab2min Yes, I agree that your implementation seems to implement his idea. Thanks for looking into this anyway :)