dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0

Pairwise topic similarity #2

Closed: ramkikannan closed this issue 8 years ago

ramkikannan commented 8 years ago

I am using your project Palmetto to compute pairwise topic similarity using probabilities derived from external Wikipedia data. Currently, the UCI score considers the pairwise similarity of words within the same topic. However, I would like to define pairwise topic similarity as a score between two topics. Assume a given document has only two topics.

topic1 : company, firm, fund, round, ventures
topic2 : power, project, company, india, energy

Currently, your implementation has UCI defined for one topic. That is, UCI_topic1 = 1/20(\sum_i \sum_{j=i+1}^n pmi(w_i,w_j))

PMI(topic1,topic2) = 1/25(\sum_i \sum_j pmi(topic1[i],topic2[j])), where topic1[i] is the ith word from topic1.
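To make the target quantity concrete, here is a minimal sketch of that cross-topic average in plain Java. It does not use Palmetto: the class name `TopicPairPmi`, both method names, and the `wordPmi` callback are hypothetical and stand in for whatever supplies PMI values from the Wikipedia-derived probabilities. The `pmi` helper is the plain, unsmoothed definition, which may differ from what a library computes internally.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper, not part of Palmetto.
final class TopicPairPmi {

    /** Plain PMI from probabilities: log(P(a,b) / (P(a) * P(b))). No smoothing is applied. */
    static double pmi(double pJoint, double pA, double pB) {
        return Math.log(pJoint / (pA * pB));
    }

    /**
     * Averages wordPmi over every (word of topic1, word of topic2) pair,
     * i.e. 1/(|topic1| * |topic2|) * sum_i sum_j pmi(topic1[i], topic2[j]).
     */
    static double averagePmi(String[] topic1, String[] topic2,
                             ToDoubleBiFunction<String, String> wordPmi) {
        if (topic1.length == 0 || topic2.length == 0) {
            throw new IllegalArgumentException("Both topics must contain at least one word.");
        }
        double sum = 0.0;
        for (String w1 : topic1) {
            for (String w2 : topic2) {
                sum += wordPmi.applyAsDouble(w1, w2);
            }
        }
        return sum / (topic1.length * topic2.length);
    }
}
```

For two topics of 5 words each, the denominator is 25, matching the 1/25 factor in the formula above.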

From the code I understand that in the call DirectConfirmationBasedCoherence(new OneOne(), getWindowBasedProbabilityEstimator(10, (WindowSupportingAdapter) corpusAdapter), new LogRatioConfirmationMeasure(), new ArithmeticMean()), I have to replace the first parameter, OneOne, with some other definition that can populate the combinations for pairwise topic similarity. I changed OneOne to OneAll and AllAll, expecting probabilities[0].probabilities.length to be 25, but neither change had any effect, except that the UCI scores for OneAll and AllAll were all zero.

For your reference, I have attached the topic file generated from the Enron dataset. There are 10 topics in total, each with 5 words. Hence, there are 45 pairwise topic similarity scores in total, the lower triangle of the 10x10 pairwise topic similarity matrix. That is, UCI(topic1,topic2), UCI(topic1,topic3), …, UCI(topic1,topic10), UCI(topic2,topic3), …, UCI(topic9,topic10).

Kindly let me know how to compute the pairwise topic similarity scores, i.e., the average of the pairwise PMI scores of words between topics. I appreciate your kind help.

MichaelRoeder commented 8 years ago

Hi Ramki,

Unfortunately, I could not reproduce your problem since the topics you mention are not attached. Thus, I can only give you a more general answer.

There are two possible ways to proceed. First of all, you have to know that even if the system takes all your topics at the same time, it calculates their coherences one after the other. That means that for every topic, the Segmentator, e.g., OneOne, is called for the current topic without seeing the other topics. Thus, the workflow does not directly support the comparison of two topics.

Your first option would be to adapt the workflow to your needs. But this might take a lot of effort.

The second option would be to implement a workaround. For every topic pair that you want to compare, you could create a large topic by concatenating the two single topics, i.e., topic_1={w_11, ..., w_1n} and topic_2={w_21, ..., w_2n} are concatenated to topic_1,2={w_11, ..., w_1n, w_21, ..., w_2n}. Now, you could use these long topics as input for Palmetto and implement a Segmentator that compares every word from the first half of the large topic with every word of the second half and vice versa. The following implementation should do this.

import org.aksw.palmetto.data.SegmentationDefinition;
import org.aksw.palmetto.subsets.Segmentator;

import com.carrotsearch.hppc.BitSet;

/**
 * Simple example of a {@link Segmentator} that gets a word set comprising two
 * topics and creates a {@link SegmentationDefinition} with which every word of
 * one of the topics is compared to every other word of the other topic.
 * <b>Note</b> that the word set size passed to this Segmentator has to be even!
 * 
 * @author Michael R&ouml;der (roeder@informatik.uni-leipzig.de)
 *
 */
public class PairwiseTopicComparingSegmentator implements Segmentator {

    @Override
    public SegmentationDefinition getSubsetDefinition(int wordsetSize) {
        // we assume, that the word set contains two topics. Thus, the number
        // has to be even
        if ((wordsetSize % 2) != 0) {
            throw new IllegalArgumentException(
                    "Got a word set size that is not even. Thus, it can not contain two topics that have an equal length.");
        }
        /*
         * Code the combinations of elements not with ids but with bits. 01 is
         * only the first element, 10 is the second and 11 is the combination of
         * both.
         */
        int singleTopicSize = wordsetSize / 2;
        int secondTopicLowestBit = 1 << singleTopicSize;
        int conditions[][] = new int[wordsetSize][singleTopicSize];
        int segments[] = new int[wordsetSize];
        int condBit, condPos, bit = 1, pos = 0;
        int mask = (1 << wordsetSize) - 1;
        BitSet neededCounts = new BitSet(1 << wordsetSize);
        while (bit < mask) {
            segments[pos] = bit;
            neededCounts.set(bit);
            condPos = 0;
            // if this is a word of the first topic
            if (pos < singleTopicSize) {
                condBit = secondTopicLowestBit;
                while (condBit < mask) {
                    neededCounts.set(bit + condBit);
                    conditions[pos][condPos] = condBit;
                    ++condPos;
                    condBit = condBit << 1;
                }
            } else {
                condBit = 1;
                while (condBit < secondTopicLowestBit) {
                    neededCounts.set(bit + condBit);
                    conditions[pos][condPos] = condBit;
                    ++condPos;
                    condBit = condBit << 1;
                }
            }
            bit = bit << 1;
            ++pos;
        }
        return new SegmentationDefinition(segments, conditions, neededCounts);
    }

    @Override
    public String getName() {
        return "one-topic";
    }
}
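The concatenation step of the workaround can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class `TopicPairBuilder` and its method `concatenatePairs` are hypothetical names, not part of Palmetto, and the sketch assumes all topics have the same number of words, since the Segmentator above requires two halves of equal length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Palmetto.
final class TopicPairBuilder {

    /**
     * Builds one concatenated word set per unordered topic pair (i < j).
     * For k topics this yields k * (k - 1) / 2 rows, e.g. 45 rows for 10 topics.
     */
    static String[][] concatenatePairs(String[][] topics) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < topics.length; i++) {
            for (int j = i + 1; j < topics.length; j++) {
                if (topics[i].length != topics[j].length) {
                    throw new IllegalArgumentException(
                            "Topics " + i + " and " + j + " differ in length.");
                }
                int n = topics[i].length;
                String[] merged = new String[2 * n];
                // First half: words of topic i; second half: words of topic j.
                System.arraycopy(topics[i], 0, merged, 0, n);
                System.arraycopy(topics[j], 0, merged, n, n);
                pairs.add(merged);
            }
        }
        return pairs.toArray(new String[0][]);
    }
}
```

Each row could then be scored by a coherence that uses the Segmentator above in place of OneOne, keeping the other constructor arguments from the question. Assuming the aggregation weights all segment/condition pairs equally and the confirmation measure is symmetric, as plain PMI is, the mean over the 2 x 25 ordered word pairs equals the 1/25 average over the cross-topic pairs.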

Does this answer your question?

Cheers, Michael

ramkikannan commented 8 years ago

Yes. It works.