MilaNLProc / contextualized-topic-models

A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License
1.21k stars · 146 forks

Optimizing the number of topics and filtering unwanted word categories #102

Closed amirmohammadkz closed 2 years ago

amirmohammadkz commented 2 years ago

Description

I want to optimize the number of topics, and to do that, I implemented the External Word Embeddings Topic Coherence (alpha) metric that you recommended (https://github.com/MilaNLProc/contextualized-topic-models/issues/93#issuecomment-949579725) and introduced in your paper. The scores in the picture are average cosine distances, so I pick the model with the lowest score.

Regarding the result I got from the model, I have some problems:

  1. Is there any way to omit words such as conjunctions, helping verbs, etc., which do not give us much information about the topic, in both the training and prediction phases? I know that after training I can get more than 5 words per topic and filter the unwanted word categories, but in that case the model still distinguishes topics based on the unwanted words.

  2. In some cases, bi-grams are needed for the result to make sense. Suppose I get "get" in one of the word classes. It might be part of "get up", "get on", or other phrasal verbs whose meanings are completely different from the solitary "get". Is there any way to resolve this issue?

What I Did

This is the topic coherence calculator I implemented for the word classes returned by Kitty:

import numpy as np
import gensim.downloader as api
from scipy.spatial.distance import cosine


def compute_topic_coherence(word_classes, wv=None):
    def word_list2vecs(wv, word_list):
        # Look up each topic word in the external embedding model,
        # skipping out-of-vocabulary words.
        vecs = []
        for word in word_list:
            try:
                vecs.append(wv[word])
            except KeyError:
                continue
        return vecs

    def compute_external_coherence(vec_list):
        # Average pairwise cosine distance between the word vectors of one topic.
        cos_list = []
        for i in range(len(vec_list)):
            for j in range(i + 1, len(vec_list)):
                cos_list.append(cosine(vec_list[i], vec_list[j]))
        return np.mean(cos_list)

    if wv is None:
        wv = api.load('word2vec-google-news-300')

    vec_classes = [word_list2vecs(wv, word_class) for word_class in word_classes]

    # Average the per-topic scores to get a single score for the model.
    cos_list = [compute_external_coherence(vec_class) for vec_class in vec_classes]
    return np.mean(cos_list)

And this is what I got after testing [3,5,7,9,11] topics:

[Screenshots: average cosine distance scores for each candidate number of topics.]
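
A minimal usage sketch of the function above (the word lists and topic counts are illustrative placeholders, not the actual Kitty output):

# Illustrative word classes for two candidate models (placeholders only).
candidate_word_classes = {
    3: [["dog", "cat", "pet"],
        ["car", "bus", "train"],
        ["rain", "snow", "storm"]],
    5: [["dog", "cat", "pet"],
        ["car", "bus", "train"],
        ["rain", "snow", "storm"],
        ["bread", "milk", "cheese"],
        ["music", "guitar", "song"]],
}

# Load the external embeddings once and reuse them for every candidate model.
wv = api.load('word2vec-google-news-300')

scores = {n_topics: compute_topic_coherence(word_classes, wv=wv)
          for n_topics, word_classes in candidate_word_classes.items()}

# Lower average cosine distance means more coherent topics, so take the minimum.
best_n_topics = min(scores, key=scores.get)
print(scores, best_n_topics)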

vinid commented 2 years ago

Hello @amirmohammadkz!

You would need to clean the documents in advance. If you look here, Kitty allows you to input a list of stopwords to remove.

You can use gensim's Phrases to create the bigrams, but the bigrammed text is going to be used by the embedding model and this might be a suboptimal option. However, it is definitely worth a try. You could also manually train a ZeroShotTM model (with your custom pre-processing) and then initialize a Kitty object with it.

Let me know if this helps!
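
A rough sketch of the first suggestion, assuming the Kitty train call accepts a stopwords_list argument (the import path and keyword arguments here are assumptions and may differ across versions; check the current docs):

import nltk
from nltk.corpus import stopwords
from contextualized_topic_models.models.kitty_classifier import Kitty  # import path is an assumption

nltk.download("stopwords")
documents = ["raw training document one ...", "raw training document two ..."]  # placeholder corpus

kt = Kitty()
# topics and stopwords_list are assumed keyword arguments based on the discussion above.
kt.train(documents, topics=5, stopwords_list=stopwords.words("english"))

print(kt.pretty_print_word_classes())  # name taken from the package examples; verify before use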

amirmohammadkz commented 2 years ago

Hello @vinid,

Thanks for answering my question.

You would need to clean the documents in advance. If you look here, Kitty allows you to input a list of stopwords to remove.

Oh, I see. Based on the Kitty example, since the nltk stopwords are downloaded there, I assumed the model used them by itself.

You can use gensim's Phrases to create the bigrams, but the bigrammed text is going to be used by the embedding model and this might be a suboptimal option. However, it is definitely worth a try. You could also manually train a ZeroShotTM model (with your custom pre-processing) and then initialize a Kitty object with it.

So, let me ask some questions for clarification. The Kitty model uses ZeroShotTM, correct? So the result of running the Kitty example and the ZeroShotTM example would be the same (since the same random seed is set for both), correct?

If I use the bi-grams, the SentenceBERT model will not detect them as known tokens, correct?

And by manually training a ZeroShotTM with custom preprocessing, I think you are suggesting that I implement something like WhiteSpacePreprocessing. Correct?

Thanks in advance for the further clarification.

vinid commented 2 years ago

Oh, I see. Based on the Kitty example, since the nltk stopwords are downloaded there, I assumed the model used them by itself.

Yeah, you are right, the doc is not updated; I need to fix this. If you look at the Colab, you'll see that we pip install version 2.2.0 and not 2.2.1, which is the one that has this fix.

The Kitty model uses ZeroShotTM, correct? So the result of running the Kitty example and the ZeroShotTM example would be the same (since the same random seed is set for both), correct?

Yep, that's right! If you open the Kitty code you'll see it's just a wrapper over ZeroShotTM (you can build a custom Kitty with a few lines of code).

If I use the bi-grams, the SentenceBERT model will not detect them as known tokens, correct?

Yes, unfortunately, that's the issue. This is why I suggest the "custom" approach.
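
For reference, a minimal sketch of the gensim Phrases route mentioned above (the corpus and thresholds are illustrative):

from gensim.models import Phrases

# Illustrative tokenized corpus; in practice these would be your documents.
tokenized_docs = [
    ["i", "get", "up", "early", "every", "morning"],
    ["she", "will", "get", "up", "and", "get", "on", "the", "bus"],
    ["they", "get", "on", "well", "with", "each", "other"],
]

# min_count and threshold are illustrative; tune them on the real corpus.
bigram_model = Phrases(tokenized_docs, min_count=1, threshold=1)

# Frequent pairs such as "get up" become single tokens like "get_up".
bigrammed_docs = [" ".join(bigram_model[doc]) for doc in tokenized_docs]
print(bigrammed_docs)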

And by manually training a ZeroShotTM with custom preprocessing, I think you are suggesting that I implement something like WhiteSpacePreprocessing.

Exactly, that's the best way to implement your own custom behavior
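
A rough sketch of that route, loosely following the package README (the embedding model name, contextual_size, and the exact preprocessing steps are assumptions that may differ across versions):

from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# Raw documents go to the contextual embedding model unchanged; the
# custom-preprocessed versions (stopwords removed, bigrams joined, etc.)
# are used only for the bag-of-words side.
raw_docs = ["I get up early every morning.", "She will get on the bus."]  # placeholder corpus
preprocessed_docs = ["get_up early morning", "get_on bus"]                # your own preprocessing

tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v2")
training_dataset = tp.fit(text_for_contextual=raw_docs, text_for_bow=preprocessed_docs)

# contextual_size must match the sentence-embedding dimension of the chosen model.
ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=768, n_components=5)
ctm.fit(training_dataset)

print(ctm.get_topic_lists(5))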

Happy to accept a pull request to address this issue if you have time to write some code for this :)