Leveraging BERT and c-TF-IDF to create easily interpretable topics.
About Coherence of topic models #90

nadiafelix opened 3 years ago

nadiafelix commented 3 years ago

Currently, I am calculating the coherence of a BERTopic model using gensim. For this I need the n-grams from each text of the corpus. Is it possible? The function used by gensim expects the corpus and topics, and the topics are tokens that must exist in the corpus.

cm = CoherenceModel(topics, corpus, dictionary, coherence='u_mass')

Thanks in advance.

MaartenGr commented 3 years ago

I believe you should be using the CountVectorizer for creating the corresponding corpus and dictionary when creating the CoherenceModel.

nadiafelix commented 3 years ago

@MaartenGR thanks a lot for your attention. I am trying this, but I found a sentence in the topics set that doesn't exist in the dictionary. Is that OK? Do all the topics exist in the n-grams?

The used code is this:

import nltk
import numpy as np
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')

# (2, 20) is the same n-gram range as the topics
cv = CountVectorizer(ngram_range=(2, 20))
cv_fit = cv.fit_transform(comentariosList)

# Collect the n-grams found in each document
texts = []
for i in range(len(comentariosList)):
    temp = np.array(cv.inverse_transform(cv_fit.getrow(i))).tolist()
    texts = texts + temp

topics = topics_df['Keywords'].values.tolist()

cm = CoherenceModel(topics=topics, corpus=corpus, dictionary=dictionary, coherence='u_mass')
cm.get_coherence_per_topic()

Thanks for your help.

MaartenGr commented 3 years ago

You should focus on what you put into the corpus and dictionary variables as the topics are checked against those two. At the moment, I cannot see how you have constructed them but I would advise you to look into those.
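
One way to act on this advice is to check, before building the CoherenceModel, which topic terms never occur among the tokens the dictionary is built from. The sketch below is plain Python and does not call gensim; missing_topic_terms and the toy data are hypothetical names used only for illustration.

```python
def missing_topic_terms(topics, tokenized_docs):
    """Return, per topic index, the topic terms absent from the token vocabulary."""
    vocabulary = {token for doc in tokenized_docs for token in doc}
    missing = {}
    for index, terms in enumerate(topics):
        absent = [term for term in terms if term not in vocabulary]
        if absent:
            missing[index] = absent
    return missing


# Toy data: documents tokenized as unigrams, while one topic contains a bigram.
tokenized_docs = [["topic", "models", "are", "useful"],
                  ["coherence", "scores", "vary"]]
topics = [["topic", "models"], ["coherence scores", "vary"]]

print(missing_topic_terms(topics, tokenized_docs))
```

Any term reported here would be unknown to a dictionary built from the same tokens, which is the mismatch described above.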

nadiafelix commented 3 years ago

Do you have any recommendations for working with this n_gram_range parameter?

topic_model = BERTopic (verbose = True, embedding_model = embedder, n_gram_range = (1,3), calculate_probabilities = True)

MaartenGr commented 3 years ago

I believe it is best to make sure that the Countvectorizer in Bertopic should be the same as you used to create the dictionary, corpus and tokens.

You could also try accessing the Countvectorizer directly in Bertopic by using model.vectorizer_model. That way, you do not have to create different instances that might not match exactly.

If this still does not work let me know!

I would suggest that instead of creating n_grams of the corpus, you can simply split the n_grams of the topics and flatten them to have a list of single words (unigram) so that you can perform gensim CoherenceNPM scores without having to create the n_grams of text.

nadiafelix commented 3 years ago

First of all, Thank you for your attention.
When I try to use the vectorizer_model from Bertopic we have this error:

      1 corpus = ['This is the first document.','This document is the second document.','And this is the third one.','Is this the first document?',]
----> 2 cv = topic_model.vectorizer_model()
      3
      4 X = cv.fit_transform(corpus)

TypeError: 'CountVectorizer' object is not callable

Hi Amine-OMI, thank you for your tips. Do you have some example of gensim CoherenceNPM?

Thanks a lot for your attention.

MaartenGr commented 3 years ago

You should access the vectorizer model like this: cv = topic_model.vectorizer_model. Since it is already fitted you can use something like cv.get_feature_names() and tokenizer = cv.build_tokenizer() to get the words and tokenizer used for constructing the dictionary and corpus.

Viole-Grace commented 3 years ago

Hey! Use it as such:

cv = topic_model.vectorizer_model
X = cv.fit_transform(docs)
doc_tokens = [text.split(" ") for text in docs]

import gensim.corpora as corpora
id2word = corpora.Dictionary(doc_tokens)
texts = doc_tokens
corpus = [id2word.doc2bow(text) for text in texts]

topic_words = []
for i in range(len(topic_model.get_topic_freq())-1):
  # Keep only the words of each topic, dropping their scores
  interim = [t[0] for t in topic_model.get_topic(i)]
  topic_words.append(interim)

from gensim.models.coherencemodel import CoherenceModel

coherence_model = CoherenceModel(topics=topic_words, texts=texts, corpus=corpus, dictionary=id2word, coherence='c_v')

I would suggest that instead of creating n_grams of the corpus, you can simply split the n_grams of the topics and flatten them to have a list of single words (unigram) so that you can perform gensim CoherenceNPM scores without having to create the n_grams of text.

Hi Amine-OMI, thank you for your tips. Do you have some example of gensim CoherenceNPM?

Thanks a lot for your attention.

Hey, sorry for the late reply, here's the process if you're still working on it:

Once you have extracted the topics from the corpus, you may have bigrams in the list of top words of each topic, so you need to split them and flatten the list to get a list of unigrams at the end.
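
The splitting and flattening step can be sketched as follows. This is a minimal illustration rather than the poster's actual code: flatten_to_unigrams is a hypothetical helper, the topics are toy data, and removing repeated words is an assumption made here to keep each topic's word list free of duplicates.

```python
def flatten_to_unigrams(topic_terms):
    """Split every n-gram on whitespace and keep each word once, in order."""
    seen = set()
    unigrams = []
    for term in topic_terms:
        for word in term.split():
            if word not in seen:
                seen.add(word)
                unigrams.append(word)
    return unigrams


# Toy topics whose top words mix unigrams and bigrams.
topics = [["machine learning", "learning", "neural network"],
          ["stock market", "market price"]]

flatten_unigrams = [flatten_to_unigrams(topic) for topic in topics]
print(flatten_unigrams)
```

The resulting list of lists has the shape that the topics argument in the snippet further down expects.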

After that you can use Gensim's topic coherence as described in this link

And you can use one of the following coherence measures: {'u_mass', 'c_v', 'c_uci', 'c_npmi'}.

from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
# Create the dictionary of the input corpus (a list of tokenized documents)
id2word = Dictionary(corpus)
npmi = CoherenceModel(texts=corpus, dictionary=id2word,
                       topics=flatten_unigrams, coherence='c_v')

I hope this helps you

MaartenGr commented 3 years ago

The following steps should be the correct ones in calculating the coherence scores. Some additional preprocessing is necessary since there is a very small part of that in BERTopic. Also, make sure to build the tokens with the exact same tokenizer as used in BERTopic.

I do want to stress that metrics such as c_v and c_npmi are merely proxies for a topic model's performance. They are by no means a ground truth and can have significant issues (e.g., sensitive to the number of words in a topic). So whether you find a low or high score, I would advise you to look at the topics yourself and see if they make sense to you.

import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Preprocess documents
cleaned_docs = topic_model._preprocess_text(docs)

# Extract vectorizer and tokenizer from BERTopic
vectorizer = topic_model.vectorizer_model
tokenizer = vectorizer.build_tokenizer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names()
tokens = [tokenizer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model.get_topic(topic)] 
               for topic in range(len(set(topics))-1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = coherence_model.get_coherence()

nadiafelix commented 3 years ago

Hello MaartenGr, I tried to execute this, but the problem is the tokenizer. My BERTopic model produced topics with n-grams from 1 to 10, whereas the tokenizer here produces tokens with only one term (1-grams). When I use n_gram_range=(1,1), like this: topic_model = BERTopic(verbose=True, embedding_model=embedder, n_gram_range=(1,1), calculate_probabilities=True), I get a coherence value, which in this case was 0.1725 for c_v, -0.2662 for c_npmi, and -8.5744 for u_mass.
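
The mismatch can be reproduced without BERTopic. The sketch below imitates the idea of word n-gram analysis in plain Python; word_ngrams is a hypothetical helper and not scikit-learn's implementation. It shows that a multi-word topic term can only be matched when the tokens themselves include n-grams.

```python
def word_ngrams(tokens, ngram_range):
    """Build all word n-grams for the sizes in ngram_range (inclusive)."""
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "topic models need evaluation".split()
unigrams = word_ngrams(tokens, (1, 1))
up_to_bigrams = word_ngrams(tokens, (1, 2))

topic_term = "topic models"
print(topic_term in unigrams)       # the bigram is not among the unigram tokens
print(topic_term in up_to_bigrams)  # it is found once bigrams are generated
```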

MaartenGr commented 3 years ago

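Good catch, I did not test for higher n-grams in the example. I made two changes:

  • Used the build_analyzer() instead of build_tokenizer() which allows for n-gram tokenization
  • Preprocessing is now based on a collection of documents per topic, since the CountVectorizer was trained on that data

Tested it with several ranges of n-grams and it seems to work now.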

import pandas as pd
from bertopic import BERTopic
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

topic_model = BERTopic(verbose=True, n_gram_range=(1, 3))
topics, _ = topic_model.fit_transform(docs)

# Preprocess Documents
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model.get_topic(topic)] 
               for topic in range(len(set(topics))-1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = coherence_model.get_coherence()

nadiafelix commented 3 years ago

Great! Thanks a lot!

YuanyuanLi96 commented 2 years ago

Hi Maarten, thanks for the code for calculating the coherence score. I am wondering which parameters I can tune using the coherence score. I tried min_topic_size = 10, 7, and 5, and it seems the coherence score increases as min_topic_size decreases. But it doesn't make sense to me to reduce min_topic_size any further.

Does the coherence score always decrease as min_topic_size is reduced (the number of topics seems to increase)? And which other parameters do you recommend tuning for a small dataset (about 1000 sentences)?

MaartenGr commented 2 years ago

@YuanyuanLi96 In general, I would not advise you to use this coherence score to fine-tune BERTopic. These metrics are merely proxies for a topic model's performance. They are by no means a ground truth and can have significant issues (e.g., sensitive to the number of words in a topic). So whether you find a low or high score, I would advise you to look at the topics yourself and see if they make sense to you.

Having said that, by reducing min_topic_size the total amount of topics increases which simply leads to more information depending on the coherence metric used.

When it comes to tuning a small dataset, I would focus on keeping a logical min_topic_size of at least 20 since topics should contain sufficient documents. Moreover, with 1000 sentences, you can question whether a topic modeling technique is actually necessary.

YuanyuanLi96 commented 2 years ago

@MaartenGr Thanks for your explanation and suggestion! I set min_topic_size=20, and I get 16 mostly interpretable topics for my data. So I will go with this, since it performs better than other models and reduces our manual work in the long term. Thanks for this amazing package!

TomNachman commented 2 years ago

Hi @MaartenGr, regarding the conversation here and your reply to YuanyuanLi96: currently, the only measurements I have found for evaluating a topic model are coherence (UMass, NPMI, etc.) and perplexity scores, both of which have their downsides. Besides human judgement (as you said, "I would advise you to look at the topics yourself and see if they make sense to you"), is there any other measurement you would suggest?

In short: if I have an LDA model and a BERTopic model trained on the same data, with the same number of topics for both, how would I know which is more accurate?

MaartenGr commented 2 years ago

@TomNachman There are a few things that are important here.

What is the definition of "accurate". Is that topic coherence? Quality (density or separation) of clusters? Predictive power? Distribution of topics? Etc. Defining accuracy or quality first is important in knowing if one topic model is better than another. What the best metric to use is highly depends on your use case but it seems that in literature npmi is mostly used together with topic diversity. These metrics are typically used to evaluate the coherence and diversity of topic modeling techniques.
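
To make the NPMI idea concrete, here is a simplified sketch that scores a topic from document-level co-occurrence counts. It is an assumption-laden illustration: document_npmi and topic_npmi are hypothetical helpers, and library implementations such as gensim's c_npmi estimate the probabilities differently (for example with sliding windows), so the values will not match theirs.

```python
import math
from itertools import combinations


def document_npmi(word_a, word_b, tokenized_docs):
    """NPMI of two words, with probabilities estimated as document frequencies."""
    total = len(tokenized_docs)
    doc_sets = [set(doc) for doc in tokenized_docs]
    p_a = sum(word_a in doc for doc in doc_sets) / total
    p_b = sum(word_b in doc for doc in doc_sets) / total
    p_ab = sum(word_a in doc and word_b in doc for doc in doc_sets) / total
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 0.0   # NPMI is undefined here; returning 0 is a choice made in this sketch
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_npmi(topic_words, tokenized_docs):
    """Average NPMI over all word pairs of one topic (needs at least two words)."""
    pairs = list(combinations(topic_words, 2))
    if not pairs:
        raise ValueError("a topic needs at least two words")
    return sum(document_npmi(a, b, tokenized_docs) for a, b in pairs) / len(pairs)


# Toy corpus: "cat" and "dog" always appear together, "cat" and "stock" never do.
docs = [["cat", "dog"], ["cat", "dog"], ["stock", "market"], ["stock", "market"]]
print(topic_npmi(["cat", "dog"], docs))
print(topic_npmi(["cat", "stock"], docs))
```

Scores close to 1 indicate words that nearly always co-occur, and scores close to -1 indicate words that never do.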

Moreover, I am often very hesitant when it comes to recommending a coherence metric to use. You can quickly overfit on such a metric when tuning the parameters of BERTopic (or any other topic modeling technique) which in practice might result in poor performance. In other words, I want to prevent users from solely focusing on grid-searching parameters and motivate users to look at the results.

Having said that, that does not mean that these metrics cannot be used! They are extremely useful in the right circumstances. So when you want to compare topic models, definitely use these kinds of metrics (e.g., npmi) but make sure the circumstances make sense. For example, they need to have the same number of topics and the same number of words need to be in those topics. If you were to change how the data were to be preprocessed, are you then objectively evaluating the difference in performance between topic modeling techniques?

I want to end with a great package for evaluating your topic model, namely OCTIS. It has many evaluation measures implemented aside from the standard coherence metrics, such as topic diversity, similarity, and classification metrics. I would advise choosing an evaluation metric there that best suits your use case.

PoonooP commented 2 years ago

The following steps should be the correct ones in calculating the coherence scores. Some additional preprocessing is necessary since there is a very small part of that in BERTopic. Also, make sure to build the tokens with the exact same tokenizer as used in BERTopic.

I do want to stress that metrics such as c_v and c_npmi are merely proxies for a topic model's performance. They are by no means a ground truth and can have significant issues (e.g., sensitive to the number of words in a topic). So whether you find a low or high score, I would advise you to look at the topics yourself and see if they make sense to you.

import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Preprocess documents
cleaned_docs = topic_model._preprocess_text(docs)

# Extract vectorizer and tokenizer from BERTopic
vectorizer = topic_model.vectorizer_model
tokenizer = vectorizer.build_tokenizer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names()
tokens = [tokenizer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model.get_topic(topic)] 
               for topic in range(len(set(topics))-1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = coherence_model.get_coherence()

Hello Maarten, I tried to execute this code, but it just gave me the "raise ValueError('unable to interpret topics either a list of tokens or a list of ids')
ValueError: unable to interpret topic as either list of tokens or a list of ids"

I was tuning the hyperparameters top_n_words and min_topic_size. I basically use the above code as a function to evaluate my topic model quality. It seems that the code does not work for a certain set of values of the two parameters(in my case, it's top_n_words = 5 and min_topic_size =28), while it managed to provide the coherence score for the rest of the pairs.

It's even more peculiar because I'd executed the same thing the other day and there was no issue. The only difference here is that I used a different set of data, although it was preprocessed similarly and had an identical structure.

MaartenGr commented 2 years ago

It might be worthwhile to check the differences in output between the output variables for your two sets of data (e.g., topic_words, corpus, etc.). If all parameters are the same but the only thing you changed is the data, then there might be something happening with the results that you get from training on that data. So checking things like the topics and their representation might help you understand what is happening there. For example, it might be the case that you have too few topics generated for it to calculate the coherence.
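
A small diagnostic can make such a comparison easier. The sketch below lists structural problems in a topic_words value before it is handed to a coherence model; describe_topic_words is a hypothetical helper, and the checks reflect only the shape discussed in this thread (a list of lists of non-empty strings).

```python
def describe_topic_words(topic_words):
    """Collect human-readable problems found in a topics argument."""
    problems = []
    if not topic_words:
        problems.append("no topics at all")
    for index, words in enumerate(topic_words):
        if not isinstance(words, (list, tuple)):
            problems.append(f"topic {index}: expected a list, got {type(words).__name__}")
            continue
        if not words:
            problems.append(f"topic {index}: empty word list")
        for word in words:
            if not isinstance(word, str):
                problems.append(f"topic {index}: non-string entry {word!r}")
            elif not word.strip():
                problems.append(f"topic {index}: empty string entry")
    return problems


# Toy input with one empty string and one empty topic.
example = [["cat", "dog"], ["", "market"], []]
for problem in describe_topic_words(example):
    print(problem)
```

Running it on the topic_words from both datasets shows whether the failing run contains entries that the working run does not.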

hwrightson commented 2 years ago

Good afternoon Maarten,

Thank you very much for pulling this together. I recognise that the coherence score isn't necessarily the best option to determine accuracy, but it's a useful proxy to consider. Having taken a brief look at the code, I've noticed that:

words = vectorizer.get_feature_names()

Isn't referred to elsewhere in the code, can this line be omitted or does it serve a further purpose?

Thanks in advance, H

MaartenGr commented 2 years ago

@hwrightson You are completely right! It is definitely a useful proxy to consider when validating your model. NPMI, for example, has shown promise in emulating human performance (1). A topic coherence score in conjunction with visual checks definitely prevents issues later on.

Isn't referred to elsewhere in the code, can this line be omitted or does it serve a further purpose?

Good catch, I might have used it for something else whilst testing out calculating coherence scores. So yes, you can omit that line!

drob-xx commented 2 years ago

@MaartenGr I've been delving into model evaluation and, at your suggestion, am using OCTIS. In my first set of experiments I compared the OCTIS metrics for topic diversity, inverted rbo, and npmi coherence. The results I got for inverted rbo seem promising, the others noisy. As you've clearly explained the choice of metric is highly dependent on the use case. I've begun looking for resources for more information on topic model evaluation metrics and am wondering if you have any suggestions? Two papers I found helpful were A review of topic modeling methods and Measuring LDA topic stability from clusters of replicated runs. As you know OCTIS contains over twenty different metrics. Some I'm familiar with, but most not. As far as I can tell they don't provide references for their implementations. Thanks as always in advance!

P.S. Of course right after writing this I remembered that I hadn't gone back to the paper the OCTIS people wrote OCTIS: Comparing and Optimizing Topic models is Simple!!. So anything you suggest that is not referenced there would be super.

MaartenGr commented 2 years ago

@drob-xx Great to hear that you have been working with OCTIS! You might have already seen it, but aside from in the paper itself, some of the references to the evaluation metrics can be found here.

The field of evaluation metrics is a tricky one, there are many different use cases for topic modeling techniques, and topic modeling, by nature, is a subjective method that is often reflected in the evaluation metrics. Over the last years, there have been several papers describing the pros and cons of these metrics:

  • Lau, Jey Han; Newman, David; Baldwin, Timothy. "Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality." Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics.
  • Mimno, David; Wallach, Hanna; Talley, Edmund; Leenders, Miriam; McCallum, Andrew. "Optimizing semantic coherence in topic models." Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.
  • Röder, Michael; Both, Andreas; Hinneburg, Alexander. "Exploring the space of topic coherence measures." Proceedings of the Eighth ACM International Conference on Web Search and Data Mining.
  • O'Callaghan, Derek; Greene, Derek; Carthy, Joe; Cunningham, Pádraig. "An analysis of the coherence of descriptors in topic modeling." Expert Systems with Applications.

P.S. Of course right after writing this I remembered that I hadn't gone back to the paper the OCTIS people wrote OCTIS: Comparing and Optimizing Topic models is Simple!!. So anything you suggest that is not referenced there would be super.

That has happened to me more times than I would like to admit! The metrics that you find in the paper and in OCTIS are, at least in my experience, the most common metrics that you see in academia. Especially NPMI and Topic Diversity are frequently used metrics as a proxy of the "quality" of these topic modeling techniques.
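
Topic diversity is simple enough to sketch directly: it is commonly defined as the share of unique words among the top-k words of all topics. The helper below is a minimal plain-Python illustration with hypothetical names and toy data, not the OCTIS implementation.

```python
def topic_diversity(topic_words, top_k=10):
    """Fraction of unique words among the top_k words of every topic."""
    top_words = [word for words in topic_words for word in words[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy topics: "price" appears in both, so 5 of the 6 top words are unique.
topics = [["stock", "market", "price"], ["house", "price", "rent"]]
print(topic_diversity(topics, top_k=3))
```

A value of 1.0 means no word is shared between topics, while values near 0 indicate heavily overlapping topics.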

One thing that might be interesting to look at is clustering metrics. Essentially, BERTopic is a clustering algorithm with a topic representation on top. The assumption here is that good clusters lead to good topic representations. Thus, in order to have a good model, you will need good clusters. You can find some of these metrics here but be aware that some of these might need labels to judge the quality of the generated clusters.

juli-sch commented 2 years ago

Hello Maarten, I would also like to include Octis in my evaluation of BERTopic's findings. If I understand you correctly in Issues #144 and #331, the following lines should give me the topic-word-matrix I need for Octis:

topic_word_matrix = topic_model.c_tf_idf.toarray()
topic_word_matrix = np.delete(topic_word_matrix, obj=0, axis=0)

Is that correct?

When I initialise BERTopic with topic_diversity=None MMR is not used and the c-TF-IDF then is fully representative of the topic representation. Is this assumption correct?

Many thanks in advance for the help

MaartenGr commented 2 years ago

@juli-sch Yes, you can use topic_model.c_tf_idf to be used as the topic-word matrix. Do note, that you only need to use the topic-word matrix for topic significance I believe and it is not necessary for calculating topic coherence scores. For those, you only need the top n words per topic.

Also, make sure not to use the -1 topic as that strictly is not a topic.
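
A sketch of selecting topics by ID rather than by position may help here. The snippets in this thread iterate over range(len(set(topics))-1), which relies on the -1 outlier topic being present and on the remaining IDs being contiguous. The version below filters on the ID itself; real_topic_ids and collect_topic_words are hypothetical helpers, and fake_topics stands in for a fitted model's get_topic method.

```python
def real_topic_ids(assigned_topics):
    """Sorted topic IDs that occur in the assignments, excluding the -1 outlier topic."""
    return sorted(topic for topic in set(assigned_topics) if topic != -1)


def collect_topic_words(get_topic, assigned_topics):
    """Top words per real topic; get_topic(topic_id) yields (word, score) pairs."""
    return [[word for word, _ in get_topic(topic)]
            for topic in real_topic_ids(assigned_topics)]


# Stand-in for topic_model.get_topic, used only for this illustration.
fake_topics = {
    -1: [("the", 0.1), ("and", 0.1)],
    0: [("cat", 0.5), ("dog", 0.4)],
    1: [("stock", 0.6), ("market", 0.3)],
}
assigned = [0, 1, -1, 0, 1, 1]

print(collect_topic_words(fake_topics.__getitem__, assigned))
```

The same list of IDs can be used to pick the matching rows of a topic-word matrix, so that words and rows stay aligned.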

wuyoscar commented 2 years ago

@PoonooP I have the same issue ("raise ValueError('unable to interpret topics either a list of tokens or a list of ids').

But I finally fixed it. BERTopic's default is top_n_words = 10, and a topic with fewer words can end up with empty strings among its topic_words.

The code below works for me (add if words!=''): [words for words, _ in topic_model.get_topic(topic) if words!='']

The complete code is below:

def calculate_coherence_score(topic_model):
  # Note: `topics` and `clean_docs` are read from the enclosing scope
  topic_words = [[words for words, _ in topic_model.get_topic(topic) if words!=''] 
               for topic in range(len(set(topics))-1)]
  vectorizer = topic_model.vectorizer_model
  tokenizer = vectorizer.build_tokenizer()
  tokens = [doc.split() for doc in clean_docs]
  dictionary = corpora.Dictionary(tokens)
  corpus = [dictionary.doc2bow(token) for token in tokens]

  coherence_model = CoherenceModel(topics=topic_words, 
                                   texts=tokens, 
                                   corpus=corpus,
                                   dictionary=dictionary, 
                                   coherence='c_v')
  coherence = coherence_model.get_coherence()

  return coherence


Keep in mind that coherence is not a perfect metric for measuring the performance of a topic model. In my findings, different measurements have different sweet spots :)

justinchuntingho commented 2 years ago

Thank you for your detailed explanation, @MaartenGr. I think it would be very useful for other users if you could add the above recommendations to the FAQ (e.g., "How do I evaluate a topic model?"). I believe this is one of the questions that puzzle many users (including myself).

jacobceles commented 1 year ago

Good catch, I did not test for higher n-grams in the example. I made two changes:

  • Used the build_analyzer() instead of build_tokenizer() which allows for n-gram tokenization
  • Preprocessing is now based on a collection of documents per topic, since the CountVectorizer was trained on that data

Tested it with several ranges of n-grams and it seems to work now.

from bertopic import BERTopic
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

topic_model = BERTopic(verbose=True, n_gram_range=(1, 3))
topics, _ = topic_model.fit_transform(docs)

# Preprocess Documents
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model.get_topic(topic)] 
               for topic in range(len(set(topics))-1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = coherence_model.get_coherence()

Hi @MaartenGr, while running this code I encountered the error:

/opt/conda/lib/python3.7/site-packages/numpy/core/ RuntimeWarning:
Mean of empty slice.
/opt/conda/lib/python3.7/site-packages/numpy/core/ RuntimeWarning:
invalid value encountered in double_scalars

and therefore, the coherence value is nan. Do you know why this may have happened?

MaartenGr commented 1 year ago

It might be that some of the terms in the topics are empty due to the presence of empty documents. I believe it would help to remove those documents, or you can check the topic_words to see if some of the terms are empty.

jiezhou94 commented 1 year ago

Hello Maarten, when I increase the number of topics in bertopic, the coherence score decreases. Is there any reason for this?

Many thanks in advance.

MaartenGr commented 1 year ago

@jiezhou94 That depends on a number of things, including the coherence measure that you are using. One coherence measure behaves differently from another. What often happens though is that with more topics, topics are created that are less frequent in general which might impact the coherence. I would advise reading through the coherence formulas of the one that you are using and their corresponding papers to get a more intuitive feeling.

jhuang2023 commented 1 year ago

Dear @MaartenGr, thank you for your work. The discussions on this page are extremely helpful, since how to choose the best number of topics is an inevitable question in topic modeling tasks. Would you consider creating a post specifically about selecting the number of topics with BERTopic? It would be great for beginners.

MaartenGr commented 1 year ago

@pepperamy Unfortunately, there is not a single right way to choose the number of topics within BERTopic which makes it quite difficult to create a dedicated post for that. More specifically, it depends on a number of things, including the evaluation metric that you use, the clustering algorithm, the use case, human evaluation, clustering performance, etc. I am a little bit afraid that such an article would essentially provide a non-answer since there is no fixed way to do this.

drob-xx commented 1 year ago

One of the reasons I wrote TopicTuner was to aid in the selection of topic model size. By iterating through a number of different cluster configurations you can generally get a visual feedback of what might make sense for your data. BERTopic of course has a built-in function for visualizing embeddings - however, my intent with TopicTuner was to provide a lighter weight solution that allowed a user to quickly iterate through many different topic sizes to see what worked best.

paulacanva commented 1 year ago

tokens = [tokenizer(doc) for doc in cleaned_docs]

Even using this piece of code I still get the same error others mentioned. I'm not sure what I'm missing.

"unable to interpret topic as either a list of tokens or a list of ids"

MaartenGr commented 1 year ago

@paulacanva It is difficult to say without seeing the entire error message, but it might be worthwhile to check what kind of types gensim.models.CoherenceModel is expecting. tokens might not be a list of lists of strings but something different. Diving into gensim's API could help here.

lovemyday commented 1 year ago

tokens = [tokenizer(doc) for doc in cleaned_docs]

Even using this piece of code I still get the same error others mentioned. I'm not sure what I'm missing.

"unable to interpret topic as either a list of tokens or a list of ids"

Hi, I ran into the same error, which was caused by empty topic words. Some topics may have empty top-N words for some reason. Deleting such empty topics solved this problem in my case.
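
A minimal sketch of that clean-up, assuming topic_words is a list of word lists as in the snippets above. drop_empty_topic_words is a hypothetical helper; it removes empty or non-string entries first and then discards any topic left without words.

```python
def drop_empty_topic_words(topic_words):
    """Remove empty entries from each topic, then drop topics with no words left."""
    cleaned = []
    for words in topic_words:
        kept = [word for word in words if isinstance(word, str) and word.strip()]
        if kept:
            cleaned.append(kept)
    return cleaned


# Toy input: the second topic consists of empty strings only.
topic_words = [["cat", "", "dog"], ["", ""], ["market"]]
print(drop_empty_topic_words(topic_words))
```

Note that dropping topics changes how many topics are scored, which matters when comparing coherence values between runs.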

zhimin-z commented 1 year ago

I have encountered the same issue: unable to interpret topic as either a list of tokens or a list of ids, wondering what to do with the model? @MaartenGr The code:

from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.coherencemodel import CoherenceModel
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

import gensim.corpora as corpora
import pandas as pd
import wandb
import os

os.environ["TOKENIZERS_PARALLELISM"] = "true"
path_dataset = os.path.join(os.path.dirname(os.getcwd()), 'Dataset')

wandb_project = 'asset-management-project'

df_all = pd.read_json(os.path.join(path_dataset, 'all_original.json'))
docs = df_all['Challenge_original_content_gpt_summary'].tolist()

# set general sweep configuration
sweep_configuration = {
    "name": "experiment-2",
    "metric": {
        'name': 'CoherenceCV',
        'goal': 'maximize'
    },
    "method": "grid",
    "parameters": {
        'n_neighbors': {
            'values': list(range(10, 110, 10))
        },
        'n_components': {
            'values': list(range(2, 12, 2))
        },
        'ngram_range': {
            'values': list(range(3, 6))
        },
    }
}
# set default sweep configuration
config_defaults = {
    'model_name': 'all-mpnet-base-v2',
    'metric_distane': 'manhattan',
    'low_memory': True,
    'max_cluster_size': 1500,
    'min_cluster_size': 50,
    'stop_words': 'english',
    'reduce_frequent_words': True
}

def train():
    # Initialize a new wandb run
    with wandb.init() as run:
        # update any values not set by sweep
        run.config.setdefaults(config_defaults)

        # Step 1 - Extract embeddings
        embedding_model = SentenceTransformer(run.config.model_name)

        # Step 2 - Reduce dimensionality
        umap_model = UMAP(n_neighbors=wandb.config.n_neighbors, n_components=wandb.config.n_components,
                          metric=run.config.metric_distane, low_memory=run.config.low_memory)

        # Step 3 - Cluster reduced embeddings
        hdbscan_model = HDBSCAN()

        # Step 4 - Tokenize topics
        vectorizer_model = TfidfVectorizer(
            stop_words=run.config.stop_words, ngram_range=(1, wandb.config.ngram_range))

        # Step 5 - Create topic representation
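        # (argument inferred from config_defaults above)
        ctfidf_model = ClassTfidfTransformer(
            reduce_frequent_words=run.config.reduce_frequent_words)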

        # Step 6 - Fine-tune topic representation
        representation_model = KeyBERTInspired()

        # All steps together
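        # (arguments inferred from the models defined in steps 1-6)
        topic_model = BERTopic(
            embedding_model=embedding_model,
            umap_model=umap_model,
            hdbscan_model=hdbscan_model,
            vectorizer_model=vectorizer_model,
            ctfidf_model=ctfidf_model,
            representation_model=representation_model,
            # Step 7 - Track model stages
            # verbose=True
        )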

        topics, _ = topic_model.fit_transform(docs)

        # Preprocess documents
        documents = pd.DataFrame(
            {"Document": docs,
             "ID": range(len(docs)),
             "Topic": topics})
        documents_per_topic = documents.groupby(
            ['Topic'], as_index=False).agg({'Document': ' '.join})
        # (argument inferred: the joined documents per topic)
        cleaned_docs = topic_model._preprocess_text(
            documents_per_topic.Document.values)

        # Extract vectorizer and analyzer from fit model
        analyzer = vectorizer_model.build_analyzer()
        # Extract features for topic coherence evaluation
        tokens = [analyzer(doc) for doc in cleaned_docs]
        dictionary = corpora.Dictionary(tokens)
        corpus = [dictionary.doc2bow(token) for token in tokens]
        topic_words = [[words for words, _ in topic_model.get_topic(topic)]
                       for topic in range(len(set(topics))-1)]

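        # (arguments inferred from the variables above; the coherence
        # measure of each model follows its variable name)
        coherence_cv = CoherenceModel(
            topics=topic_words, texts=tokens, corpus=corpus,
            dictionary=dictionary, coherence='c_v')

        coherence_umass = CoherenceModel(
            topics=topic_words, texts=tokens, corpus=corpus,
            dictionary=dictionary, coherence='u_mass')

        coherence_cuci = CoherenceModel(
            topics=topic_words, texts=tokens, corpus=corpus,
            dictionary=dictionary, coherence='c_uci')

        coherence_cnpmi = CoherenceModel(
            topics=topic_words, texts=tokens, corpus=corpus,
            dictionary=dictionary, coherence='c_npmi')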

        wandb.log({'CoherenceCV': coherence_cv.get_coherence()})
        wandb.log({'CoherenceUMASS': coherence_umass.get_coherence()})
        wandb.log({'CoherenceUCI': coherence_cuci.get_coherence()})
        wandb.log({'CoherenceNPMI': coherence_cnpmi.get_coherence()})

sweep_id = wandb.sweep(sweep_configuration, project=wandb_project)
# Create sweep with ID: j7pnz7gn
wandb.agent(sweep_id=sweep_id, function=train)

Partial topic output: (screenshots not included). Is there anything I can do about topic 8? How can I avoid such an issue? This has bothered me for quite a while...

MaartenGr commented 1 year ago

@zhimin-z Topic 8 most likely contains documents that are all empty or that will be empty after processing with the CountVectorizer. Personally, I would just remove that topic if those documents are all empty or nearly empty.

zhimin-z commented 1 year ago

Thanks for your advice. I am not sure what removing a topic means here, since I cannot find a function that redistributes the documents of an empty topic into other, non-empty topics. Is there any code for reference? @MaartenGr

MaartenGr commented 1 year ago

@zhimin-z You can remove the empty documents beforehand as they do not contribute to any actual topic and then re-run BERTopic. You can also use .merge_topics to merge the empty topic with a non-empty topic. Having said that, I would advise simply not passing that topic to the Coherence function as the topic itself is simply empty.

zhimin-z commented 1 year ago


Thanks for your suggestion, @MaartenGr. There is a complication, though: I am currently running a hyperparameter sweep with BERTopic. The documents that end up in the empty topic vary with the BERTopic hyperparameters, so it is hard to remove them before the modeling is complete. The only thing I can think of is to save the empty-topic documents for each hyperparameter set and then run an extra round of the sweep. But that considerably prolongs the experimentation, given hundreds of sweep runs.

Also, you mentioned merge_topics, but this seems to require knowing exactly which topic the documents belong to. How would I know which specific topics the empty-topic documents should be merged into in my case? Is there a feasible solution?

MaartenGr commented 1 year ago

@zhimin-z I was actually suggesting removing the empty documents before training your model. Since those documents are empty, they do not contribute to the training process, and removing them should be rather straightforward.

With respect to the merging of topics, you can find topics containing all empty documents by finding that the keywords of those topics are empty. In other words, if you simply loop over .get_topic and check whether the first 5 keywords are empty, you can find which topics contain empty documents.

zhimin-z commented 1 year ago

Hi @MaartenGr, thanks for your reply. As I mentioned in my earlier feedback, I have more than 300 experiments in the hyperparameter optimization sweep, and each of them produces a different set of documents in the empty topic. It is infeasible to remove them prior to each experiment, since I would only know which documents to remove after running every experiment at least once. What could I do in this case?

MaartenGr commented 1 year ago

@zhimin-z What I meant here is that the documents, before using them in BERTopic, are likely to already be empty or near-empty. You can find those documents, for example, by removing all documents that have a length lower than 5. That way, there is a good chance that those documents will be removed before doing the 300 experiments.

The important thing here is that there is a reason those documents are generating empty topics and that is likely because they are either quite small (<10 characters) or are simply empty. Identifying those documents will be trivial if the previous notions are true as you can simply do [doc for doc in docs if len(doc) > 10] or anything else that will remove those documents. To me, that seems the easiest option.
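
A small sketch of that pre-filtering step, assuming docs is a list of strings and using the 10-character threshold from the comment above. The helper name filter_short_docs is hypothetical. It also returns the positions of the kept documents so that results can be mapped back to the original dataset.

```python
def filter_short_docs(docs, min_chars=10):
    """Keep documents with more than `min_chars` characters after stripping.

    Returns the kept documents and their indices in the original list.
    """
    kept_docs, kept_idx = [], []
    for i, doc in enumerate(docs):
        if isinstance(doc, str) and len(doc.strip()) > min_chars:
            kept_docs.append(doc)
            kept_idx.append(i)
    return kept_docs, kept_idx


docs = ["", "   ", "short", "A long enough document about topic models."]
kept_docs, kept_idx = filter_short_docs(docs)
print(kept_docs)  # prints ['A long enough document about topic models.']
print(kept_idx)   # prints [3]
```

Because the filter runs once on the raw documents, it does not depend on the hyperparameters of any individual sweep run.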

However, if you cannot do it beforehand and need to do it during the experiments, then I would advise doing what I mentioned above. You could detect those empty topics by identifying which topics have no keywords and then either not passing them to the coherence calculation or merging them with another topic. You can detect whether a topic has no keywords by looping over topic_model.get_topic() and checking whether the very first keyword is empty. If the first keyword is empty, all others will be too, and therefore the documents will be empty.

meh369 commented 1 year ago

FYI, @zhimin-z, @MaartenGr, @wuyoscar: I made a modification to the code to avoid the "unable to interpret topic as either a list of tokens or a list of ids" error by creating a corpora dictionary from tokens and then creating a list of token IDs using token2id. This approach does not require using the list of lists from topic_model.get_topic(topic) as the topic_words.

from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim import corpora

def calculate_coherence_score(topic_model, docs):
    # Preprocess documents
    cleaned_docs = topic_model._preprocess_text(docs)

    # Extract vectorizer and tokenizer from BERTopic
    vectorizer = topic_model.vectorizer_model
    tokenizer = vectorizer.build_tokenizer()

    # Extract features for Topic Coherence evaluation
    words = vectorizer.get_feature_names_out()
    tokens = [tokenizer(doc) for doc in cleaned_docs]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]
    # Create topic words
    topic_words = [[dictionary.token2id[w] for w in words if w in dictionary.token2id]
                   for _ in range(topic_model.nr_topics)]

    # this creates a list of the token ids (in the format of integers) of the words in words that are also present in the
    # dictionary created from the preprocessed text. The topic_words list contains list of token ids for each
    # topic.

    # (arguments after `topics` are inferred from the variables above;
    # the coherence measure 'c_v' is an assumption)
    coherence_model = CoherenceModel(topics=topic_words,
                                     texts=tokens,
                                     corpus=corpus,
                                     dictionary=dictionary,
                                     coherence='c_v')
    coherence = coherence_model.get_coherence()

    return coherence

calculate_coherence_score(topic_model=bert_model, docs=sentence_tokenized_text)

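
Note that the list comprehension in the snippet above iterates over the whole vocabulary for every topic, so each topic receives the same list of IDs. If per-topic keywords are wanted, a variant could map each topic's own keywords to IDs instead. The following is only a sketch in plain Python: topic_words_to_ids is a hypothetical helper, topic_words is assumed to be a list of keyword lists, and token2id is assumed to be a plain token-to-ID mapping such as the one a gensim Dictionary exposes.

```python
def topic_words_to_ids(topic_words, token2id):
    """Map each topic's keywords to token IDs, skipping unknown keywords.

    Topics that have no keyword in `token2id` are dropped entirely.
    """
    topics_as_ids = []
    for words in topic_words:
        ids = [token2id[w] for w in words if w in token2id]
        if ids:
            topics_as_ids.append(ids)
    return topics_as_ids


token2id = {"model": 0, "data": 1, "loss": 2}
topic_words = [["model", "data", "unseen"], ["", ""], ["loss"]]
print(topic_words_to_ids(topic_words, token2id))
# prints [[0, 1], [2]]
```

Multi-word keywords (n-grams) will only be found if the dictionary was built from n-gram tokens as well, so with a unigram dictionary they are silently skipped here.
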
MaartenGr commented 1 year ago

@zhimin-z I would like to keep this issue open. It is a topic that gets discussed quite frequently and this thread gives a nice overview of different solutions for calculating coherence but also the pros and cons of using such a method.