MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.04k stars 757 forks

About Coherence of topic models #90

Open nadiafelix opened 3 years ago

nadiafelix commented 3 years ago

Currently, I am calculating the coherence of a BERTopic model using gensim. For this I need the n-grams from each text of the corpus. Is that possible? The gensim function expects the corpus and topics, and the topics must be tokens that exist in the corpus.

cm = CoherenceModel(topics=topics, corpus=corpus, dictionary=dictionary, coherence='u_mass')

Thanks in advance.
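
For reference, with coherence='u_mass' gensim only needs the topics (as lists of words), a corpus, and a dictionary. The sketch below is one way to build those from BERTopic, assuming a fitted `topic_model` and the raw `docs`, and using the vectorizer's analyzer so that the n-grams match BERTopic's own tokenization:

from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Hypothetical setup: `topic_model` is a fitted BERTopic model and `docs` the raw documents.
analyzer = topic_model.vectorizer_model.build_analyzer()   # reproduces BERTopic's (n-gram) tokenization
tokens = [analyzer(doc) for doc in docs]
dictionary = Dictionary(tokens)
corpus = [dictionary.doc2bow(doc_tokens) for doc_tokens in tokens]

# Top words per topic, skipping the -1 outlier topic and words missing from the dictionary
topic_words = [[word for word, _ in topic_model.get_topic(topic_id) if word in dictionary.token2id]
               for topic_id in topic_model.get_topics() if topic_id != -1]

cm = CoherenceModel(topics=topic_words, corpus=corpus, dictionary=dictionary, coherence='u_mass')
print(cm.get_coherence())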

zhimin-z commented 1 year ago

FYI, @zhimin-z, @MaartenGr , @wuyoscar I made a modification to the code to avoid the "unable to interpret topic as either a list of tokens or a list of ids" error by creating a corpora dictionary from tokens and then creating a list of token IDs using token2id. This approach does not require using the list of lists from topic_model.get_topic(topic) as the topic_words.

from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim import corpora

def calculate_coherence_score(topic_model, docs):
    # Preprocess documents
    cleaned_docs = topic_model._preprocess_text(docs)

    # Extract vectorizer and tokenizer from BERTopic
    vectorizer = topic_model.vectorizer_model
    tokenizer = vectorizer.build_tokenizer()

    # Extract features for Topic Coherence evaluation
    words = vectorizer.get_feature_names_out()
    tokens = [tokenizer(doc) for doc in cleaned_docs]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]
    # Create topic words
    topic_words = [[dictionary.token2id[w] for w in words if w in dictionary.token2id]
                   for _ in range(topic_model.nr_topics)]

    # This creates a list of the token ids (integers) of the words that are also present in the
    # dictionary built from the preprocessed text. The topic_words list contains a list of token
    # ids for each topic.

    coherence_model = CoherenceModel(topics=topic_words,
                                    texts=tokens,
                                    corpus=corpus,
                                    dictionary=dictionary,
                                    coherence='c_v')
    coherence = coherence_model.get_coherence()

    return coherence
calculate_coherence_score(topic_model=bert_model, docs=sentence_tokenized_text)

Hi, thanks for your code, @meh369. This piece of code does not seem to work, though. I deployed it with W&B and found that there still exist empty topics in each run.

Any other suggestions? @meh369 @MaartenGr

MaartenGr commented 1 year ago
topic_words = [[dictionary.token2id[w] for w in words if w in dictionary.token2id] for _ in range(topic_model.nr_topics)]

@meh369 This does not create topic words per topic but multiple identical lists of tokens, so I do not think the model is correctly evaluated here.
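
For illustration, with hypothetical words, the pattern reduces to this; the loop variable is ignored, so every topic receives the same list:

words = ["apple", "banana", "cherry"]   # hypothetical feature names
nr_topics = 3
topic_words = [[w for w in words] for _ in range(nr_topics)]
print(topic_words)
# [['apple', 'banana', 'cherry'], ['apple', 'banana', 'cherry'], ['apple', 'banana', 'cherry']]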

In the code I mentioned here, there is the following line that you can adjust to skip topics that only contain empty values:

topic_words = [[words for words, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics))-1)]

What you want here is to make sure that two things are prevented:

* Passing words that are not found in the dictionary
  * These are typically empty words
* Topics that are completely empty

First, let's create a reproducible topic model that has some topics containing empty words:

from umap import UMAP
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Prepare embeddings
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
docs = [doc for doc in docs if len(doc) >= 10]
docs += ["the"] * 100
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Train topic model
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model, verbose=True, min_topic_size=50)
topics, probs = topic_model.fit_transform(docs, embeddings)

Now, we can start calculating the coherence score and making sure that empty words are not passed to the CoherenceModel as well as topics that do not contain any words:

from bertopic import BERTopic
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel
import pandas as pd

# Preprocess Documents
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Use .get_feature_names_out() if you get an error with .get_feature_names()
words = vectorizer.get_feature_names()

# Extract features for Topic Coherence evaluation
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]

# Extract words in each topic if they are non-empty and exist in the dictionary
topic_words = []
for topic in range(len(set(topics))-topic_model._outliers):
    words = list(zip(*topic_model.get_topic(topic)))[0]
    words = [word for word in words if word in dictionary.token2id]
    topic_words.append(words)
topic_words = [words for words in topic_words if len(words) > 0]

# Evaluate Coherence
coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = coherence_model.get_coherence()

zhimin-z commented 1 year ago

get_feature_names_out

Thanks for your example, @MaartenGr. I ran it immediately but found there is an exception.

MaartenGr commented 1 year ago

Ah yes, it should either be .get_feature_names() or .get_feature_names_out() depending on your scikit-learn version.
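
If you want a single snippet that runs on both versions, a small sketch:

# scikit-learn >= 1.0 renamed the method; fall back for older versions.
try:
    words = vectorizer.get_feature_names_out()
except AttributeError:
    words = vectorizer.get_feature_names()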

meh369 commented 1 year ago

@MaartenGr, thank you so much for testing and pointing out the mistake in time. I'm still learning, so I really appreciate your help!

zhimin-z commented 1 year ago

Ah yes, it should either be .get_feature_names() or .get_feature_names_out() depending on your scikit-learn version.

Thanks so much, @MaartenGr! This piece of code indeed solves the empty-topics issue that has been torturing me for quite a while.

RamziRahli commented 1 year ago

Hello @MaartenGr, I want to use BERTopic on my data but I'm hesitating between 3 embedding models. I'm trying to use the evaluation provided here and OCTIS to calculate the diversity and coherence of each model, but I failed. Could you provide an example of how I could do this, possibly using cuML? Thank you!

MaartenGr commented 1 year ago

@RamziRahli That repo was merely for the evaluation of experiments in the paper and was not meant to be generally used. Instead, I would advise performing the evaluations yourself using the guidelines in OCTIS or using Gensim with the provided example here.
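
If you only need a quick proxy for diversity next to the Gensim coherence example, one common definition is the proportion of unique words across the top-n words of all topics; a minimal sketch (not the OCTIS implementation):

def topic_diversity(topic_model, topn=10):
    """Fraction of unique words among the top-n words of all topics (1.0 = no overlap)."""
    topic_words = [word
                   for topic_id, topic in topic_model.get_topics().items()
                   if topic_id != -1                    # skip the outlier topic
                   for word, _ in topic[:topn]]
    return len(set(topic_words)) / len(topic_words) if topic_words else 0.0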

RamziRahli commented 1 year ago

@RamziRahli That repo was merely for the evaluation of experiments in the paper and was not meant to be generally used. Instead, I would advise performing the evaluations yourself using the guidelines in OCTIS or using Gensim with the provided example here.

@MaartenGr I tried to calculate the coherence on 500K relatively short documents (150 characters maximum) as in the example, but it takes more than 24 hours. Is this normal?

MaartenGr commented 1 year ago

@RamziRahli That is difficult to say without seeing the actual code (and feel free to create an issue for this), but it would not be surprising depending on your setup. Calculating coherence measures is notoriously slow.
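
Two things in Gensim that usually reduce the runtime are evaluating fewer top words per topic and using multiple processes; a sketch on the variables from the example above:

from gensim.models.coherencemodel import CoherenceModel

# topic_words, tokens, corpus, and dictionary as built in the earlier example.
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v',
                                 topn=5,        # default is 20; fewer words per topic is much cheaper
                                 processes=4)   # defaults to cpu_count - 1
coherence = coherence_model.get_coherence()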

rizkiamandaputri commented 1 year ago

Hello everyone! I just want to ask a question. I tried to print out BERTopic's coherence score in my interface, but I got the error: 'numpy.float64' object has no attribute 'get_coherence'. Here is my code:

documents = pd.DataFrame({"Document": texts,
                          "ID": range(len(texts)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model_n._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model_n.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names_out()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model_n.get_topic(topic)]
               for topic in range(len(set(topics))-1)]
# topic_words = [[dictionary.token2id[w] for w in words if w in dictionary.token2id]
# for _ in range(topic_model_n.nr_topics)]

# Evaluate
coherence_cv = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = coherence_cv.get_coherence()

# Print Data Evaluation
topic_eval = coherence.get_coherence()

res = topic_eval.to_json(orient="records")
parsed = json.loads(res)
json_topic_evaluation = parsed

How can I solve this error? Thank you.

MaartenGr commented 1 year ago

You should not run coherence.get_coherence() since coherence is already the result. In other words, remove the following:

# Print Data Evaluation
topic_eval = coherence.get_coherence()

rizkiamandaputri commented 1 year ago

You should not run coherence.get_coherence() since coherence is already the result. In other words, remove the following:

# Print Data Evaluation
topic_eval = coherence.get_coherence()

I got a similar error as before: 'numpy.float64' object has no attribute 'to_json'. This is the code:

coherence_cv = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = coherence_cv.get_coherence()

# Print Data Evaluation
res = coherence.to_json(orient="records")
parsed = json.loads(res)
json_topic_evaluation = parsed

MaartenGr commented 1 year ago

The type of coherence is numpy.float64, which means it is just a single value. If you want to save that single value as JSON, you would have to check yourself how to save a numpy float to JSON. Also, since it is a numpy.float64, it does not have a to_json function. I would advise checking a few tutorials on using JSON in Python.
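
For example, building on the coherence_cv object above, casting to a plain Python float first makes the value JSON-serializable:

import json

coherence = coherence_cv.get_coherence()                      # a numpy.float64 scalar
json_topic_evaluation = json.dumps({"coherence_cv": float(coherence)})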

mike-bmnn commented 12 months ago

@MaartenGr Is it generally a good or bad idea to use a representation model while evaluating the coherence score of a model? I noticed that using KeyBERTInspired while evaluating the coherence score yields different results than using no representation model, although I have to say that the scores are still very similar.

MaartenGr commented 12 months ago

@mike-asw It depends. If the representation model that you use is important for your use case, then you should definitely include it in the evaluation. The multiple scores also give you an idea of the effect of representation models on the resulting coherence evaluation metric.

I do think that when you include representation models and run evaluation metrics, you should definitely include these representation models in the evaluation procedure. It has always surprised me that, when evaluating BERTopic, many users and researchers focus only on the base representation when there are so many more to choose from.
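
As a sketch of what that looks like: the representation model changes the words returned by .get_topic(), so the coherence computed from those words automatically reflects it (assuming `docs` is your corpus):

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model, verbose=True)
topics, probs = topic_model.fit_transform(docs)
# The coherence snippet above now evaluates the KeyBERTInspired topic words.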

ninavdPipple commented 10 months ago

Hi Maarten,

I was looking at the discussion above and figured that at some point you switched from the tokenizer to the analyzer in order to perform n-gram tokenization. In my code both implementations seem to work; however, they give very different coherence values. I do specify an ngram_range in my CountVectorizer. Which of the two (tokenizer or analyzer) will give the ‘correct’ coherence value in my case, if such a notion even exists? And what should be considered when picking one of the two?

Thanks in advance!

MaartenGr commented 10 months ago

@ninavdPipple As you mentioned, there is no "correct" coherence value. It all depends on the reasons why you would choose the tokenizer over the analyzer or vice versa. Having said that, since you are using ngram_range it makes sense to choose the one that actually supports n-grams. If the differences are large, then it might be worthwhile to research why that may be the case and mention that in your research.
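
To make the difference concrete: build_tokenizer() only splits the raw text into single tokens, while build_analyzer() also applies the preprocessing and the configured ngram_range, so it matches the fitted vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
tokenizer = vectorizer.build_tokenizer()
analyzer = vectorizer.build_analyzer()

doc = "Topic models need evaluation"
print(tokenizer(doc))  # ['Topic', 'models', 'need', 'evaluation'] -- unigrams, no preprocessing
print(analyzer(doc))   # ['topic', 'models', 'need', 'evaluation', 'topic models', 'models need', 'need evaluation']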

benearnthof commented 10 months ago

Hi, I'm currently using the code above to calculate coherence measures for topic models based on arXiv preprints, and the line coherence = coherence_model.get_coherence() keeps running out of memory; my Python session crashes with the console output "Killed". Did anyone else run into this problem? It persists for corpora larger than 12,000 documents.

MaartenGr commented 10 months ago

@benearnthof Calculating coherence scores takes a lot of memory and I am not familiar with any more efficient techniques. Making sure you have enough RAM is definitely important here. Also, make sure that your vocabulary is not unnecessarily large when you are using n-grams. The min_df parameter definitely helps here.
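
For example, something along these lines; the threshold is only illustrative:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# min_df drops n-grams that occur in fewer than 10 documents, which keeps the vocabulary
# (and the memory needed for the coherence computation) much smaller.
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model, verbose=True)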

benearnthof commented 10 months ago

@MaartenGr I have experimented with mmcorpus but will give min_df a shot, thanks for the swift reply!

abis330 commented 9 months ago

I tried plotting the coherence score (c_v) against the number of topics while changing the hyperparameters "n_neighbors" and "n_components" for the UMAP model and "cluster_selection_epsilon" and "min_cluster_size" for the HDBSCAN model passed to BERTopic.

The resulting graph is monotonically decreasing. Shouldn't we expect the opposite, or at least a maximum somewhere, with the coherence increasing up to that point and decreasing afterwards?

It is weird that the coherence score always seems to decrease as the number of topics increases.

I could use some feedback ASAP. @MaartenGr

MaartenGr commented 9 months ago

@abinashsinha330 It is difficult to say without knowing all the specifics of your data, use case, type of coherence (e.g., c_v vs. NPMI), etc. For example, it could simply be that there is little data available for each additional topic and, therefore, the topic representations are not as good as the first few. Of course, this could also depend on the representation model that you choose.

However, after a quick Google search, you can find several papers that not only report this phenomenon but also observe that the coherence score might increase again after a certain point. You can do some research on your chosen coherence score to get an intuition about how it works. Then, you can experiment and investigate why your specific graphs look the way they do.

Do note that this issue thread is mostly focused on evaluation in general and, as you might have read here, I am generally against such a large focus on only coherence. So my main advice would be not to focus that much on coherence scores only and create a broad evaluation of your topic model. The thought that a topic model should only be evaluated by a coherence score (whatever that exactly means with different metrics) can get you into trouble when using the model in practice.

nickgiann commented 5 months ago

Hi @MaartenGr ,

I noticed that in your provided example for calculating coherence scores, the entire corpus is used for both fitting and evaluation. I'm interested in your perspective on incorporating a train-test split for model assessment. Would this improve the evaluation's robustness by measuring generalizability to unseen data, or might it lead to non-representative coherence scores?

Thanks in advance!

MaartenGr commented 5 months ago

@nickgiann Hmmm, I seldom see train/test splits for that, since you would still need the same vocabulary across splits, which in turn requires the entire corpus to be passed.

The thing is that unseen data does not influence the training of BERTopic; whenever you run .transform, it only updates the topic assignments and not the topic representations. So unseen data, at least from that perspective, should not influence the coherence score unless you are looking at incremental topic modeling settings.
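
For instance, with a hypothetical held-out document:

# .transform only assigns existing topics to new documents; the topic representations,
# and therefore the coherence computed from them, stay as they were after .fit_transform.
held_out_docs = ["an unseen document about space exploration"]
new_topics, new_probs = topic_model.transform(held_out_docs)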

romybeaute commented 4 months ago

Dear @MaartenGr, thank you so much for all your useful advice above. Having had compatibility issues with OCTIS, I am trying to find an alternative way to do hyperparameter tuning (with respect to the coherence measure). I tried creating a BERTopic grid-search wrapper, in which I manually define the coherence function:

class BERTopicGridSearchWrapper(BaseEstimator):
    def __init__(self, vectorizer_model, embedding_model, n_neighbors=10, n_components=5, min_dist=0.01, min_cluster_size=10, min_samples=None, top_n_words=5):
        self.vectorizer_model = vectorizer_model
        self.embedding_model = embedding_model
        self.n_neighbors = n_neighbors
        self.n_components = n_components
        self.min_dist = min_dist
        self.min_cluster_size = min_cluster_size
        self.min_samples = min_samples
        self.top_n_words = top_n_words
        self.model = None

    def fit(self, X):

        umap_model = UMAP(n_neighbors=self.n_neighbors, n_components=self.n_components, min_dist=self.min_dist, random_state=77)
        hdbscan_model = HDBSCAN(min_cluster_size=self.min_cluster_size, min_samples=self.min_samples, prediction_data=True)

        self.model = BERTopic(umap_model=umap_model, 
                              hdbscan_model=hdbscan_model,
                              embedding_model=self.embedding_model,
                              vectorizer_model=self.vectorizer_model,
                              top_n_words=self.top_n_words,
                              language='english',
                              calculate_probabilities=True,
                              verbose=True)
        self.model.fit_transform(X)
        return self

    def score(self, X):
        coherence_score = calculate_coherence(self.model, X)
        return coherence_score

def calculate_coherence(topic_model, data):

    topics, _ = topic_model.fit_transform(data)
    # Preprocess Documents
    documents = pd.DataFrame({"Document": data,
                          "ID": range(len(data)),
                          "Topic": topics})
    documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})

    #Extracting the vectorizer and embedding model from BERTopic model
    vectorizer = topic_model.vectorizer_model #CountVectorizer of BERTopic model 
    tokenizer = vectorizer.build_tokenizer()
    analyzer = vectorizer.build_analyzer() #allows for n-gram tokenization

    # Extract features for Topic Coherence evaluation
    words = vectorizer.get_feature_names_out()
    tokens = [tokenizer(doc) for doc in data]
    # tokens = [analyzer(doc) for doc in data]

    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]

    topic_words = [[word for word, _ in topic_model.get_topic(topic_id)] for topic_id in range(len(set(topics))-1)]

    print("Topics:", topic_words)
    coherence_model = CoherenceModel(topics=topic_words, 
                                     texts=tokens, 
                                     corpus=corpus,
                                     dictionary=dictionary, 
                                     coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    return coherence_score

However, when I run my grid search:

grid_search = GridSearchCV(BERTopicGridSearchWrapper(vectorizer_model, embedding_model),
                           param_grid=params_grid,
                           cv=None,
                           scoring=make_scorer(calculate_coherence),
                           verbose=10)

# Fit grid search
grid_search.fit(reports_filtered)

print("Best parameters:", grid_search.best_params_)
print("Best coherence score:", grid_search.best_score_)

I keep getting "nan" as my coherence scores ([CV 1/5; 1/8] END min_cluster_size=5, min_dist=0.01, min_samples=None, n_components=3, n_neighbors=3;, **score=nan** total time= 3.4s)

I have been trying to find the source of this issue for a while, and among my debugging attempts, I found that when I use the wrapper alone:

wrapper = BERTopicGridSearchWrapper(vectorizer_model=vectorizer_model, embedding_model=SentenceTransformer('all-MiniLM-L6-v2'))
wrapper.fit(reports_filtered.tolist())  
coherence = wrapper.score(reports_filtered.tolist())
print(coherence)

I obtain a coherence score.

Do you have any idea of what is going on here, and what I might have gotten wrong? Thank you so much for your attention!

All the best, Romy

MaartenGr commented 4 months ago

@romybeaute Unfortunately, I'm not that familiar with how a customized GridSearchWrapper should be implemented within scikit-learn. You potentially could do it manually since there is no cross-validation involved in your example. It would be looping over parameters and nothing more if I'm not mistaken.
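
A minimal sketch of such a manual loop, reusing reports_filtered, vectorizer_model, embedding_model, and the calculate_coherence helper you defined above; the parameter values are only illustrative:

from itertools import product
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

results = []
for n_neighbors, min_cluster_size in product([3, 10, 30], [5, 10, 30]):
    umap_model = UMAP(n_neighbors=n_neighbors, n_components=5, min_dist=0.01, random_state=77)
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, prediction_data=True)
    topic_model = BERTopic(umap_model=umap_model,
                           hdbscan_model=hdbscan_model,
                           embedding_model=embedding_model,
                           vectorizer_model=vectorizer_model,
                           calculate_probabilities=True)
    topics, _ = topic_model.fit_transform(reports_filtered)
    results.append({"n_neighbors": n_neighbors,
                    "min_cluster_size": min_cluster_size,
                    "nr_topics": len(set(topics)) - (1 if -1 in topics else 0),
                    "coherence": calculate_coherence(topic_model, reports_filtered)})

best = max(results, key=lambda r: r["coherence"])
print(best)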

romybeaute commented 3 months ago

Dear @MaartenGr, thanks a lot for your previous answer to my question. I have been applying your advice and now have a CSV file containing the different combinations that have been tested, together with their respective coherence score and number of topics (grid_search_results_HandWritten_seed22.csv). But the best coherence results lead to very few topics being created. So I am in a situation where I need to find a balance between the coherence score and a number of extracted topics that is reasonable for my research. This choice seems quite subjective... Is that acceptable? Would you recommend any other, more objective, method to select the number of extracted topics (and therefore the hyperparameter combinations that lead to that number of topics)? Moreover, would you also recommend doing cross-validation with BERTopic? It was not mentioned in the (amazing) tutorials that you uploaded online, so I was wondering how robust our results are without CV.
Many thanks for your precious help, Romy

MaartenGr commented 3 months ago

@romybeaute

But the best coherence results lead to very few topics created.

That's indeed the problem I have with using topic coherence and grid search together: you are not likely to end up with the quality of topics that you are looking for. As such, and as you can see throughout this issue, I would definitely not recommend grid-searching on topic coherence only. It is important to first take note of what "performance" or "quality" means in your specific use case and derive metrics based on that. Topic coherence by itself tells you very little about a topic model, especially when you take into account the other perspectives of what a topic model can be good at, such as the assignment of topics, the diversity of topics, the accuracy of topics rather than their coherence, etc.

But this choice seems quite subjective... Is it something acceptable to do ? Would you recommend any other - more objective - method to select the number of extracted of topics (and therefore the hyperparameters combinations that lead to this number of extracted topics)?

It is indeed subjective, but that is not necessarily a bad thing because your use case is subjective. You have certain requirements for your specific use case, one of which is the number of extracted topics. It would be more than reasonable to say that having 2 topics across 1 million documents makes no sense and that, based on your familiarity with the data, there are at least n topics.

If you want a purely objective measure for something that is inherently subjective, that will prove to be quite difficult. Instead, I generally advise a mix. You can use proxy measures such as topic coherence and diversity as the "objective" measures (note they are not ground-truth metrics) and "subjective" information such as limiting the number of topics to a certain minimum.
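
As a sketch of that mix on your grid-search results file (the column names here are assumptions; adjust them to your CSV):

import pandas as pd

results = pd.read_csv("grid_search_results_HandWritten_seed22.csv")

# "Subjective" constraint: require a minimum number of topics that makes sense for the use case.
MIN_TOPICS = 10
candidates = results[results["nr_topics"] >= MIN_TOPICS]

# "Objective" proxy: rank the remaining configurations by coherence (and/or diversity).
print(candidates.sort_values("coherence", ascending=False).head())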

All in all, I would advise starting from the metric itself. Why is optimizing for only topic coherence so important for your use case?

Moreover, would you recommend also doing cross validation with BERTopic ? It was not mentioned in the (amazing) tutorials that you uploaded online, so was wondering how robust are our results if no CV.

What would be the splits and evaluation here? Normally, you would train on 80% of the data and perform inference on the remaining 20%. In the context of topic coherence, there is no inference involved, only training.