MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

ctfidf breaks down when specifying a vocabulary in CountVectorizer #1711

dannywhuang commented 8 months ago

In some cases, the `stop_words` parameter of the CountVectorizer is not enough to prevent certain undesired words from coming through. For example, one may want to filter out undesired tokens such as abbreviations before generating topic representations.

This can be done by specifying a vocabulary in the CountVectorizer object (see the sklearn docs).
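
For context, a minimal sketch of that approach (the vocabulary terms here are illustrative, not from the issue):

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# An explicit allow-list of terms; anything outside it is ignored
# when topic representations are computed.
vocabulary = ["graphics", "image", "encryption", "baseball"]
vectorizer_model = CountVectorizer(vocabulary=vocabulary)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```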

However, a problem that then occurs is that c-TF-IDF breaks down due to a division by zero in line 82 of `_ctfidf.py`, `idf = np.log((avg_nr_samples / df)+1)`, because some words in the specified vocabulary may never actually occur in the documents, leaving zero entries in `df`.

I would therefore propose changing the line above to `idf = np.log((avg_nr_samples / np.maximum(df, 1)) + 1)`.

This solution does not change behaviour in normal cases and makes it possible to specify a vocabulary when creating topic representations.
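
To make the failure mode concrete, here is a minimal sketch (the numbers are made up for illustration) of how a zero document frequency produces the warning and how the proposed guard avoids it:

```python
import numpy as np

# Document frequency per vocabulary word; a word that never occurs
# in any document ends up with a zero entry.
df = np.array([5, 3, 0])
avg_nr_samples = 10

idf = np.log((avg_nr_samples / df) + 1)
# RuntimeWarning: divide by zero; the last entry becomes inf

idf_guarded = np.log((avg_nr_samples / np.maximum(df, 1)) + 1)
# Finite everywhere; entries with df >= 1 are unchanged
```

Since `np.maximum(df, 1)` only touches the zero entries, words that occur at least once get identical scores under both versions.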

MaartenGr commented 8 months ago

Thanks for the issue and PR. Before I check it out, do you perhaps have a reproducible example? That way, I can verify the issue. Also, what would be the impact of your change on the wall time and output? Does your change influence a regular run?

zilch42 commented 7 months ago

Hi @MaartenGr, I've been seeing this warning a lot too. It relates to the workflow I ended up with after the discussion in #1665, so this example should be representative.

```python
from sklearn.datasets import fetch_20newsgroups
import re
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Prepare documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

def preprocess_text(documents: list[str]) -> list[str]:
    r"""Basic preprocessing of text

    Steps:
        * Replace \n and \t with whitespace
        * Only keep alphanumeric characters
    """
    cleaned_documents = [doc.replace("\n", " ") for doc in documents]
    cleaned_documents = [doc.replace("\t", " ") for doc in cleaned_documents]
    cleaned_documents = [re.sub(r'[^A-Za-z0-9 ]+', '', doc) for doc in cleaned_documents]
    cleaned_documents = [doc if doc != "" else "emptydoc" for doc in cleaned_documents]
    return cleaned_documents

docs = preprocess_text(docs)

# Build a vocabulary with a first CountVectorizer pass, then reuse it
pre_vectorizer_model = CountVectorizer(min_df=10, ngram_range=(1, 3), stop_words="english")
pre_vectorizer_model.fit(docs)
vocabulary = list(set(pre_vectorizer_model.vocabulary_.keys()))

vectorizer_model = CountVectorizer(vocabulary=vocabulary)
topic_model = BERTopic(vectorizer_model=vectorizer_model, verbose=True)
topics, probs = topic_model.fit_transform(docs)
```

```
2024-01-23 11:11:37,358 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100% 589/589 [00:17<00:00, 102.73it/s]
2024-01-23 11:11:55,788 - BERTopic - Embedding - Completed ✓
2024-01-23 11:11:55,789 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-01-23 11:12:06,768 - BERTopic - Dimensionality - Completed ✓
2024-01-23 11:12:06,770 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-01-23 11:12:10,754 - BERTopic - Cluster - Completed ✓
2024-01-23 11:12:10,760 - BERTopic - Representation - Extracting topics from clusters using representation models.
c:\path\lib\site-packages\bertopic\vectorizers\_ctfidf.py:82: RuntimeWarning: divide by zero encountered in divide
  idf = np.log((avg_nr_samples / df)+1)
2024-01-23 11:12:14,768 - BERTopic - Representation - Completed ✓
```

I haven't run it through the PR yet though.

Note that setting the ngram_range in the pre_vectorizer_model seems to be required to reproduce the warning.
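
A plausible explanation (my inference, not stated in the thread): the second CountVectorizer keeps its default ngram_range=(1, 1), so the bi- and trigrams that pre_vectorizer_model put into the vocabulary can never be matched, and their document frequency stays zero. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["hello world", "hello there"]

# The vocabulary contains a bigram, but with the default
# ngram_range=(1, 1) the analyzer only ever produces unigrams,
# so the bigram column can never receive a count.
vectorizer = CountVectorizer(vocabulary=["hello", "hello world"])
X = vectorizer.fit_transform(docs)
print(X.toarray())
# [[1 0]
#  [1 0]]  <- "hello world" has document frequency 0
```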