MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Words not in CountVectorizer vocab despite being well above min_df threshold #1665

Open zilch42 opened 10 months ago

zilch42 commented 10 months ago

Hi Maarten,

I'm having an issue where some important words don't end up in the CountVectorizer vocabulary when using min_df, even though they are well above the set threshold. My understanding of min_df is that when it is an integer (say, 10), all words that appear in at least 10 documents will be kept. Is that correct? See the example below:

from umap import UMAP
from datasets import load_dataset
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:5_000]

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
vectorizer_model = CountVectorizer(min_df=10)

topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model).fit(abstracts)

Take the word 'cancer' for example.

# this undercounts because it misses words next to punctuation,
# but it also won't match words inside other words (e.g. "robot" in "robotics")
len([d for d in abstracts if " cancer " in d.lower()])
>>>  32

'cancer' appears in at least 32 documents (not counting occurrences next to punctuation), so it should be well above min_df.

topic_model.vectorizer_model.vocabulary_["cancer"]

raises a KeyError.

Also try words: privacy, reward, quantum, fuzzy, hashing, planning, player, robot, embeddings

If I remove the min_df param and just use CountVectorizer(), then these words are all present.

I first came across this with the word 'rust' which was an important word in a large topic with ~70 document occurrences and min_df=10. It isn't prominent in the arxiv dataset though.

Do you know what is happening here?

MaartenGr commented 10 months ago

Before going into what might be causing this, it is worthwhile to first check whether this is the case when performing the tokenization with the vectorizer model itself. That way, we can be 100% sure about the counts. Could you check the counts with build_analyzer and the tokenizer to see whether they indeed do not match?

zilch42 commented 10 months ago

Thanks for the tip.

Is this what you mean?

analyzer = vectorizer_model.build_analyzer()
len([d for d in abstracts if "cancer" in analyzer(d)])
>>> 43

MaartenGr commented 10 months ago

I'm not entirely sure, but it can also be a definition issue. If min_df refers to the document frequency, as in the number of documents that contain a certain word, then the counts you did are actually not representative of a "document". In BERTopic, the CountVectorizer is not trained on all documents but on all topics. This means that the documents are first concatenated into a single string for each topic before fitting the CountVectorizer. As a result, the number of documents now refers to the number of topics, since each "document" is a single string containing all documents in a topic.
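
Roughly, and just as a sketch rather than BERTopic's exact internal code, the effect is something like this (reusing the abstracts and fitted topic_model from your snippet):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# sketch of the behaviour described above: documents are grouped by their assigned
# topic and joined into one long string per topic before the CountVectorizer is fit
docs_df = pd.DataFrame({"Document": abstracts, "Topic": topic_model.topics_})
docs_per_topic = docs_df.groupby("Topic", as_index=False).agg({"Document": " ".join})

# min_df=10 now means "appears in at least 10 topics", a much stricter
# requirement than "appears in at least 10 abstracts"
vectorizer = CountVectorizer(min_df=10).fit(docs_per_topic.Document)
print("cancer" in vectorizer.vocabulary_)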

zilch42 commented 10 months ago

Ah, I think that is probably it! Thank you. So if I had a topic model with very good topic separation, and an important descriptive keyword that only appeared in one or two topics, then min_df would not be my friend for controlling the size of the c-TF-IDF matrix?

I've tried pre-filtering the vocabulary so that I can apply the min_df threshold to document occurrences rather than topic occurrences, which I think makes more sense for my use case, but I am running into an error.

from umap import UMAP
from datasets import load_dataset
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:5_000]

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# filter vocabulary on document occurrences
pre_vectorizer_model = CountVectorizer(min_df=10, stop_words="english", ngram_range=(1,3))
pre_vectorizer_model.fit(abstracts)
vocabulary = list(set(pre_vectorizer_model.vocabulary_.keys()))

vectorizer_model = CountVectorizer(vocabulary=vocabulary)

topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model).fit(abstracts)
ValueError: Input contains infinity or a value too large for dtype('float64').

I adapted this from the example on KeyBERT in Tips & Tricks, but that example is throwing the same error. https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#keybert-bertopic

from sklearn.datasets import fetch_20newsgroups
from keybert import KeyBERT

# Prepare documents 
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

# Extract keywords
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs)

# Create our vocabulary
vocabulary = [k[0] for keyword in keywords for k in keyword]
vocabulary = list(set(vocabulary))

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(vocabulary=vocabulary)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\path\count_vectorizer_issue.ipynb Cell 2 line 2
     18 vectorizer_model= CountVectorizer(vocabulary=vocabulary)
     19 topic_model = BERTopic(vectorizer_model=vectorizer_model)
---> 20 topics, probs = topic_model.fit_transform(docs)

File c:\path\lib\site-packages\bertopic\_bertopic.py:440, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    437         documents = self._reduce_topics(documents)
    439     # Save the top 3 most representative documents per topic
--> 440     self._save_representative_docs(documents)
    442 # Resulting output
    443 self.probabilities_ = self._map_probabilities(probabilities, original_topics=True)

File c:\path\lib\site-packages\bertopic\_bertopic.py:3654, in BERTopic._save_representative_docs(self, documents)
   3645 def _save_representative_docs(self, documents: pd.DataFrame):
   3646     """ Save the 3 most representative docs per topic
   3647 
   3648     Arguments:
   (...)
   3652         self.representative_docs_: Populate each topic with 3 representative docs
   3653     """
-> 3654     repr_docs, _, _, _ = self._extract_representative_docs(
   3655         self.c_tf_idf_,
   3656         documents,
   3657         self.topic_representations_,
   3658         nr_samples=500,
   3659         nr_repr_docs=3
   3660     )
   3661     self.representative_docs_ = repr_docs

File c:\path\lib\site-packages\bertopic\_bertopic.py:3718, in BERTopic._extract_representative_docs(self, c_tf_idf, documents, topics, nr_samples, nr_repr_docs, diversity)
   3716 bow = self.vectorizer_model.transform(selected_docs)
   3717 ctfidf = self.ctfidf_model.transform(bow)
-> 3718 sim_matrix = cosine_similarity(ctfidf, c_tf_idf[index])
   3720 # Use MMR to find representative but diverse documents
   3721 if diversity:

File c:\path\lib\site-packages\sklearn\utils\_param_validation.py:211, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    205 try:
    206     with config_context(
    207         skip_parameter_validation=(
    208             prefer_skip_nested_validation or global_skip_validation
    209         )
    210     ):
--> 211         return func(*args, **kwargs)
    212 except InvalidParameterError as e:
    213     # When the function is just a wrapper around an estimator, we allow
    214     # the function to delegate validation to the estimator, but we replace
    215     # the name of the estimator by the name of the function in the error
    216     # message to avoid confusion.
    217     msg = re.sub(
    218         r"parameter of \w+ must be",
    219         f"parameter of {func.__qualname__} must be",
    220         str(e),
    221     )

File c:\path\lib\site-packages\sklearn\metrics\pairwise.py:1577, in cosine_similarity(X, Y, dense_output)
   1542 """Compute cosine similarity between samples in X and Y.
   1543 
   1544 Cosine similarity, or the cosine kernel, computes similarity as the
   (...)
   1573     Returns the cosine similarity between samples in X and Y.
   1574 """
   1575 # to avoid recursive import
-> 1577 X, Y = check_pairwise_arrays(X, Y)
   1579 X_normalized = normalize(X, copy=True)
   1580 if X is Y:

File c:\path\lib\site-packages\sklearn\metrics\pairwise.py:165, in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
    156     X = Y = check_array(
    157         X,
    158         accept_sparse=accept_sparse,
   (...)
    162         estimator=estimator,
    163     )
    164 else:
--> 165     X = check_array(
    166         X,
    167         accept_sparse=accept_sparse,
    168         dtype=dtype,
    169         copy=copy,
    170         force_all_finite=force_all_finite,
    171         estimator=estimator,
    172     )
    173     Y = check_array(
    174         Y,
    175         accept_sparse=accept_sparse,
   (...)
    179         estimator=estimator,
    180     )
    182 if precomputed:

File c:\path\lib\site-packages\sklearn\utils\validation.py:883, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    881 if sp.issparse(array):
    882     _ensure_no_complex_data(array)
--> 883     array = _ensure_sparse_format(
    884         array,
    885         accept_sparse=accept_sparse,
    886         dtype=dtype,
    887         copy=copy,
    888         force_all_finite=force_all_finite,
    889         accept_large_sparse=accept_large_sparse,
    890         estimator_name=estimator_name,
    891         input_name=input_name,
    892     )
    893 else:
    894     # If np.array(..) gives ComplexWarning, then we convert the warning
    895     # to an error. This is needed because specifying a non complex
    896     # dtype to the function converts complex to real dtype,
    897     # thereby passing the test made in the lines following the scope
    898     # of warnings context manager.
    899     with warnings.catch_warnings():

File c:\path\lib\site-packages\sklearn\utils\validation.py:573, in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse, estimator_name, input_name)
    568         warnings.warn(
    569             "Can't check %s sparse matrix for nan or inf." % spmatrix.format,
    570             stacklevel=2,
    571         )
    572     else:
--> 573         _assert_all_finite(
    574             spmatrix.data,
    575             allow_nan=force_all_finite == "allow-nan",
    576             estimator_name=estimator_name,
    577             input_name=input_name,
    578         )
    580 return spmatrix

File c:\path\lib\site-packages\sklearn\utils\validation.py:124, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    121 if first_pass_isfinite:
    122     return
--> 124 _assert_all_finite_element_wise(
    125     X,
    126     xp=xp,
    127     allow_nan=allow_nan,
    128     msg_dtype=msg_dtype,
    129     estimator_name=estimator_name,
    130     input_name=input_name,
    131 )

File c:\path\lib\site-packages\sklearn\utils\validation.py:173, in _assert_all_finite_element_wise(X, xp, allow_nan, msg_dtype, estimator_name, input_name)
    156 if estimator_name and input_name == "X" and has_nan_error:
    157     # Improve the error message on how to handle missing values in
    158     # scikit-learn.
    159     msg_err += (
    160         f"\n{estimator_name} does not accept missing values"
    161         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    171         "#estimators-that-handle-nan-values"
    172     )
--> 173 raise ValueError(msg_err)

ValueError: Input contains infinity or a value too large for dtype('float64').

Any idea why this example doesn't work?

MaartenGr commented 10 months ago

I am not entirely sure, but it might be that there are topics without any words at all, since those words could have been removed from the vocabulary. Maybe reducing the min_df parameter would work here.

zilch42 commented 10 months ago

I thought about that, but your KeyBERT example doesn't use min_df at all. Every topic should have words in that example.

I wonder if it is that there are docs that don't have any words that line up with the topic words? But there are more words cut out of the vocab when doing the filtering on topic-occurrences than there would be with document-occurrences, and I've never seen this error in the first (default) case. The error seems to result from supplying CountVectorizer with a vocabulary.

Does that KeyBERT example run for you?

The issue seems to arise when selecting representative documents. If it is the result of docs that don't have any words that line up with the topic words, those docs are likely not representative documents. Could they just be skipped in the similarity calculation?

MaartenGr commented 10 months ago

Ah right, force of habit. I indeed meant that there might be a chance that some topics end up without words by restricting the vocabulary. It would be the same with setting min_df too high.
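
One quick way to check that, as a sketch using the vocabulary and docs from your snippet, is to count how many documents end up with no tokens at all under the restricted vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

# count documents that contain none of the words in the restricted vocabulary;
# such documents contribute nothing to their topic's bag-of-words
bow = CountVectorizer(vocabulary=vocabulary).fit_transform(docs)
n_empty = int((bow.sum(axis=1) == 0).sum())
print(f"{n_empty} of {bow.shape[0]} documents contain no vocabulary word")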

Does that KeyBERT example run for you?

I haven't had the time but perhaps can check it out later this week.

The issue seems to arise when selecting representative documents. If it is the result of docs that don't have any words that line up with the topic words, those docs are likely not representative documents. Could they just be skipped in the similarity calculation?

It is not exactly the result of words not lining up, but of the distance metric calculation receiving incorrect values. The main issue lies here:

sim_matrix = cosine_similarity(ctfidf, c_tf_idf[index])

This means that one or both matrices contain values that result in infinities being created. My guess is that the c-TF-IDF calculation might create strange values if there are few to no words in a topic.
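
One way to probe that guess, as a sketch: judging by the traceback, c_tf_idf_ is already set on the model by the time the error is raised, so you can catch the error and inspect it (topic_model and docs as in your snippet):

import numpy as np

try:
    topics, probs = topic_model.fit_transform(docs)
except ValueError:
    # inspect the topic-level c-TF-IDF matrix: are any stored values non-finite,
    # and do any topics end up with an all-zero row (no vocabulary words at all)?
    ctfidf = topic_model.c_tf_idf_
    print("all finite:", np.isfinite(ctfidf.data).all())
    print("empty topic rows:", int((np.asarray(ctfidf.sum(axis=1)).ravel() == 0).sum()))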

zilch42 commented 10 months ago

Ok, I've figured it out. The docs inside BERTopic get cleaned internally by _preprocess_text() before being tokenized, so creating a vocabulary outside of BERTopic, even from the same document set, results in vocabulary words that do not appear in the cleaned docs. This results in a df with zeros here:

https://github.com/MaartenGr/BERTopic/blob/fc9a51aee18011afbbf4045f30460c2970527c49/bertopic/vectorizers/_ctfidf.py#L69-L70

which results in inf values when dividing by zero here:

https://github.com/MaartenGr/BERTopic/blob/fc9a51aee18011afbbf4045f30460c2970527c49/bertopic/vectorizers/_ctfidf.py#L79-L82

which causes problems for cosine_similarity().
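
As a toy illustration (not the exact c-TF-IDF formula, just made-up numbers), a vocabulary word that never appears in the cleaned docs ends up with a document frequency of zero, and dividing by that zero produces inf:

import numpy as np

df = np.array([12., 3., 0.])  # the last word exists only in the external vocabulary
avg_nr_samples = 100.0
with np.errstate(divide="ignore"):
    idf = np.log(avg_nr_samples / df)  # division by zero -> inf
print(idf)  # the inf later trips the finiteness check inside cosine_similarity()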

This can be solved by preprocessing the docs in the same way before creating the vocabulary:

import numpy as np
import re
from sklearn.datasets import fetch_20newsgroups
from keybert import KeyBERT

# Prepare documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

def preprocess_text(documents: np.ndarray):
    """ Basic preprocessing of text

    Steps:
        * Replace \n and \t with whitespace
        * Only keep alpha-numerical characters
    """
    cleaned_documents = [doc.replace("\n", " ") for doc in documents]
    cleaned_documents = [doc.replace("\t", " ") for doc in cleaned_documents]
    cleaned_documents = [re.sub(r'[^A-Za-z0-9 ]+', '', doc) for doc in cleaned_documents]
    cleaned_documents = [doc if doc != "" else "emptydoc" for doc in cleaned_documents]
    return cleaned_documents

docs = preprocess_text(docs)

# Extract keywords
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs)

# Create our vocabulary
vocabulary = [k[0] for keyword in keywords for k in keyword]
vocabulary = list(set(vocabulary))

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(vocabulary=vocabulary)
topic_model = BERTopic(vectorizer_model=vectorizer_model, verbose=True)
topics, probs = topic_model.fit_transform(docs)

It would be nice to just use the internal _preprocess_text function so that I wouldn't have to keep track of any changes to it, but we create the vocabulary before the BERTopic object exists, so I'm not sure there is a cleaner way to do this.
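
One hedged option, if you're willing to rely on a private method whose signature may change between releases, is to call it on a throwaway instance so the vocabulary is built from exactly the same cleaned text:

import numpy as np
from bertopic import BERTopic

# _preprocess_text is private API; this assumes it keeps the signature shown in the
# linked source and only needs default settings (e.g. language="english") to run
cleaned_docs = BERTopic()._preprocess_text(np.asarray(docs))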

I'm happy for this to be closed, but you may wish to update the KeyBERT example in Tips and Tricks so that it actually runs (and you may have a more elegant solution than I have above 😄 ).

MaartenGr commented 10 months ago

Thanks for figuring this out! Definitely seems like a bug that should be fixed in a future release. I am a bit surprised though that this happened since the steps for preprocessing are almost identical to what is happening inside the CountVectorizer. Do you by chance have an example of a document that gets rendered differently by using the preprocessing steps?

zilch42 commented 10 months ago

Sure, most docs in 20 Newsgroups have at least one example, but try docs[8]. It has a few:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import re

# Prepare documents 
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

def preprocess_text(documents: np.ndarray):
    """ Basic preprocessing of text

    Steps:
        * Replace \n and \t with whitespace
        * Only keep alpha-numerical characters
    """
    cleaned_documents = [doc.replace("\n", " ") for doc in documents]
    cleaned_documents = [doc.replace("\t", " ") for doc in cleaned_documents]
    cleaned_documents = [re.sub(r'[^A-Za-z0-9 ]+', '', doc) for doc in cleaned_documents]
    cleaned_documents = [doc if doc != "" else "emptydoc" for doc in cleaned_documents]
    return cleaned_documents

clean_docs = preprocess_text(docs)

pre_vectorizer_model=CountVectorizer(stop_words="english")
pre_vectorizer_model.fit([docs[8]])
vocabulary = set(pre_vectorizer_model.vocabulary_.keys())

pre_vectorizer_model=CountVectorizer(stop_words="english")
pre_vectorizer_model.fit([clean_docs[8]])
clean_vocabulary = set(pre_vectorizer_model.vocabulary_.keys())

vocabulary.difference(clean_vocabulary)
{'bruin', 'didn', 'fuhr', 'sabre', 've'}
docs[8]
"\n\n\nYeah, it's the second one.  And I believe that price too.  I've been trying\nto get a good look at it on the Bruin-Sabre telecasts, and wow! does it ever\nlook good.  Whoever did that paint job knew what they were doing.  And given\nFuhr's play since he got it, I bet the Bruins are wishing he didn't have it:)\n"

It looks like the removal of punctuation causes them to split differently.
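
For example (just a sketch of the default tokenization; the default token_pattern only keeps tokens of two or more word characters):

from sklearn.feature_extraction.text import CountVectorizer

analyzer = CountVectorizer().build_analyzer()
print(analyzer("didn't"))  # ['didn'] -- the apostrophe splits the word and the lone 't' is dropped
print(analyzer("didnt"))   # ['didnt']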

If you compare the vocabularies from the whole dataset, a lot of the different words involve numbers too, but again, it may be down to punctuation removal.

pre_vectorizer_model=CountVectorizer(stop_words="english")
pre_vectorizer_model.fit(docs)
vocabulary = set(pre_vectorizer_model.vocabulary_.keys())

pre_vectorizer_model=CountVectorizer(stop_words="english")
pre_vectorizer_model.fit(clean_docs)
clean_vocabulary = set(pre_vectorizer_model.vocabulary_.keys())

vocabulary.difference(clean_vocabulary)
{'8337',
 'mwhwc',
 'dob',
 'mgga2a',
 'd8yq',
 'eb1l',
 '7lbs',
 'lk8rlhzrlk8v',
 'iinsfdop93nn',
 'kelleyb',
 'wa4e',
 'wzr',
 'm66l4i',
 'e_xv3',
 'm87',
 'mcng',
 'qvdz',
 '77pt',
 'pmjpeg11',
 'vokabul',
 '0vy4b',
 'u40q',
 'w4k',
 'ldmy',
 'kd4fkw',
 'vsbd',
 '0669',
 '3mh',
 'acxx',
 '4wj',
 'obqp',
 'xbl4',
 'p6x',
 'jn2',
...

MaartenGr commented 10 months ago

Thanks for the extensive example! It definitely seems like the preprocessing steps change the way these documents are represented, even if they are just small cleaning steps. Hmmm, I am not sure what the best way to approach this is. Technically, I could remove the preprocessing step entirely since CountVectorizer takes care of most of it, but I am not too keen on letting it handle things like "\n" and such.

zilch42 commented 10 months ago

I didn't necessarily see it as a bug for my use case. I like the preprocessing, and I wouldn't necessarily want things like "couldn't" being transformed to ["couldn", "t"], which I think is what CountVectorizer would do on its own. It makes sense that I would have to replicate any internal cleaning on the outside if I wanted to preprocess a vocabulary.

Is the mismatch in preprocessing causing/likely to cause any other issues? Or can it just be left as is?

MaartenGr commented 10 months ago

It is quite an uncommon error that pops up and it is often resolved with some simple settings (for example, reducing words with c-TF-IDF's parameters) or indeed accessing the internal preprocessing, so I would leave it for now. That said, thanks for sharing this! It definitely helps me understand how people are using some of this functionality and the issues they run into.