Open zilch42 opened 10 months ago
Before going into what might be causing this, it is worthwhile to first check whether the same thing happens when performing the tokenization with the vectorizer model itself. That way, we can be 100% sure about the counts. Could you check the counts with build_analyzer and the tokenizer to see whether they indeed do not match?
Thanks for the tip.
Is this what you mean?
analyzer = vectorizer_model.build_analyzer()
len([d for d in abstracts if "cancer" in analyzer(d)])
>>> 43
I'm not entirely sure, but it can also be a definition issue. If min_df refers to the document frequency, as in the number of documents that contain a certain word, then the counts you computed are not based on what BERTopic treats as a "document". In BERTopic, the CountVectorizer is not trained on all documents but on all topics. This means that the documents are first concatenated into a single string for each topic before fitting the CountVectorizer. As a result, the number of documents now refers to the number of topics, since each "document" is a single string containing all documents in a topic.
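Roughly speaking, a simplified sketch of that step (not the exact internals, and with made-up toy data) looks like this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Toy example: 4 documents spread over 2 topics
docs = ["cats purr at night", "cats meow", "dogs bark at night", "dogs growl"]
topics = [0, 0, 1, 1]
# 1. Concatenate all documents belonging to the same topic into one string
docs_per_topic = (
    pd.DataFrame({"Document": docs, "Topic": topics})
    .groupby("Topic", as_index=False)
    .agg({"Document": " ".join})
)
# 2. Fit the CountVectorizer on those concatenated strings. There are 4 documents
#    but only 2 topic strings, so min_df=2 now means "appears in at least 2 topics"
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(docs_per_topic["Document"])
print(vectorizer.get_feature_names_out())  # ['at' 'night'] -- only words in both topics survive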
Ah, I think that is probably it! Thank you. So if I had a topic model with very good topic separation, and an important descriptive keyword that only appeared in one or two topics, then min_df would not be my friend for controlling the size of the c-TF-IDF matrix?
I've tried pre-filtering the vocabulary so that I can apply the min_df threshold to document occurrences rather than topic occurrences, which I think makes more sense for my use case, but I am running into an error.
from umap import UMAP
from datasets import load_dataset
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:5_000]
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
# filter vocabulary on document occurrences
pre_vectorizer_model=CountVectorizer(min_df=10, stop_words="english", ngram_range=(1,3))
pre_vectorizer_model.fit(abstracts)
vocabulary = list(set(pre_vectorizer_model.vocabulary_.keys()))
vectorizer_model = CountVectorizer(vocabulary=vocabulary)
topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model).fit(abstracts)
ValueError: Input contains infinity or a value too large for dtype('float64').
I adapted this from the example on KeyBERT in Tips & Tricks, but that example is throwing the same error. https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#keybert-bertopic
from sklearn.datasets import fetch_20newsgroups
from keybert import KeyBERT
# Prepare documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Extract keywords
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs)
# Create our vocabulary
vocabulary = [k[0] for keyword in keywords for k in keyword]
vocabulary = list(set(vocabulary))
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model= CountVectorizer(vocabulary=vocabulary)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
c:\path\count_vectorizer_issue.ipynb Cell 2 line 2
18 vectorizer_model= CountVectorizer(vocabulary=vocabulary)
19 topic_model = BERTopic(vectorizer_model=vectorizer_model)
---> 20 topics, probs = topic_model.fit_transform(docs)
File c:\path\lib\site-packages\bertopic\_bertopic.py:440, in BERTopic.fit_transform(self, documents, embeddings, images, y)
437 documents = self._reduce_topics(documents)
439 # Save the top 3 most representative documents per topic
--> 440 self._save_representative_docs(documents)
442 # Resulting output
443 self.probabilities_ = self._map_probabilities(probabilities, original_topics=True)
File c:\path\lib\site-packages\bertopic\_bertopic.py:3654, in BERTopic._save_representative_docs(self, documents)
3645 def _save_representative_docs(self, documents: pd.DataFrame):
3646 """ Save the 3 most representative docs per topic
3647
3648 Arguments:
(...)
3652 self.representative_docs_: Populate each topic with 3 representative docs
3653 """
-> 3654 repr_docs, _, _, _ = self._extract_representative_docs(
3655 self.c_tf_idf_,
3656 documents,
3657 self.topic_representations_,
3658 nr_samples=500,
3659 nr_repr_docs=3
3660 )
3661 self.representative_docs_ = repr_docs
File c:\path\lib\site-packages\bertopic\_bertopic.py:3718, in BERTopic._extract_representative_docs(self, c_tf_idf, documents, topics, nr_samples, nr_repr_docs, diversity)
3716 bow = self.vectorizer_model.transform(selected_docs)
3717 ctfidf = self.ctfidf_model.transform(bow)
-> 3718 sim_matrix = cosine_similarity(ctfidf, c_tf_idf[index])
3720 # Use MMR to find representative but diverse documents
3721 if diversity:
File c:\path\lib\site-packages\sklearn\utils\_param_validation.py:211, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
205 try:
206 with config_context(
207 skip_parameter_validation=(
208 prefer_skip_nested_validation or global_skip_validation
209 )
210 ):
--> 211 return func(*args, **kwargs)
212 except InvalidParameterError as e:
213 # When the function is just a wrapper around an estimator, we allow
214 # the function to delegate validation to the estimator, but we replace
215 # the name of the estimator by the name of the function in the error
216 # message to avoid confusion.
217 msg = re.sub(
218 r"parameter of \w+ must be",
219 f"parameter of {func.__qualname__} must be",
220 str(e),
221 )
File c:\path\lib\site-packages\sklearn\metrics\pairwise.py:1577, in cosine_similarity(X, Y, dense_output)
1542 """Compute cosine similarity between samples in X and Y.
1543
1544 Cosine similarity, or the cosine kernel, computes similarity as the
(...)
1573 Returns the cosine similarity between samples in X and Y.
1574 """
1575 # to avoid recursive import
-> 1577 X, Y = check_pairwise_arrays(X, Y)
1579 X_normalized = normalize(X, copy=True)
1580 if X is Y:
File c:\path\lib\site-packages\sklearn\metrics\pairwise.py:165, in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
156 X = Y = check_array(
157 X,
158 accept_sparse=accept_sparse,
(...)
162 estimator=estimator,
163 )
164 else:
--> 165 X = check_array(
166 X,
167 accept_sparse=accept_sparse,
168 dtype=dtype,
169 copy=copy,
170 force_all_finite=force_all_finite,
171 estimator=estimator,
172 )
173 Y = check_array(
174 Y,
175 accept_sparse=accept_sparse,
(...)
179 estimator=estimator,
180 )
182 if precomputed:
File c:\path\lib\site-packages\sklearn\utils\validation.py:883, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
881 if sp.issparse(array):
882 _ensure_no_complex_data(array)
--> 883 array = _ensure_sparse_format(
884 array,
885 accept_sparse=accept_sparse,
886 dtype=dtype,
887 copy=copy,
888 force_all_finite=force_all_finite,
889 accept_large_sparse=accept_large_sparse,
890 estimator_name=estimator_name,
891 input_name=input_name,
892 )
893 else:
894 # If np.array(..) gives ComplexWarning, then we convert the warning
895 # to an error. This is needed because specifying a non complex
896 # dtype to the function converts complex to real dtype,
897 # thereby passing the test made in the lines following the scope
898 # of warnings context manager.
899 with warnings.catch_warnings():
File c:\path\lib\site-packages\sklearn\utils\validation.py:573, in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse, estimator_name, input_name)
568 warnings.warn(
569 "Can't check %s sparse matrix for nan or inf." % spmatrix.format,
570 stacklevel=2,
571 )
572 else:
--> 573 _assert_all_finite(
574 spmatrix.data,
575 allow_nan=force_all_finite == "allow-nan",
576 estimator_name=estimator_name,
577 input_name=input_name,
578 )
580 return spmatrix
File c:\path\lib\site-packages\sklearn\utils\validation.py:124, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
121 if first_pass_isfinite:
122 return
--> 124 _assert_all_finite_element_wise(
125 X,
126 xp=xp,
127 allow_nan=allow_nan,
128 msg_dtype=msg_dtype,
129 estimator_name=estimator_name,
130 input_name=input_name,
131 )
File c:\path\lib\site-packages\sklearn\utils\validation.py:173, in _assert_all_finite_element_wise(X, xp, allow_nan, msg_dtype, estimator_name, input_name)
156 if estimator_name and input_name == "X" and has_nan_error:
157 # Improve the error message on how to handle missing values in
158 # scikit-learn.
159 msg_err += (
160 f"\n{estimator_name} does not accept missing values"
161 " encoded as NaN natively. For supervised learning, you might want"
(...)
171 "#estimators-that-handle-nan-values"
172 )
--> 173 raise ValueError(msg_err)
ValueError: Input contains infinity or a value too large for dtype('float64').
Any idea why this example doesn't work?
I am not entirely sure, but it might be that there are topics without any words at all, since words could have been removed from the vocabulary. Maybe reducing the min_df parameter would work here.
I thought about that, but your KeyBERT example doesn't use min_df at all. Every topic should have words in that example.
I wonder if it is that there are docs that don't have any words that line up with the topic words? But there are more words cut out of the vocab when filtering on topic occurrences than there would be with document occurrences, and I've never seen this error in the first (default) case. The error seems to result from supplying CountVectorizer with a vocabulary.
Does that KeyBERT example run for you?
The issue seems to arise when selecting representative documents. If it is the result of docs that don't have any words that line up with the topic words, those docs are likely not representative documents. Could they just be skipped in the similarity calculation?
Ah right, force of habit. I indeed meant that there might be a chance that some topics end up without words by restricting the vocabulary. It would be the same as setting min_df too high.
Does that KeyBERT example run for you?
I haven't had the time yet, but I can perhaps check it out later this week.
The issue seems to arise when selecting representative documents. If it is the result of docs that don't have any words that line up with the topic words, those docs are likely not representative documents. Could they just be skipped in the similarity calculation?
It is not exactly the result of words not lining up; rather, the calculation of a distance metric is receiving incorrect values. The main issue lies here:
sim_matrix = cosine_similarity(ctfidf, c_tf_idf[index])
This means that there are values in either one or both matrices that result in infinite values being created. My guess is that the c-TF-IDF calculation might create strange values if there are few to no words in a topic.
Ok, I've figured it out. The docs inside BERTopic get cleaned internally by _preprocess_text() before being tokenized, so creating a vocabulary outside of BERTopic, even from the same document set, results in words in the vocabulary that do not appear in the cleaned docs. Those words end up with a document frequency of zero, which produces inf values when dividing by zero in the c-TF-IDF calculation, which in turn causes problems for cosine_similarity().
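As a toy illustration of that failure mode (simplified numbers; the exact c-TF-IDF formula in BERTopic may differ slightly, but the division by a zero frequency is the point):
import numpy as np
# The last vocabulary word never occurs in the cleaned docs, so its document frequency is zero
df = np.array([5.0, 3.0, 0.0])
avg_nr_samples = 4.0
# An idf-style division by that zero yields inf, and cosine_similarity rejects matrices containing inf
idf = np.log(avg_nr_samples / df + 1)
print(idf)  # [..., ..., inf]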
This can be solved by preprocessing the docs in the same way before creating the vocabulary:
import re
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from keybert import KeyBERT
# Prepare documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
def preprocess_text(documents: np.ndarray):
""" Basic preprocessing of text
Steps:
* Replace \n and \t with whitespace
* Only keep alpha-numerical characters
"""
cleaned_documents = [doc.replace("\n", " ") for doc in documents]
cleaned_documents = [doc.replace("\t", " ") for doc in cleaned_documents]
cleaned_documents = [re.sub(r'[^A-Za-z0-9 ]+', '', doc) for doc in cleaned_documents]
cleaned_documents = [doc if doc != "" else "emptydoc" for doc in cleaned_documents]
return cleaned_documents
docs = preprocess_text(docs)
# Extract keywords
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs)
# Create our vocabulary
vocabulary = [k[0] for keyword in keywords for k in keyword]
vocabulary = list(set(vocabulary))
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model= CountVectorizer(vocabulary=vocabulary)
topic_model = BERTopic(vectorizer_model=vectorizer_model, verbose=True)
topics, probs = topic_model.fit_transform(docs)
It would be nice to just use the internal function _preprocess_text so that I don't have to keep track of any changes to that function, but we are creating the vocabulary before the BERTopic object exists, so I'm not sure if there is a cleaner way to do this.
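If relying on a private method is acceptable, something like this might work (untested, and it assumes _preprocess_text is an instance method on BERTopic, which could change between releases):
import numpy as np
from bertopic import BERTopic
# Reuse BERTopic's own cleaning on the raw docs before building the vocabulary
cleaned_docs = BERTopic()._preprocess_text(np.asarray(docs))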
I'm happy for this to be closed, but you may wish to update the KeyBERT example in Tips and Tricks so that it actually runs (and you may have a more elegant solution than I have above 😄 ).
Thanks for figuring this out! Definitely seems like a bug that should be fixed in a future release. I am a bit surprised though that this happened since the steps for preprocessing are almost identical to what is happening inside the CountVectorizer. Do you by chance have an example of a document that gets rendered differently by using the preprocessing steps?
Sure, most docs in newsgroups have at least one example, but try docs[8]. It has a few:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import re
# Prepare documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
def preprocess_text(documents: np.ndarray):
""" Basic preprocessing of text
Steps:
* Replace \n and \t with whitespace
* Only keep alpha-numerical characters
"""
cleaned_documents = [doc.replace("\n", " ") for doc in documents]
cleaned_documents = [doc.replace("\t", " ") for doc in cleaned_documents]
cleaned_documents = [re.sub(r'[^A-Za-z0-9 ]+', '', doc) for doc in cleaned_documents]
cleaned_documents = [doc if doc != "" else "emptydoc" for doc in cleaned_documents]
return cleaned_documents
clean_docs = preprocess_text(docs)
pre_vectorizer_model=CountVectorizer(stop_words="english")
pre_vectorizer_model.fit([docs[8]])
vocabulary = set(pre_vectorizer_model.vocabulary_.keys())
pre_vectorizer_model=CountVectorizer(stop_words="english")
pre_vectorizer_model.fit([clean_docs[8]])
clean_vocabulary = set(pre_vectorizer_model.vocabulary_.keys())
vocabulary.difference(clean_vocabulary)
{'bruin', 'didn', 'fuhr', 'sabre', 've'}
docs[8]
"\n\n\nYeah, it's the second one. And I believe that price too. I've been trying\nto get a good look at it on the Bruin-Sabre telecasts, and wow! does it ever\nlook good. Whoever did that paint job knew what they were doing. And given\nFuhr's play since he got it, I bet the Bruins are wishing he didn't have it:)\n"
It looks like the removal of punctuation causes them to split differently.
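For example, something along these lines should show the difference (the outputs in the comments are what I'd expect from the default analyzer; they may vary by sklearn version):
from sklearn.feature_extraction.text import CountVectorizer
analyzer = CountVectorizer().build_analyzer()
# The default analyzer treats punctuation as a token boundary (and drops single characters)
print(analyzer("Bruin-Sabre telecasts, didn't"))
# ['bruin', 'sabre', 'telecasts', 'didn']
# The regex cleanup instead deletes the punctuation, fusing the surrounding characters into one token
print(analyzer("BruinSabre telecasts didnt"))
# ['bruinsabre', 'telecasts', 'didnt']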
If you compare the vocabularies from the whole dataset, a lot of the different words involve numbers too, but again, it may be down to punctuation removal.
pre_vectorizer_model=CountVectorizer(stop_words="english")
pre_vectorizer_model.fit(docs)
vocabulary = set(pre_vectorizer_model.vocabulary_.keys())
pre_vectorizer_model=CountVectorizer(stop_words="english")
pre_vectorizer_model.fit(clean_docs)
clean_vocabulary = set(pre_vectorizer_model.vocabulary_.keys())
vocabulary.difference(clean_vocabulary)
{'8337',
'mwhwc',
'dob',
'mgga2a',
'd8yq',
'eb1l',
'7lbs',
'lk8rlhzrlk8v',
'iinsfdop93nn',
'kelleyb',
'wa4e',
'wzr',
'm66l4i',
'e_xv3',
'm87',
'mcng',
'qvdz',
'77pt',
'pmjpeg11',
'vokabul',
'0vy4b',
'u40q',
'w4k',
'ldmy',
'kd4fkw',
'vsbd',
'0669',
'3mh',
'acxx',
'4wj',
'obqp',
'xbl4',
'p6x',
'jn2',
...
Thanks for the extensive example! It definitely seems like the preprocessing steps change the way these documents are represented, even if they are just small cleaning steps. Hmmm, I am not sure what the best approach here is. Technically, I could remove the preprocessing step entirely since the CountVectorizer takes care of most of it, but I am not too keen on letting it handle things like "\n" and such.
I didn't necessarily see it as a bug for my use case. I like the preprocessing, and I wouldn't necessarily want things like "couldn't" being transformed to ["couldn", "t"], which I think is what CountVectorizer would do on its own. It makes sense that I would have to replicate any cleaning you are doing internally if I want to preprocess a vocabulary outside of BERTopic.
Is the mismatch in preprocessing causing/likely to cause any other issues? Or can it just be left as is?
It is quite an uncommon error, and it is often resolved with some simple settings (for example, reducing words with c-TF-IDF's parameters) or, indeed, by accessing the internal preprocessing, so I would leave it for now. That said, thanks for sharing this! It definitely helps to understand how people are using some of this functionality and the issues they run into.
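For example, one way of reducing words in the c-TF-IDF step could look like this (one possible reading of those parameters, assuming a recent BERTopic release):
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
# Down-weight very frequent words when calculating the topic representations
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)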
Hi Maarten,
I'm having an issue with some important words not appearing in the CountVectorizer when using min_df, even though they are well above the set threshold. My understanding of min_df is that when it is an integer (say, 10), all words that appear in at least 10 documents will be kept. Is that correct? See the example below; take the word 'cancer', for example. 'cancer' appears in at least 32 documents (not counting occurrences up against punctuation), so it should be well above min_df, yet looking it up returns a key error.
Also try the words: privacy, reward, quantum, fuzzy, hashing, planning, player, robot, embeddings.
If I remove the min_df param and just use CountVectorizer(), then these words are all present. I first came across this with the word 'rust', which was an important word in a large topic with ~70 document occurrences and min_df=10. It isn't prominent in the arxiv dataset though. Do you know what is happening here?