MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

ValueError: Input contains NaN, ... when doing fit_transform(data) #958

Closed. aleksandar-devedzic closed this issue 1 year ago.

aleksandar-devedzic commented 1 year ago

I want to build a BERTopic model with my own clustering algorithm (KMeans) and my own vectorizer (CountVectorizer), but I keep getting this error when I call .fit_transform(data):

Warning:

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bertopic/vectorizers/_ctfidf.py:69: RuntimeWarning:

divide by zero encountered in divide

And then, error:

ValueError                                Traceback (most recent call last)
<ipython-input-104-1f024d22018f> in <module>
----> 1 topics, probs = bert_topic_model.fit_transform(final_df.body)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
    368             self._map_representative_docs(original_topics=True)
    369         else:
--> 370             self._save_representative_docs(documents)
    371 
    372         self.probabilities_ = self._map_probabilities(probabilities, original_topics=True)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bertopic/_bertopic.py in _save_representative_docs(self, documents)
   3000                 bow = self.vectorizer_model.transform(selected_docs)
   3001                 ctfidf = self.ctfidf_model.transform(bow)
-> 3002                 sim_matrix = cosine_similarity(ctfidf, self.c_tf_idf_[topic + self._outliers])
   3003 
   3004                 # Extract top 3 most representative documents

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
   1178     # to avoid recursive import
   1179 
-> 1180     X, Y = check_pairwise_arrays(X, Y)
   1181 
   1182     X_normalized = normalize(X, copy=True)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
    144                             estimator=estimator)
    145     else:
--> 146         X = check_array(X, accept_sparse=accept_sparse, dtype=dtype,
    147                         copy=copy, force_all_finite=force_all_finite,
    148                         estimator=estimator)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    648     if sp.issparse(array):
    649         _ensure_no_complex_data(array)
--> 650         array = _ensure_sparse_format(array, accept_sparse=accept_sparse,
    651                                       dtype=dtype, copy=copy,
    652                                       force_all_finite=force_all_finite,

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse)
    446                           % spmatrix.format, stacklevel=2)
    447         else:
--> 448             _assert_all_finite(spmatrix.data,
    449                                allow_nan=force_all_finite == 'allow-nan')
    450 

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    101                 not allow_nan and not np.isfinite(X).all()):
    102             type_err = 'infinity' if allow_nan else 'NaN, infinity'
--> 103             raise ValueError(
    104                     msg_err.format
    105                     (type_err,

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

This is my full code:


from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

features = final_df["body"]  # does not have NaN or infinite values, I have checked 10 times

transformerVectoriser = CountVectorizer(analyzer='word', ngram_range=(1, 4), vocabulary=vocab_list)
# my vocab_list does not have NaN or infinite values, I have checked 10 times

cluster_model = KMeans(n_clusters=50, init='k-means++', max_iter=1500, random_state=None)

bert_topic_model = BERTopic(hdbscan_model=cluster_model,
                            vectorizer_model=transformerVectoriser,
                            verbose=True,
                            top_n_words=15)

# final_df.body does not have NaN or infinite values, I have checked 10 times
topics, probs = bert_topic_model.fit_transform(final_df.body)  # ERROR

I really do not know what the problem is or what is going on. All values in vocab_list are strings, and all values in final_df.body are strings.

MaartenGr commented 1 year ago

My guess here is that there might be some topics that contain empty documents after tokenization. Although all of your documents may be strings and filled with words, the vocabulary parameter restricts the vectorizer to the terms you passed in, and only tokens that appear in that vocabulary are counted. It might be that the vocabulary you passed does not contain any of the words found in a specific topic. As a result, there might be a topic whose bag-of-words contains only zeros.

What happens if you do not set vocabulary? Also, how many documents are in final_df.body? One thing to note: BERTopic expects a list of strings, not a pandas Series. It should not give any issues, but it has not been fully tested for that.
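One way to test this hypothesis outside BERTopic is to count how many documents end up with an all-zero bag-of-words under the custom vocabulary. A minimal sketch, not from the original thread, assuming docs is the list of documents and vocab_list the vocabulary used above:

from sklearn.feature_extraction.text import CountVectorizer

# Same vectorizer settings as above, restricted to the custom vocabulary
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 4), vocabulary=vocab_list)
bow = vectorizer.transform(docs)

# Rows that sum to zero are documents containing no term from vocab_list
empty = int((bow.sum(axis=1) == 0).sum())
print(f"{empty} of {bow.shape[0]} documents share no terms with the vocabulary")

If that count is non-zero, a cluster made up entirely of such documents would produce the all-zero c-TF-IDF row and the divide-by-zero warning shown above.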

aleksandar-devedzic commented 1 year ago

My vocabulary list contains around 90k elements. My final_df.body contains 15k elements (FYI, I have tried with 100, 1000, 1000, and 15k elements; it gives me the same error). If I use, for example:

hdbscan_model = kmeans_cluster_model,
vectorizer_model = transformerVectoriser, 
embedding_model = "paraphrase-multilingual-MiniLM-L12-v2"

or only: embedding_model = "paraphrase-multilingual-MiniLM-L12-v2"

I do not get any errors.

MaartenGr commented 1 year ago

So removing the cluster model prevents the issue from happening? Strange. Perhaps there are some micro-clusters being generated with k-Means that result in these issues, but I am not entirely sure why.

aleksandar-devedzic commented 1 year ago

My bad, I forgot to add the clustering model in the previous comment; it happens both with and without KMeans.

MaartenGr commented 1 year ago

Just to be sure I understood it correctly, it only happens if you use the vectorizer_model? If so, can you try it with the vectorizer model but without setting vocabulary?

aleksandar-devedzic commented 1 year ago

It only happens with a custom vocabulary. If I use this, it works (it does not matter whether I use max_features=100 or max_features=100000):

transformerVectoriser = CountVectorizer(analyzer='word', ngram_range=(1, 4), max_features=30000)

But when I pass my custom vocabulary (a list of strings, around 100k elements), I get the error.

MaartenGr commented 1 year ago

Most likely, the vocabulary is too limited for certain clusters. It might just be that some clusters do not contain any of the words in the given vocabulary and, as a result, their bag-of-words is completely sparse.
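Since fit_transform fails before returning topic assignments, one way to confirm this is to reproduce the clustering step outside BERTopic and check the vocabulary overlap per cluster. A rough sketch, not from this thread; docs is the list of documents, vocab_list the custom vocabulary, and the embedding model and random_state are illustrative choices:

import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Embed and cluster the documents roughly the way the BERTopic pipeline would
embeddings = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2").encode(docs)
labels = KMeans(n_clusters=50, random_state=42).fit_predict(embeddings)

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 4), vocabulary=vocab_list)
clusters = pd.DataFrame({"Document": docs, "Topic": labels})

for topic, group in clusters.groupby("Topic"):
    bow = vectorizer.transform(group["Document"])
    if bow.sum() == 0:
        # This cluster shares no terms with the custom vocabulary, so its
        # bag-of-words is all zeros and the NaN / divide-by-zero follows
        print(f"Cluster {topic} has no overlap with the custom vocabulary")

Any cluster flagged here is a candidate for the all-zero c-TF-IDF row described above.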

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!