My guess here is that there might be some topics that contain empty documents after tokenization. Although all of your documents might be strings filled with words, the vocabulary parameter tokenizes each document and counts how many of its tokens appear in vocabulary. It might be that the vocabulary you passed does not contain any of the words found in a specific topic; as a result, that topic would have a bag-of-words that contains only zeros. What happens if you do not set vocabulary? Also, how many documents are in final_df.body? One thing to note: BERTopic expects a list of strings and not a pandas Series. It should not give any issues, but it was not fully tested for that.
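As a rough illustration of that failure mode (a minimal sketch; the words and documents here are made up):

from sklearn.feature_extraction.text import CountVectorizer

# A fixed vocabulary that misses every token of the first document
vectorizer = CountVectorizer(vocabulary=["apple", "banana"])
bow = vectorizer.fit_transform(["the quick brown fox", "apple pie"])
print(bow.toarray())
# [[0 0]   <- no overlap with the vocabulary: an all-zero bag-of-words
#  [1 0]]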
My vocabulary list contains around 90k elements. My final_df.body contains 15k elements (fyi, I have tried with 100, 1000, 10000, and 15k elements; it gives me the same error). If I use, for example:

hdbscan_model = kmeans_cluster_model,
vectorizer_model = transformerVectoriser,
embedding_model = "paraphrase-multilingual-MiniLM-L12-v2"

or only:

embedding_model = "paraphrase-multilingual-MiniLM-L12-v2"

I do not get any errors.
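For context, that working configuration would look roughly like this (a sketch assuming the model names from this thread; the cluster count is made up):

from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

kmeans_cluster_model = KMeans(n_clusters=50)
transformerVectoriser = CountVectorizer(analyzer='word', ngram_range=(1, 4), max_features=30000)

topic_model = BERTopic(
    embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
    vectorizer_model=transformerVectoriser,
    hdbscan_model=kmeans_cluster_model,  # BERTopic accepts any clusterer with fit/predict here
)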
So removing the cluster model prevents the issue from happening? Strange. Perhaps there are some micro-clusters being generated with k-Means that result in some issues, but I am not entirely sure why.
My bad, I forgot to add the clustering model in the previous comment; it happens both with and without KMeans.
Just to be sure I understood it correctly: it only happens if you use the vectorizer_model? If so, can you try it with the vectorizer model but without setting vocabulary?
It only happens with a custom vocabulary. If I use this, it works (it does not matter whether I use max_features=100 or max_features=100000):

transformerVectoriser = CountVectorizer(analyzer = 'word', ngram_range = (1, 4), max_features = 30000)

But when I pass my custom vocabulary (a list of strings, around 100k elements), I get the error.
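The difference between the two setups boils down to this (a sketch; vocab_list stands for the ~100k-term list mentioned above):

# Works: the vectorizer learns its own vocabulary from the documents
works = CountVectorizer(analyzer='word', ngram_range=(1, 4), max_features=30000)

# Errors out: the vocabulary is fixed up front and may not cover every cluster
fails = CountVectorizer(analyzer='word', ngram_range=(1, 4), vocabulary=vocab_list)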
Most likely, that vocabulary is too limited in the case of certain clusters. It might just be that some clusters do not contain any of the words in the given vocabulary, and as a result their bag-of-words representation is completely sparse.
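One way to check that diagnosis (a sketch; docs and vocab_list stand for the documents and vocabulary from this thread):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='word', ngram_range=(1, 4), vocabulary=vocab_list)
counts = cv.fit_transform(docs)
empty = int((counts.sum(axis=1) == 0).sum())  # documents with zero vocabulary hits
print(f"{empty} of {len(docs)} documents share no n-grams with the vocabulary")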
Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open it!
I want to build a BERTopic model with my own clustering algorithm (KMeans) and my own vectorizer (CountVectorizer), but I keep getting this error when I call .fit_transform(data):

Warning:

And then, the error:

This is my full code:

I really do not know what the problem is or what is going on. All values in vocab_list are string values, and all values in final_df.body are string values.
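For reference, the setup described above would look roughly like this (a hypothetical reconstruction from the details in this thread, not the reporter's actual code; vocab_list and final_df are theirs, the cluster count is made up):

from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

kmeans_cluster_model = KMeans(n_clusters=50)
transformerVectoriser = CountVectorizer(analyzer='word', ngram_range=(1, 4), vocabulary=vocab_list)

topic_model = BERTopic(
    embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
    vectorizer_model=transformerVectoriser,
    hdbscan_model=kmeans_cluster_model,
)
topics, probs = topic_model.fit_transform(final_df.body.tolist())  # a list of strings, per the note above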