I'm on an M3 MacBook Pro
Python 3.12.4
scikit-learn 1.5.1
bertopic 0.16.3
numpy 1.26.4
scipy 1.14.0
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic.vectorizers import OnlineCountVectorizer
from bertopic import BERTopic
# Prepare documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]
# Prepare sub-models that support online learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)
# Incrementally fit the topic model by training on 1000 documents at a time
for index in range(0, len(docs), 1000):
    topic_model.partial_fit(docs[index: index+1000])
Hmmm, I have seen this issue since a recent scikit-learn update, but it seems there isn't a fix yet. You could try applying the solution suggested here after each partial fit to see whether that helps.
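Since the stack trace is missing from the report, it may help to find out which batch triggers the error. Below is a minimal sketch, not part of BERTopic: `iter_batches` and `partial_fit_in_batches` are hypothetical helper names, and the only assumption about the model is that it exposes a `partial_fit` method, as in the example above.

```python
import traceback
from typing import Iterator, Optional, Sequence


def iter_batches(docs: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of docs, each holding at most batch_size items."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


def partial_fit_in_batches(model, docs: Sequence[str], batch_size: int = 1000) -> Optional[int]:
    """Call model.partial_fit on each batch in order.

    Returns the index of the first batch that raised, or None if every
    batch was fitted. The full traceback is printed so it can be pasted
    into the issue. (Hypothetical helper, not a BERTopic API.)
    """
    for batch_index, batch in enumerate(iter_batches(docs, batch_size)):
        try:
            model.partial_fit(batch)
        except Exception:
            print(f"partial_fit failed on batch {batch_index} ({len(batch)} docs)")
            traceback.print_exc()
            return batch_index
    return None
```

With the objects from the example, this would be called as `partial_fit_in_batches(topic_model, docs, 1000)`. If it reports batch 0, the very first call fails; a later index would suggest the error depends on state carried over from earlier batches.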
Have you searched existing issues? 🔎
Describe the bug
Running the example code from the partial_fit example in the docs throws an error.
Thanks for the help!
Stack trace:
Reproduction
Copied and pasted from the example code here
BERTopic Version
0.16.3
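For anyone trying to reproduce this, here is a small sketch for collecting the same version information in one go. `collect_versions` is a hypothetical helper name; it uses only the standard library's `importlib.metadata` and `platform` modules, and the package list is simply the one from this report.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages):
    """Return a dict mapping 'python' and each package name to its installed version."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Package names as listed in this issue
    for name, ver in collect_versions(["scikit-learn", "bertopic", "numpy", "scipy"]).items():
        print(f"{name} {ver}")
```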