MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

serializing a model built with partial_fit #2196

Open chadlillian opened 1 month ago

chadlillian commented 1 month ago

Have you searched existing issues? 🔎

Describe the bug

I built a model with partial_fit (using the code found in the documentation), then serialized it with both pickle and safetensors, and loaded it back.

The loaded pickled model works only if I executed partial_fit a few times (<5) and throws the first error below if I executed partial_fit many times (>60). The loaded safetensors model always throws the second error below.

ERROR 1 ##############################

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chad/anaconda/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1218, in approximate_distribution
    similarity = cosine_similarity(c_tf_idf_doc, self.c_tf_idf_[self._outliers:])
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 1657, in cosine_similarity
    X, Y = check_pairwise_arrays(X, Y)
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 164, in check_pairwise_arrays
    X = check_array(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 917, in check_array
    array = _ensure_sparse_format(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 593, in _ensure_sparse_format
    _assert_all_finite(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 126, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 175, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input contains infinity or a value too large for dtype('float64').

ERROR 2 #################################

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chad/anaconda/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1216, in approximate_distribution
    bow_doc = self.vectorizer_model.transform(all_sentences)
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 1431, in transform
    self._check_vocabulary()
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 508, in _check_vocabulary
    raise NotFittedError("Vocabulary not fitted or provided")
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided

Reproduction

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
import pandas as pd
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans
from bertopic.vectorizers import OnlineCountVectorizer

d = pd.HDFStore('wiki_edgelist_s.hdf')
hdfkeys = d.keys()
d.close()

umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=cluster_model, vectorizer_model=vectorizer_model, embedding_model=embedding_model)

frac = 0.01
nh = -1
topics = []
for i, hk in enumerate(hdfkeys[:nh]):
    df = pd.read_hdf('wiki_pages_s.hdf', key=hk)
    dfi = df.sample(frac=frac, weights=df['num_words'])
    docs = dfi['text'].tolist()
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

size = 10
topic_model.topics_ = topics
topic_model.save('model_p_%i.pkl' % size, serialization="pickle")
topic_model.save('model_p_%i' % size, serialization="safetensors", save_embedding_model=embedding_model, save_ctfidf=True)

###################### Loading Model:

from bertopic import BERTopic
import pandas as pd

d = pd.HDFStore('wiki_edgelist_s.hdf')
hdfkeys = d.keys()
d.close()

df = pd.read_hdf('wiki_pages_s.hdf', key=hdfkeys[0])
docs = df['text'].iloc[:10]

lm = BERTopic.load("model_p_10")

lm = BERTopic.load("model_p_10.pkl") lm.approximate_distribution(docs)

BERTopic Version

0.16.1

chadlillian commented 1 month ago

I added this to the end of my code before serialization; there are no errors now, but I haven't validated the results yet.

from sklearn.feature_extraction.text import CountVectorizer

# Plain CountVectorizer carrying the learned vocabulary, so that
# transform() finds a fitted vocabulary_ after the model is reloaded.
z = CountVectorizer()
z.vocabulary_ = topic_model.vectorizer_model.vocabulary_
topic_model.vectorizer_model = z
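This lines up with the second traceback above: transform() raises NotFittedError only because no fitted vocabulary_ attribute is found, so copying the learned vocabulary onto a plain CountVectorizer satisfies that check after reloading (though it drops the online vectorizer's decay behaviour for any further partial_fit calls).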
MaartenGr commented 3 weeks ago

Thank you for sharing this. The model saved with safetensors will indeed not work since it does not save the underlying dimensionality reduction and clustering algorithms.
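To make that concrete, a minimal sketch of a possible workaround, assuming the fitted sub-models from the training session are still in memory (the attribute names mirror the constructor arguments):

loaded = BERTopic.load("model_p_10")
# Re-attach the fitted sub-models that the safetensors format does not store.
loaded.umap_model = umap_model              # fitted IncrementalPCA
loaded.hdbscan_model = cluster_model        # fitted MiniBatchKMeans
loaded.vectorizer_model = vectorizer_model  # fitted OnlineCountVectorizer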

I am, however, surprised that this does not work with pickle, since it should save the entire state. It also seems you are using an older version of BERTopic; could you try a newer one?

Ah, it might just be that the decay parameter is set too high and that after too many iterations, entire rows get 0 values.
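A rough numeric illustration of that effect (a hedged sketch of the decay mechanism, not BERTopic's exact internals): every partial_fit scales the accumulated bag-of-words counts by (1 - decay), so infrequent words shrink geometrically and can be cleaned away, and normalizing an all-zero topic row then yields non-finite values, which is what sklearn's finiteness check rejects inside cosine_similarity.

import numpy as np

decay = 0.01
count = 1.0  # a word counted once in an early batch
for _ in range(60):
    count *= 1 - decay  # geometric shrinkage on every partial_fit
print(count)  # ~0.55 after 60 updates; small counts may be cleaned to zero

row = np.zeros(4)  # a topic row whose counts have all decayed away
with np.errstate(invalid="ignore", divide="ignore"):
    normalized = row / row.sum()  # 0 / 0 -> nan, a non-finite value
print(normalized)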