chadlillian opened 1 month ago
I added this to the end of my code before serialization; no errors, but I haven't validated the results yet.
from sklearn.feature_extraction.text import CountVectorizer

# Swap in a plain CountVectorizer that carries only the learned vocabulary,
# so the saved model no longer depends on the online vectorizer's fitted state.
z = CountVectorizer()
z.vocabulary_ = topic_model.vectorizer_model.vocabulary_
topic_model.vectorizer_model = z
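A quick way to sanity-check the swap (a sketch; it assumes `topic_model` is the fitted model from the reproduction script below):

```python
# If the vocabulary carried over, transform should no longer raise the
# NotFittedError shown as ERROR 2 below.
bow = topic_model.vectorizer_model.transform(["a quick test document"])
print(bow.shape)  # (1, vocabulary size)
```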
Thank you for sharing this. The model saved with safetensors will indeed not work, since safetensors does not save the underlying dimensionality-reduction and clustering algorithms. I am, however, surprised that this does not work with pickle, since pickle should save the entire state. It seems you are using an older version of BERTopic; could you try using a newer version?
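To make the difference concrete, a minimal sketch using the file names from the reproduction below (`model_p_10.pkl` / `model_p_10`):

```python
from bertopic import BERTopic

# Pickle stores the full Python object graph, including the fitted
# IncrementalPCA, MiniBatchKMeans, and vectorizer sub-models.
full_model = BERTopic.load("model_p_10.pkl")

# Safetensors keeps only lightweight state (topic embeddings, c-TF-IDF,
# vocabulary); the dimensionality-reduction and clustering models are dropped.
light_model = BERTopic.load("model_p_10")
```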
Ah, it might just be that the decay parameter is set too high and that, after too many iterations, entire rows of the bag-of-words matrix decay to 0, which would produce the infinities seen in ERROR 1.
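As a rough numeric illustration (a sketch of the decay mechanism, not BERTopic internals verbatim): `OnlineCountVectorizer` scales the existing bag-of-words counts by `(1 - decay)` at every `partial_fit`, and words whose counts fall below `delete_min_df` are pruned entirely, so a row that stops receiving new documents shrinks toward zero; an all-zero row then turns the c-TF-IDF computation into a division by zero:

```python
# Repeated multiplicative decay of a count that receives no new documents.
decay = 0.01
count = 100.0
for _ in range(60):
    count *= 1 - decay  # applied once per partial_fit call
print(count)  # ~54.7 after 60 updates; smaller counts are pruned away sooner
```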
Have you searched existing issues? 🔎
Describe the bug
I have built a model with partial_fit (using the code found in the documentation). I then serialize the model with both pickle and safetensors, and load it again.
The loaded pickled model works only if I executed partial_fit a few times (<5) and throws the first error below if I executed partial_fit many times (>60). The loaded safetensors model always throws the second error below.
ERROR 1 ##############################
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chad/anaconda/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1218, in approximate_distribution
    similarity = cosine_similarity(c_tf_idf_doc, self.c_tf_idf_[self._outliers:])
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 1657, in cosine_similarity
    X, Y = check_pairwise_arrays(X, Y)
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 164, in check_pairwise_arrays
    X = check_array(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 917, in check_array
    array = _ensure_sparse_format(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 593, in _ensure_sparse_format
    _assert_all_finite(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 126, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 175, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input contains infinity or a value too large for dtype('float64').
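For anyone reproducing this, a quick check of the pickled model for the non-finite c-TF-IDF values that `cosine_similarity` rejects (file name assumed from the reproduction below):

```python
import numpy as np
from bertopic import BERTopic

loaded = BERTopic.load("model_p_10.pkl")
# c_tf_idf_ is a scipy sparse matrix; .data holds its stored entries.
print(np.isfinite(loaded.c_tf_idf_.data).all())  # False when ERROR 1 occurs
```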
ERROR 2 #################################
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chad/anaconda/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1216, in approximate_distribution
    bow_doc = self.vectorizer_model.transform(all_sentences)
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 1431, in transform
    self._check_vocabulary()
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 508, in _check_vocabulary
    raise NotFittedError("Vocabulary not fitted or provided")
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided
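And the corresponding check for the safetensors model, whose vectorizer appears to be missing the fitted vocabulary that sklearn's `_check_vocabulary` looks for:

```python
from bertopic import BERTopic

loaded = BERTopic.load("model_p_10")
# transform() raises NotFittedError when this attribute is missing.
print(hasattr(loaded.vectorizer_model, "vocabulary_"))
```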
Reproduction
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
import pandas as pd
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans
from bertopic.vectorizers import OnlineCountVectorizer

d = pd.HDFStore('wiki_edgelist_s.hdf')
hdfkeys = d.keys()
d.close()

umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model, embedding_model=embedding_model)

frac = 0.01
nh = -1
topics = []
for i, hk in enumerate(hdfkeys[:nh]):
    df = pd.read_hdf('wiki_pages_s.hdf', key=hk)
    # ... partial_fit on each chunk, as in the documentation's online-topic-modeling example

size = 10
topic_model.topics_ = topics
topic_model.save('model_p_%i.pkl' % size, serialization="pickle")
topic_model.save('model_p_%i' % size, serialization="safetensors",
                 save_embedding_model=embedding_model, save_ctfidf=True)
###################### Loading Model:
from bertopic import BERTopic
import pandas as pd

d = pd.HDFStore('wiki_edgelist_s.hdf')
hdfkeys = d.keys()
d.close()

df = pd.read_hdf('wiki_pages_s.hdf', key=hdfkeys[0])
docs = df['text'].iloc[:10]

lm = BERTopic.load("model_p_10")      # safetensors model -> ERROR 2
lm = BERTopic.load("model_p_10.pkl")  # pickled model -> ERROR 1 (after many partial_fit calls)
lm.approximate_distribution(docs)
BERTopic Version
0.16.1