Open donottakemyusername opened 1 year ago
I am not sure whether you actually did something wrong here. Could you share your full code for training and saving the model? I think you could still use serialization="pickle"
but that might not be what you are looking for.
Hi @MaartenGr I'm experiencing the same problem. Here is my code:
class WrappedRiverClusterAlgo:
"""Wraps a River model so that it can be used to train the model in chunks of data similar
to online training
"""
def __init__(self, model):
self.model = model
def partial_fit(self, umap_embeddings):
for umap_embedding, _ in stream.iter_array(umap_embeddings):
self.model = self.model.learn_one(umap_embedding)
labels = []
for umap_embedding, _ in stream.iter_array(umap_embeddings):
label = self.model.predict_one(umap_embedding)
labels.append(label)
self.labels_ = labels
return self
# Step 1 - Extract embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# Step 2 - Reduce dimensionality
umap_model = IncrementalPCA(n_components=5)
# Step 3 - Cluster reduced embeddings
cluster_model = WrappedRiverClusterAlgo(cluster.CluStream())
# Step 4 - Tokenize topics
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01, delete_min_df=10.00,
ngram_range=(2,2))
# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = KeyBERTInspired(nr_repr_docs=15,random_state=100)
# All steps together
topic_model = BERTopic(
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=cluster_model,
vectorizer_model=vectorizer_model,
ctfidf_model=ctfidf_model,
calculate_probabilities=True,
representation_model=representation_model,
nr_topics="auto",
verbose=True)
for data in dataset:
topic_model.partial_fit(data)
topics.extend(topic_model.topics_)
# Update model topics attribute
topic_model.topics_ = topics
# Save the model
topic_model.save(model_safatensors_path, serialization="safetensors", save_ctfidf=True,
save_embedding_model="sentence-transformers/all-MiniLM-L6-v2")
Additionally, here is the backtrace;
Traceback (most recent call last):
File "/Desktop/projects/app/runner_model.py", line 179, in <module>
model()
File "/.cache/pypoetry/virtualenvs/DQ6JMim6-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/.cache/pypoetry/virtualenvs/DQ6JMim6-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/.cache/pypoetry/virtualenvs/DQ6JMim6-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/.cache/pypoetry/virtualenvs/DQ6JMim6-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/.cache/pypoetry/virtualenvs/DQ6JMim6-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/Desktop/projects/app/runner_model.py", line 133, in model_with_bert_topic
use_mmr=usemmr,use_keybert=usekeybert).model()
File "/Desktop/projects/app/app/nlp_engine/use/__init__.py", line 142, in model
self.online_training(WrappedRiverClusterAlgo(cluster.CluStream()))
File "/Desktop/projects/app/app/nlp_engine/use/__init__.py", line 204, in online_training
topic_model.save(model_safatensors_path,
File "/.cache/pypoetry/virtualenvs/DQ6JMim6-py3.10/lib/python3.10/site-packages/bertopic/_bertopic.py", line 2963, in save
save_utils.save_ctfidf_config(model=self, path=save_directory / 'ctfidf_config.json')
File "/.cache/pypoetry/virtualenvs/DQ6JMim6-py3.10/lib/python3.10/site-packages/bertopic/_save_utils.py", line 350, in save_ctfidf_config
del cv_params["tokenizer"], cv_params["preprocessor"], cv_params["dtype"]
KeyError: 'tokenizer'
Yeah I think my code is similar. The problem is for our model countvectorizer, there is no parameters such as "tokenizer" or "preprocessor". When I called model.vectorizer_model.get_params()
and it only contains 2 parameters: {'decay': 0.1, 'delete_min_df': None}
. So when the save_ctfidf_config
function calls del cv_params["tokenizer"], cv_params["preprocessor"], cv_params["dtype"]
without checking, it returns a key error. I am not entirely sure the point of deleting these parameters without checking, I am just going to try to remove those lines and see if things work properly. But in the meantime if you can let us know if there is anything we can do it would be really helpful. Thank you!
Just an update, I feel like the serialization technique does not work for incremental learning techniques which use OnlineCountVectorizer. It only works for regular CountVectorizer. Please correct me if I am wrong.
I think this is an issue with OnlineCountVectorizer
not properly inheriting everything from its base class CountVectorizer
. Those lines should not be removed since you would not be able to re-create the vectorizer. I believe two things need to change:
OnlineCountVectorizer
OnlineCountVectorizer
specific parameters.
Hi, I am using the partial_fit function to perform incremental learning with BERTopic. When I tried to save the BERTopic model using safetensors, I got the following error: KeyError: 'tokenizer'. The error was raised in bertopic/_save_utils.py when the function tries to recreate the countvectorizer delete the parameters in cv but they don't actually exist. I tried to save the model using the code: model.save('some_directory', serialization="safetensors", save_ctfidf=True), and here is the error code I got: /python3.9/site-packages/bertopic/_save_utils.py in save_ctfidf_config(model, path) 293 # Recreate CountVectorizer 294 cv_params = model.vectorizer_model.get_params() --> 295 del cv_params["tokenizer"], cv_params["preprocessor"], cv_params["dtype"] 296 if not isinstance(cv_params["analyzer"], str): 297 del cv_params["analyzer"]
KeyError: 'tokenizer'
I have run the function model.vectorizer_model.get_params() and it only contains 2 parameters: {'decay': 0.05, 'delete_min_df': None}. Is there anything I've done wrong? Thank you!