vidieo opened this issue 3 months ago
Strange, it seems that they are updated for some topics but not others. If I'm not mistaken, topic 396 is not properly updated, but topic 0 is?
Also, can you share your full code along with the versions of your environment?
Thanks for such a great project and the quick response @MaartenGr! The additional representations do not get updated by the `reduce_topics` method, so for example topic 396 here has the KeyBERT and MMR representations of the old topic 396. It was just a coincidence that the first three topics were similar before and after reduction. After a few more runs I learned that this happens only when loading a saved model, since no sub-models are saved with it. Is there a way to pass these sub-models so I can tweak the topics of a saved model?
I am running bertopic 0.16.2 on Python 3.10.12.
The code:
```python
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic import BERTopic
import pickle

embedding_model = SentenceTransformer("all-mpnet-base-v2")

with open("/content/drive/MyDrive/code_stuff/mpnet_embeddings.pickle", "rb") as pkl:
    embeddings = pickle.load(pkl)

umap_model = UMAP(n_neighbors=20, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=10,
                        metric="euclidean", cluster_selection_method="eom",
                        prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english", min_df=5, max_df=0.9,
                                   ngram_range=(1, 3))
keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model = {"KeyBERT": keybert_model, "MMR": mmr_model}

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=10,
    verbose=True,
)

topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)
topic_model.reduce_topics(docs, nr_topics=400)
topic_model.get_topic_info()
```
> After a few more runs I learned that this happened only when loading a saved model since no sub-models are saved with it. Is there a way to pass these sub-models so I can tweak the topics of a saved model?
I'm not sure if I understand correctly. The code you shared does not show loading a saved BERTopic model, right?
Also, if you need to use `nr_topics` (which is not something I would recommend), you could also use that parameter in `BERTopic(nr_topics=400)`. That might work for you.
> I'm not sure if I understand correctly. The code you shared does not show loading a saved BERTopic model, right?
Sorry, that's my bad. I shared the original code, not the code for the subsequent runs where I loaded the model. Again, it only happens when loading a saved model, so I will be fine. I'm still looking into the best way to reduce the number of topics for my case, since I do want the small clusters if they are distinct enough; that's why I'm looking into merging methods.
@vidieo Then it might indeed be helpful to start with `min_topic_size` to find the number of topics you are interested in and then manually merge topics instead of using `.reduce_topics`. If you run into any other problems, let me know!
Hi, I am trying to reduce the number of topics that I have with `topic_model.reduce_topics(docs, nr_topics=400)`, which works fine. However, when I ran `topic_model.get_topic_info()` I got mismatched representations: only the main representation was updated, and all the other aspects were from the old topics. I understand the preferred method of controlling the number of topics is `min_cluster_size`, which I did use, but it would be nice to know if I could use `reduce_topics` with the additional representations updated as well. Thanks in advance!