MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Additional representations did not update with topic reduction #2035

Open vidieo opened 3 months ago

vidieo commented 3 months ago

Hi, I am trying to reduce the number of topics I have with topic_model.reduce_topics(docs, nr_topics=400), which runs fine. However, when I then call topic_model.get_topic_info(), I get mismatched representations: only the main representation was updated, while all the other aspects still came from the old topics.

[Screenshot: get_topic_info() output in which the KeyBERT and MMR columns still show the old topics' representations]

I understand the preferred way to control the number of topics is min_cluster_size, which I did use, but it would be nice to know whether I can use reduce_topics and have the additional representations updated as well. Thanks in advance!
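One way to see the mismatch programmatically, besides the table above (attribute names taken from BERTopic's internals, so treat this as a sketch rather than a documented check):

# Compare the topic IDs of the main representation with those of each aspect
main_topics = set(topic_model.topic_representations_.keys())

for aspect, reps in topic_model.topic_aspects_.items():
    if set(reps.keys()) != main_topics:
        print(f"{aspect} is out of sync: {len(reps)} topics vs "
              f"{len(main_topics)} in the main representation")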

MaartenGr commented 3 months ago

Strange, it seems that they are updated for some topics but not others. If I'm not mistaken, topic 396 is not properly updated, but topic 0 is, right?

Also, can you share your full code along with the versions of your environment?

vidieo commented 3 months ago

Thanks for such a great project and the quick response @MaartenGr! The additional representations do not get updated by the reduce_topics method, so topic 396 here, for example, has the KeyBERT and MMR representations of the old topic 396. It was just a coincidence that the first three topics looked similar before and after reduction. After a few more runs I learned that this happens only when loading a saved model, since no sub-models are saved with it. Is there a way to pass these sub-models so I can tweak the topics of a saved model?

I am running bertopic 0.16.2 on Python 3.10.12.

The code:

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic import BERTopic
import pickle

embedding_model = SentenceTransformer("all-mpnet-base-v2")

with open("/content/drive/MyDrive/code_stuff/mpnet_embeddings.pickle", "rb") as pkl:
    embeddings = pickle.load(pkl)

umap_model = UMAP(n_neighbors=20, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)

hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=10,
                        metric="euclidean", cluster_selection_method="eom",
                        prediction_data=True)

vectorizer_model = CountVectorizer(stop_words="english", min_df=5, max_df=0.9,
                                   ngram_range=(1, 3))

keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity=0.3)

representation_model = {"KeyBERT": keybert_model,
                        "MMR": mmr_model}

topic_model = BERTopic(
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  top_n_words=10,
  verbose=True,
)

topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)  # docs: raw documents, prepared elsewhere

topic_model.reduce_topics(docs, nr_topics=400)

topic_model.get_topic_info()

MaartenGr commented 3 months ago

After a few more runs I learned that this happens only when loading a saved model, since no sub-models are saved with it. Is there a way to pass these sub-models so I can tweak the topics of a saved model?

I'm not sure I understand correctly. The code you shared does not show loading a saved BERTopic model, right?

Also, if you need to use nr_topics (which is not something I would recommend), you could also pass that parameter as BERTopic(nr_topics=400). That might work for you.
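A minimal sketch of that suggestion, reusing the sub-models defined in the code above; with nr_topics set at construction, the reduction happens during fit_transform while the representation sub-models are still attached:

topic_model = BERTopic(
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,
  nr_topics=400,  # reduce during fitting instead of calling reduce_topics afterwards
  top_n_words=10,
  verbose=True,
)
topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)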

vidieo commented 3 months ago

I'm not sure I understand correctly. The code you shared does not show loading a saved BERTopic model, right?

Sorry, that's my bad. I shared the original code and not the code for the subsequent runs where I loaded the model. Again, it only happens when loading a saved model, so I will be fine. I'm still looking into the best way to reduce the number of topics for my case, since I do want to keep the small clusters if they are distinct enough; that's why I'm looking into merging methods.
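For anyone reading along with the same loaded-model scenario, one possible workaround (not confirmed in this thread) is to re-create the sub-models after loading and recompute every representation with update_topics, which accepts vectorizer_model and representation_model. The save path below is a placeholder, and this assumes the loaded model still has access to an embedding model, which KeyBERTInspired needs:

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# The sub-models are not serialized, so re-create them after loading
loaded_model = BERTopic.load("path/to/saved_model")
representation_model = {"KeyBERT": KeyBERTInspired(),
                        "MMR": MaximalMarginalRelevance(diversity=0.3)}

# Reduce first, then recompute the main representation and all aspects
loaded_model.reduce_topics(docs, nr_topics=400)  # docs: the documents used to fit
loaded_model.update_topics(docs,
                           vectorizer_model=CountVectorizer(stop_words="english"),
                           representation_model=representation_model)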

MaartenGr commented 3 months ago

@vidieo Then it might indeed be helpful to start with min_topic_size to find the number of topics you are interested in and then manually merge topics instead of using .reduce_topics. If you run into any other problems, let me know!
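As a rough sketch of that workflow (the topic IDs passed to merge_topics are placeholders; pick them from get_topic_info() after inspecting the topics):

topic_model = BERTopic(
  embedding_model=embedding_model,
  min_topic_size=20,  # controls how many topics the clustering produces
  verbose=True,
)
topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)

# Inspect the topics, then merge only the ones that are not distinct enough
topic_model.get_topic_info()
topic_model.merge_topics(docs, topics_to_merge=[[1, 2], [3, 4]])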