MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Zeroshot Topic Modeling With no Embedding Model #2011

Open amirarsalan90 opened 1 month ago

amirarsalan90 commented 1 month ago

Hello @MaartenGr, and thanks for the awesome BERTopic library! I want to perform zero-shot topic modeling with no embedding model. I used an external model to get embeddings for my documents and for the zero-shot topic list, and I no longer have access to that embedding model.

Is it possible to run something like this without an embedding model?

```python
import numpy as np
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Placeholder embeddings; in practice these come from the external model
zeroshot_topic_list_embeddings = np.random.rand(len(zeroshot_topic_list), 1024).astype(np.float32)
document_embeddings = np.random.rand(len(docs), 1024).astype(np.float32)

sim = 0.8
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = KeyBERTInspired(top_n_words=200)
topic_model = BERTopic(
    top_n_words=20,
    ctfidf_model=ctfidf_model,
    verbose=True,
    calculate_probabilities=True,
    embedding_model=None,
    min_topic_size=200,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=sim,
    representation_model=representation_model,
)
topics, probs = topic_model.fit_transform(docs, document_embeddings)
topics, probs = topic_model.transform(docs, document_embeddings)
```

```python
freq = topic_model.get_topic_info()
```

I think somewhere in the code BERTopic is still trying to use the embedding model.

MaartenGr commented 1 month ago

> I think somewhere in the code BERTopic is still trying to use the embedding model.

That's correct! However, not because of zero-shot topic modeling but because you are using KeyBERTInspired. That representation model creates word embeddings that need to be used in order to find which words are semantically similar to a collection of representative documents. As such, an embedding model is still needed for that particular representation model.
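In other words, the zero-shot matching step itself only needs the precomputed embeddings: each document is compared against the zero-shot topic embeddings and assigned when the similarity clears `zeroshot_min_similarity`, so dropping `KeyBERTInspired` (and falling back to the default c-TF-IDF representation) should let the snippet above run without an embedding model. As a rough sketch of that matching logic (not BERTopic's exact implementation), using cosine similarity and `-1` for unmatched documents:

```python
import numpy as np

def zeroshot_assign(doc_embeddings, topic_embeddings, min_similarity=0.8):
    """Assign each document to its closest zero-shot topic, or -1 if none
    clears the similarity threshold. Illustrative only."""
    # Normalize rows so a plain dot product equals cosine similarity
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    sims = docs @ topics.T                       # shape: (n_docs, n_topics)
    best = sims.argmax(axis=1)                   # closest zero-shot topic per doc
    best_sim = sims[np.arange(len(docs)), best]
    # Documents below the threshold would fall through to regular clustering
    return np.where(best_sim >= min_similarity, best, -1)

# Toy example with 2-d "embeddings": two topics on the axes, three documents
topic_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_emb = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])
assignments = zeroshot_assign(doc_emb, topic_emb, min_similarity=0.8)
print(assignments)  # [ 0  1 -1]: the third doc is too far from both topics
```

The embedding model only enters the picture for representations like `KeyBERTInspired`, which must embed candidate words at fit time; the document-to-topic matching works entirely on the arrays you already have.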