MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

A Question on Ensemble Models for BERTopic #228

Closed gsalfourn closed 3 years ago

gsalfourn commented 3 years ago

Hi Maarten,

I was thinking about building an ensemble model with BERTopic, and I have a question for you in that regard: do you have any advice on ensembling topics from different embedding models, such as Sentence Transformers, Flair, spaCy, and USE?

MaartenGr commented 3 years ago

There are a few ways of approaching this.

First, you can use Flair (here or here) to combine/stack/pool different embedding models. The resulting model can simply be passed to BERTopic.

Second, you can create your own custom embedding model and pass that to BERTopic (see documentation here). If you have a custom way of combining/ensembling embedding models, then that might be the preferred method.

Does that answer your question?

gsalfourn commented 3 years ago

Yes, it does. Thanks so much!

Based on your advice, I did some custom modeling that combines word and document embeddings. I'm posting what I did here in case someone else wants to try something similar in the future.

```python
# custom word + document embeddings
from bertopic import BERTopic
from bertopic.backend import WordDocEmbedder

# contextual word embeddings: Flair forward/backward language models,
# stacked into a single token-level embedder
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

embed_forward = FlairEmbeddings('news-forward-fast')
embed_backward = FlairEmbeddings('news-backward-fast')
word_embeddings = StackedEmbeddings([embed_forward, embed_backward])

# document embeddings, option 1: Universal Sentence Encoder from TF Hub
import tensorflow_hub as hub
tfhub_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# document embeddings, option 2: Sentence Transformers
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')  # default model to use

# WordDocEmbedder expects a single model for each role, so the Flair models
# are stacked above and one document model is chosen here
# (swap in tfhub_model for the USE embeddings instead)
word_doc_embedder = WordDocEmbedder(embedding_model=sentence_model,
                                    word_embedding_model=word_embeddings)

# create a model that uses both embedders and pass it to BERTopic
topic_model = BERTopic(top_n_words=10,
                       min_topic_size=50,
                       nr_topics='auto',
                       calculate_probabilities=True,
                       embedding_model=word_doc_embedder,
                       low_memory=False,
                       verbose=False)

# train the model, extracting topics and probabilities
topics, probs = topic_model.fit_transform(docs)
```

As an aside, I would like to say that it's very magnanimous of you to take time out of a very busy schedule to respond to each of our queries. I really appreciate that.