MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Does BERTopic rely on *both* sentence_embeddings and word_embeddings #1403

Open matthewnour opened 1 year ago

matthewnour commented 1 year ago

When exploring relationships between topics (e.g., 2D visualisations, topic hierarchies), we need to represent each topic as a summary vector (a cluster-level embedding).

The BERTopic source code states:

topic_embeddings_ (np.ndarray) : The embeddings for each topic. It is calculated by taking the weighted average of word embeddings in a topic based on their c-TF-IDF values.

This seems to imply that BERTopic needs both a sentence-level embedding model (for documents) and a separate word-level embedding model.

Is this the case? Where is this specified in the source code, please?

MaartenGr commented 1 year ago

Hmmm, that docstring should be updated. There are now a number of ways topic embeddings can be calculated. By default, the topic embedding is the average of all document embeddings assigned to that topic. Only if that is not possible does it fall back to the c-TF-IDF-weighted average of word embeddings. Those word embeddings can be generated with the same sentence-level model by simply passing it a single word and having it produce an embedding for that word. You could use separate sentence-level and word-level models, but with sentence-transformers that is generally not necessary.
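
To make that concrete, here is a minimal sketch of the two strategies described above, written outside of BERTopic itself. The corpus, topic assignments, top words, and c-TF-IDF weights are all illustrative placeholders, not the library's internal code or values:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy corpus with topic assignments (as a clustering step would produce).
docs = [
    "the cat sat on the mat",
    "dogs are loyal pets",
    "stock prices fell sharply",
    "markets rallied after the news",
]
topics = np.array([0, 0, 1, 1])
doc_embeddings = model.encode(docs)

# Default: the topic embedding is the mean of its documents' embeddings.
topic_embeddings = np.vstack(
    [doc_embeddings[topics == t].mean(axis=0) for t in np.unique(topics)]
)

# Fallback: a c-TF-IDF-weighted average of word embeddings, where each
# word embedding comes from the *same* sentence model applied to a
# single word -- no separate word-level model is needed.
top_words = ["cat", "dog", "pets"]          # hypothetical top words for topic 0
ctfidf_weights = np.array([0.5, 0.3, 0.2])  # hypothetical c-TF-IDF scores
word_embeddings = model.encode(top_words)
fallback_embedding = np.average(word_embeddings, axis=0, weights=ctfidf_weights)
```

In both cases a single sentence-transformer suffices: it embeds whole documents for the default path and individual words for the fallback path.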