matthewnour opened 1 year ago
Hmmm, that should be updated. There are now a number of ways topic embeddings can be calculated. By default, the average of all document embeddings in a topic is taken to create the topic embedding. If that is not possible, then a weighted average of word embeddings is used instead. These word embeddings can also be calculated with the sentence-level model by simply giving it a single word and having it generate an embedding for that. You could separate the sentence-level and word-level embedding models, but that is generally not necessary with sentence-transformers.
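To make the two strategies concrete, here is a minimal sketch (not BERTopic's actual implementation; function names and the c-TF-IDF weighting detail are my own illustration) of the default document-averaging approach and the weighted word-embedding fallback described above:

```python
import numpy as np

def topic_embedding_from_docs(doc_embeddings: np.ndarray) -> np.ndarray:
    """Default strategy: average the embeddings of all documents
    assigned to one topic to get the topic embedding."""
    return doc_embeddings.mean(axis=0)

def topic_embedding_from_words(word_embeddings: np.ndarray,
                               weights: np.ndarray) -> np.ndarray:
    """Fallback strategy: a weighted average of the topic's word
    embeddings (weights could e.g. be c-TF-IDF scores). Note the word
    embeddings themselves can come from the same sentence-transformer,
    by embedding each word as if it were a short document."""
    weights = weights / weights.sum()              # normalise to sum to 1
    return (weights[:, None] * word_embeddings).sum(axis=0)

# Toy example: three 4-dimensional embeddings.
embs = np.array([[1., 0., 0., 0.],
                 [0., 1., 0., 0.],
                 [0., 0., 1., 0.]])
topic_vec = topic_embedding_from_docs(embs)
topic_vec_weighted = topic_embedding_from_words(embs, np.array([1., 1., 2.]))
```

Either way, the result is a single summary vector per topic, which is what the 2D visualisations and hierarchy computations operate on.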
When exploring relationships between topics (2D visualisations, hierarchy), we need to represent each topic as a summary vector (a cluster-level embedding).
The BERTopic source code states:
This seems to imply that BERTopic needs both a sentence-level embedding model and a word-level embedding model.
Is this the case? Where is this specified in the source code please?