MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

About the word embedding and the topic embedding #1206

Closed ghost closed 1 year ago

ghost commented 1 year ago

Could I ask a question about topic embeddings? The code mentions that a topic embedding is a TF-IDF-weighted average of word embeddings, so I would like to know which method you use to get the word embeddings. Is it the same model that is used for the sentence embeddings, i.e. self.embedding_model? Thanks.

MaartenGr commented 1 year ago

Yes, the word embeddings are generated with topic_model.embedding_model. The topic embedding, however, might change in the future to better support inference with the new PR. It will likely use the centroids instead, and additionally generate topic embeddings from c-TF-IDF-weighted word embeddings separately for representation purposes. The centroids typically mimic the clustering algorithm well but not the topic representation itself, whereas the weighted average of c-TF-IDF word embeddings is better the other way around. At least, in my experience.
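
To make the two options concrete, here is a minimal sketch with toy vectors. It is not BERTopic's implementation: the function names, the toy data, and the reading of "centroid" as the mean of the document embeddings assigned to a topic are all assumptions made for illustration.

```python
def centroid(doc_embeddings):
    """Mean of the document embeddings assigned to one topic (assumed meaning of 'centroid')."""
    n = len(doc_embeddings)
    dim = len(doc_embeddings[0])
    return [sum(vec[i] for vec in doc_embeddings) / n for i in range(dim)]


def weighted_word_average(word_embeddings, weights):
    """Average of word embeddings, weighted by each word's c-TF-IDF score."""
    total = sum(weights)
    dim = len(word_embeddings[0])
    return [
        sum(w * vec[i] for vec, w in zip(word_embeddings, weights)) / total
        for i in range(dim)
    ]


# Toy 2-dimensional vectors; real ones would come from the embedding model.
doc_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_embeddings = [[1.0, 0.0], [0.0, 1.0]]
ctfidf_weights = [3.0, 1.0]  # hypothetical c-TF-IDF scores of the topic's top words

topic_centroid = centroid(doc_embeddings)
topic_representation = weighted_word_average(word_embeddings, ctfidf_weights)
```

The centroid depends only on which documents landed in the cluster, while the weighted average depends only on the topic's top words and their scores, which is why the two can differ for the same topic.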

ghost commented 1 year ago

Thanks for the answer. If the word embeddings are generated with the same model as the document embeddings, then, as I understand it, the sentence embedding is probably built from the word embeddings through pooling, the [CLS] token, or similar. In that case, we could perhaps save the word embeddings while generating the sentence embeddings to avoid repeated computation, at the cost of higher space complexity. I am just getting started in natural language processing, so these are only some of my thoughts. Your work has helped me a lot!
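
A rough sketch of the caching idea, trading memory for fewer model calls. The class name `CachedWordEmbedder` and the `toy_embed` function are hypothetical stand-ins, not part of BERTopic. One caveat: with contextual models, the vector of a token inside a sentence generally differs from the vector of the same word embedded on its own, so a cache like this only helps for repeated standalone word lookups.

```python
class CachedWordEmbedder:
    """Wraps an embedding function and remembers the vector of every word seen."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}   # word -> vector; grows with the vocabulary
        self.calls = 0    # number of times the underlying function was called

    def embed(self, word):
        if word not in self.cache:
            self.cache[word] = self.embed_fn(word)
            self.calls += 1
        return self.cache[word]


def toy_embed(word):
    # Stand-in for an expensive model call; returns a deterministic 2-d vector.
    return [float(len(word)), float(sum(map(ord, word)) % 7)]


embedder = CachedWordEmbedder(toy_embed)
vectors = [embedder.embed(w) for w in ["topic", "model", "topic"]]
# The second "topic" lookup is served from the cache, so only two calls are made.
```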

MaartenGr commented 1 year ago

It depends on the underlying embedding model, as each model handles this differently. The default is sentence-transformers, which you can read about here. With a custom backend, you can specify how both word and document embeddings are handled in BERTopic.
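
As an illustration of a backend that treats words and documents differently, here is a self-contained toy. It is not a BERTopic class: `ToyBackend` and its method names are made up for this sketch. In BERTopic itself, a custom backend is, as far as I know, written by subclassing `bertopic.backend.BaseEmbedder`.

```python
class ToyBackend:
    """Stand-in backend: words use a lookup table, documents use mean pooling."""

    def __init__(self, word_vectors, dim):
        self.word_vectors = word_vectors  # dict: word -> vector
        self.dim = dim

    def embed_words(self, words):
        # Unknown words fall back to a zero vector.
        zero = [0.0] * self.dim
        return [self.word_vectors.get(w.lower(), zero) for w in words]

    def embed_documents(self, documents):
        embeddings = []
        for doc in documents:
            vectors = self.embed_words(doc.split())
            if not vectors:
                # Empty document: return a zero vector instead of dividing by zero.
                embeddings.append([0.0] * self.dim)
                continue
            # Mean pooling over the word vectors of the document.
            embeddings.append([sum(col) / len(vectors) for col in zip(*vectors)])
        return embeddings


backend = ToyBackend({"cat": [1.0, 0.0], "dog": [0.0, 1.0]}, dim=2)
doc_vectors = backend.embed_documents(["cat dog", "cat cat", ""])
word_vectors = backend.embed_words(["dog", "bird"])
```

Here the document embedding is derived from the word embeddings, which is one possible design. A sentence-transformers model makes its own choices internally, which is why the answer depends on the model.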

MaartenGr commented 1 year ago

Closing this due to inactivity. Let me know if I need to re-open the issue!