MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Where is the full data set of embeddings? #1988

Open vegabook opened 3 months ago

vegabook commented 3 months ago

I'm passing a list of approximately 700 articles to the default embeddings function as follows:

from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(d)

where d is a simple list of strings (each an article of roughly 200 words). However, when I call topic_model.get_topic_info() I see only 11 topics, and only their associated embeddings via the topic_embeddings_ attribute. Where can I get the full set of 700 actual document embedding vectors?

MaartenGr commented 3 months ago

Where can I get the full 700 actual embedding vectors?

The embeddings of the documents are not saved in the model as that would blow up the model's size. If you want the embeddings of the documents, I would advise reading through the best practices.
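For reference, a minimal sketch of that approach, assuming the default English embedding model ("all-MiniLM-L6-v2") and a list of documents named docs; once you compute the embeddings yourself, you also have all 700 vectors on hand:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Pre-compute the document embeddings once and keep them around
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Pass the pre-computed embeddings so BERTopic skips the embedding step
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs, embeddings)

Here embeddings is a NumPy array of shape (n_documents, embedding_dim), i.e. one vector per document.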

vegabook commented 3 months ago

I see. So when you say in the best practices that embeddings can be pre-calculated and fed to the model "especially if you want to iterate over parameters", this assumes that you wish to explore parameters other than the embedding model itself, right?

I.e., calculate the embeddings once, then explore clustering, dimensionality reduction, etc. using the same embeddings?

So the phrase should not contain the word "especially"; it should read "This process can be very costly, if we want to iterate over [other] parameters"?

Apologies for the semantic pedantry, but I just want to make sure I understand what you're saying correctly and that I'm not missing something.

MaartenGr commented 3 months ago

It is indeed meant to say that when you wish to explore parameters other than the embedding model itself, pre-calculating the embeddings is preferred.
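As a rough illustration of that workflow, the sketch below reuses the docs and embeddings from the earlier snippet while only the clustering step changes; the min_cluster_size values are purely illustrative:

from bertopic import BERTopic
from hdbscan import HDBSCAN

# Iterate over clustering parameters while reusing the same embeddings,
# so the costly embedding step runs only once
for min_cluster_size in (5, 10, 20):
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, prediction_data=True)
    topic_model = BERTopic(hdbscan_model=hdbscan_model)
    topics, probs = topic_model.fit_transform(docs, embeddings)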