MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

find_topics returns the same topics for any keywords #1426

Open aaron-imani opened 1 year ago

aaron-imani commented 1 year ago

Hello, I have fit a BERTopic model using cuML's HDBSCAN and UMAP. I used microsoft/codebert-base from Hugging Face as the embedding model, like this:

from transformers import pipeline
from bertopic import BERTopic

# umap_model and hdbscan_model are the cuML UMAP/HDBSCAN instances
model = pipeline("feature-extraction",
                 model="microsoft/codebert-base",
                 device=0)
topic_model = BERTopic(embedding_model=model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       calculate_probabilities=True)
topic_model = topic_model.fit(docs)

topic_model.save("models/codebert",
                 serialization="safetensors",
                 save_ctfidf=True)

The above code runs without any problems. This is how I load the model:

from transformers import pipeline
from bertopic import BERTopic
from bertopic.backend._hftransformers import HFTransformerBackend

embedding_model = 'microsoft/codebert-base'
model = pipeline('feature-extraction', embedding_model)
model = HFTransformerBackend(model)
topic_model = BERTopic.load("models/codebert",
                            embedding_model=model)

Although calling get_topic_info on topic_model returns a list of meaningful topics, find_topics returns the same list of topics with almost identical probabilities no matter what search term I pass. The same thing happened when I used another Hugging Face model. Are there any potential workarounds?

MaartenGr commented 1 year ago

You do not seem to be loading the backend the same way when saving and loading the topic model. Could you apply it in the same way in both places and try again?

Also, could you show an example of the issue you experience with find_topics? It would help in understanding the issue.

aaron-imani commented 1 year ago

I tried loading it the same way as well, but the behavior is unchanged:

from transformers import pipeline
from bertopic import BERTopic

embedding_model = 'microsoft/codebert-base'
hf_pipeline = pipeline('feature-extraction', embedding_model)
topic_model = BERTopic.load("models/codebert",
                            embedding_model=hf_pipeline)

Here is an example of trying different keywords:

[screenshots omitted: find_topics output for several different search terms]

There are two problems with the output:

1. The similarity scores are very high across the board (that was not the case in my experiments with sentence-transformers).
2. Although the topics within the topic model are coherent, the returned topics are irrelevant to the search query.

MaartenGr commented 1 year ago

1- Similarity of topics is so high (It shouldn't be like that based on my experiment with sentence transformers)

Ah, now I understand! That is actually expected behavior: the models in sentence-transformers are optimized for similarity tasks, whereas regular BERT-like models are not and will typically output very high similarity scores for everything. That is also why sentence-transformers models are the default in BERTopic; they outperform regular BERT models by a very large margin. An overview of very strong models to use in BERTopic can be found here.
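To illustrate the effect with made-up numbers: find_topics ranks topics by cosine similarity between the query embedding and each topic embedding. Embeddings from a vanilla BERT-style model tend to sit in a narrow cone, so every topic scores nearly the same against any query, while contrastively trained sentence-transformer embeddings spread out and discriminate. A toy numpy sketch (all vectors invented, not real model outputs):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity, the score find_topics ranks topics by
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0])

# Vanilla-BERT-style topic embeddings: clustered in a narrow cone,
# so every topic scores ~0.99 against the query.
bert_like = [np.array([1.0, 0.05]), np.array([1.0, 0.10]), np.array([1.0, 0.15])]

# Sentence-transformer-style embeddings: spread out, so the scores
# actually discriminate between topics.
st_like = [np.array([1.0, 0.0]), np.array([0.1, 1.0]), np.array([0.7, 0.7])]

bert_scores = [cosine_sim(query, t) for t in bert_like]
st_scores = [cosine_sim(query, t) for t in st_like]

print([round(s, 3) for s in bert_scores])  # all close to 1.0
print([round(s, 3) for s in st_scores])    # clearly separated
```

With scores this flat, the ranking (and hence the returned topic list) barely changes from one query to the next, which matches the symptom above.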

aaron-imani commented 1 year ago

I see. Thank you for your guidance! Should I look at the "Clustering" section of the provided link for the comparison? Which tab includes the appropriate comparison?

MaartenGr commented 1 year ago

You would generally look at the clustering tab.