MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Duplicate Topics Generated: Zero Shot Classification #1747

Closed saikot-paul closed 9 months ago

saikot-paul commented 10 months ago

Hello,

I was wondering if anyone has experienced duplicate topics being generated from the Zero Shot Topic List? I guess there are a couple of ways I can prevent this:

  1. Decrease the threshold further
  2. Merge the two manually
  3. Adjust the clustering algorithm to create fewer clusters
  4. Add maximal marginal relevance (MMR)

My question is how can I configure the model to prevent the duplicate topics from the beginning?

Below is my code:

```python
import pandas as pd
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.representation import ZeroShotClassification

zeroshot_topic_list = [...Topics....]


@memory_checker  # the poster's own decorator; its definition is not shown
def get_zeroshot_topics(docs, zeroshot_topic_list):
    topic_model = BERTopic(
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
        min_topic_size=15,
        zeroshot_topic_list=zeroshot_topic_list,
        zeroshot_min_similarity=.55,
        representation_model=ZeroShotClassification(
            zeroshot_topic_list, model="facebook/bart-large-mnli"
        ),
    )
    topics, _ = topic_model.fit_transform(docs)

    return topics, topic_model
```

MaartenGr commented 10 months ago

You can reduce the number of duplicate topics by decreasing zeroshot_min_similarity. Also, note that you do not need to use ZeroShotClassification here since that is already handled with zeroshot_min_similarity. It might even lead to issues so I would advise removing that.
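As a rough sketch of that advice, the configuration from the question can be rebuilt without the `representation_model` argument. The helper names `zeroshot_kwargs` and `fit_zeroshot` are made up for illustration; the keyword arguments are the ones already used in the question, and the `BERTopic` import is deferred so the module loads even where the library is not installed.

```python
def zeroshot_kwargs(zeroshot_topic_list, min_similarity=0.55):
    """Keyword arguments for BERTopic's built-in zero-shot topic modeling.

    Hypothetical helper: it only collects the settings from the question,
    minus the ZeroShotClassification representation model.
    """
    return {
        "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
        "min_topic_size": 15,
        "zeroshot_topic_list": zeroshot_topic_list,
        "zeroshot_min_similarity": min_similarity,
        # Deliberately no representation_model=ZeroShotClassification(...)
    }


def fit_zeroshot(docs, zeroshot_topic_list, min_similarity=0.55):
    """Fit a zero-shot BERTopic model and return (topics, topic_model)."""
    from bertopic import BERTopic  # lazy import; needs `pip install bertopic`

    topic_model = BERTopic(**zeroshot_kwargs(zeroshot_topic_list, min_similarity))
    topics, _ = topic_model.fit_transform(docs)
    return topics, topic_model
```

Passing a different `min_similarity` to `fit_zeroshot` is then a one-argument change, which makes it easy to compare how many duplicates each threshold produces.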

There are also some functions like .reduce_topics and .merge_topics that can reduce the number of duplicate topics.
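One possible way to use `.merge_topics` for this is to group topic ids that ended up with the same label and merge each group. The helper `duplicate_label_groups` below is hypothetical and works on a plain dict of topic id to label; how you build that dict (for example from the table returned by `get_topic_info()`) depends on your BERTopic version and is an assumption here.

```python
from collections import defaultdict


def duplicate_label_groups(topic_labels):
    """Group topic ids that share a label (case and whitespace insensitive).

    topic_labels: dict mapping topic id -> label string.
    Returns a list of id lists, one per label that occurs more than once.
    """
    groups = defaultdict(list)
    for topic_id, label in topic_labels.items():
        if topic_id == -1:  # -1 is BERTopic's outlier topic; never merge it
            continue
        groups[label.strip().lower()].append(topic_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Toy labels standing in for a fitted model's topic labels.
labels = {-1: "outliers", 0: "Sports", 1: "Politics", 2: "sports ", 3: "Finance", 4: "Politics"}
to_merge = duplicate_label_groups(labels)

# With a fitted model and the original documents, the groups could then be
# passed to merge_topics, which accepts a list of lists of topic ids:
#     if to_merge:
#         topic_model.merge_topics(docs, to_merge)
```

Topic ids are reassigned after a merge, so the label dict should be rebuilt from the model before running the helper a second time.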

saikot-paul commented 10 months ago

Do you mind explaining why I shouldn't use ZeroShotClassification? Is the benefit of using ZeroShotClassification that you can select which LLM is used to generate the labels?

Thank you.

MaartenGr commented 10 months ago

That is because you are already using zero-shot topic modeling with zeroshot_min_similarity and zeroshot_topic_list. So your predefined labels might get overwritten with something else or the non-zeroshot topics might suddenly be zeroshot topics, which is what you want to prevent. Instead, I would advise using one of the LLM procedures if you want model-generated labels instead of model-assigned labels.
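To make the role of the threshold concrete, here is a simplified, library-free sketch of the assignment step: each document embedding is compared with the embeddings of the predefined topics, and a document is assigned to its best-matching topic only if the cosine similarity reaches the threshold. The function names, the toy 2-D vectors, and the use of `>=` are assumptions for illustration; this is not BERTopic's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def assign_zeroshot(doc_vecs, topic_vecs, min_similarity):
    """Split documents into zero-shot assignments and leftovers.

    Returns (assigned, leftover): assigned maps document index to the index
    of its most similar predefined topic; leftover lists the documents that
    fall below the threshold and would go on to regular clustering.
    """
    assigned, leftover = {}, []
    for i, doc in enumerate(doc_vecs):
        sims = [cosine(doc, topic) for topic in topic_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= min_similarity:
            assigned[i] = best
        else:
            leftover.append(i)
    return assigned, leftover


# Toy embeddings: two predefined topics and three documents.
topic_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

strict = assign_zeroshot(doc_vecs, topic_vecs, min_similarity=0.8)
loose = assign_zeroshot(doc_vecs, topic_vecs, min_similarity=0.7)
```

In this toy data the third document sits between the two topics, so it is left for clustering at the stricter threshold and absorbed into a predefined topic at the looser one.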

saikot-paul commented 10 months ago

Apologies, but where can I find documentation on model-generated vs. model-assigned labels?

MaartenGr commented 10 months ago

This is model assigned: https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html

This is model generated: https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#zero-shot-classification

[Image: BERTopic's steps drawn as colored Lego blocks]

If you check out the visualization above, the blue, red, and green Lego blocks represent assigning topics to documents. In contrast, the yellow, grey, and purple Lego blocks represent how the labels of topics are generated.

saikot-paul commented 10 months ago

Thank you