Closed saikot-paul closed 9 months ago
You can reduce the number of duplicate topics by decreasing `zeroshot_min_similarity`. Also, note that you do not need to use `ZeroShotClassification` here since that is already handled with `zeroshot_min_similarity`. It might even lead to issues, so I would advise removing that.
There are also some functions like `.reduce_topics` and `.merge_topics` that can reduce the number of duplicate topics.
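As a rough sketch of the `.merge_topics` route, assuming the duplicates show up as topics that share the same label: the helper below groups topic IDs by label so the groups can be merged afterwards. The function name `find_duplicate_topic_groups`, the `topic_labels` mapping, and the example labels are hypothetical and not part of BERTopic; how you build the mapping depends on your setup.

```python
from collections import defaultdict


def find_duplicate_topic_groups(topic_labels):
    """Group topic IDs that share the same label (ignoring case and padding).

    `topic_labels` is a hypothetical {topic_id: label} mapping built by the
    caller; it is not something BERTopic returns in this exact shape.
    """
    groups = defaultdict(list)
    for topic_id, label in topic_labels.items():
        if topic_id == -1:  # -1 is the outlier topic, never merge it
            continue
        groups[label.strip().lower()].append(topic_id)
    # Keep only labels that occur more than once
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Made-up labels, purely for illustration
example_labels = {
    -1: "outliers",
    0: "Sports",
    1: "Politics",
    2: "sports ",
    3: "Finance",
    4: "Politics",
}
duplicate_groups = find_duplicate_topic_groups(example_labels)
print(duplicate_groups)  # [[0, 2], [1, 4]]
```

With a fitted model, the resulting list of lists could then be passed on as `topic_model.merge_topics(docs, duplicate_groups)`, where `docs` are the documents the model was fitted on. If you would rather let the model decide what to combine, `topic_model.reduce_topics(docs, nr_topics=...)` is the other option mentioned above.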
Do you mind explaining why I shouldn't use ZeroShotClassification? Is the benefit of using ZeroShotClassification that I can select which LLM is used to generate the labels?
Thank you.
That is because you are already using zero-shot topic modeling with `zeroshot_min_similarity` and `zeroshot_topic_list`. So your predefined labels might get overwritten with something else, or the non-zeroshot topics might suddenly become zeroshot topics, which is what you want to prevent. Instead, I would advise using one of the LLM procedures if you want model-generated labels instead of model-assigned labels.
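A minimal sketch of what the zero-shot-only configuration could look like, assuming the same parameter values as in the posted code. The helper name `build_zeroshot_kwargs` and the placeholder labels are hypothetical; the point is only that no `representation_model=ZeroShotClassification(...)` is passed.

```python
def build_zeroshot_kwargs(zeroshot_topic_list, zeroshot_min_similarity=0.55):
    """Collect BERTopic keyword arguments for zero-shot topic modeling only."""
    return {
        "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
        "min_topic_size": 15,
        "zeroshot_topic_list": list(zeroshot_topic_list),
        "zeroshot_min_similarity": zeroshot_min_similarity,
        # Deliberately no "representation_model" entry: the predefined
        # labels are assigned by the zero-shot step itself.
    }


# Placeholder labels, not the ones from the original post
kwargs = build_zeroshot_kwargs(["sports", "politics"])

# With BERTopic installed, the model would then be created roughly as:
# from bertopic import BERTopic
# topic_model = BERTopic(**kwargs)
# topics, probs = topic_model.fit_transform(docs)
```

Lowering `zeroshot_min_similarity` in this setup is then the single knob for how readily documents are assigned to the predefined topics.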
Apologies, but where can I find documentation on model-generated vs. model-assigned labels?
This is model assigned: https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html
This is model generated: https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#zero-shot-classification
If you check out the visualization above, the blue, red, and green Lego blocks represent assigning topics to documents. In contrast, the yellow, grey, and purple Lego blocks represent how the labels of topics are generated.
Thank you
Hello,
I was wondering if anyone has experienced duplicate topics being generated from the Zero Shot Topic List? I guess there are a couple of ways I can prevent this:
My question is how can I configure the model to prevent the duplicate topics from the beginning?
Below is my code:

```python
import pandas as pd
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.representation import ZeroShotClassification

zeroshot_topic_list = [...Topics....]


# memory_checker is a decorator that is not defined in this snippet
@memory_checker
def get_zeroshot_topics(docs, zeroshot_topic_list):
    topic_model = BERTopic(
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
        min_topic_size=15,
        zeroshot_topic_list=zeroshot_topic_list,
        zeroshot_min_similarity=.55,
        representation_model=ZeroShotClassification(zeroshot_topic_list, model="facebook/bart-large-mnli"),
    )
    # fit_transform returns (topics, probabilities); the probabilities are discarded
    topics, _ = topic_model.fit_transform(docs)
    return topics, topic_model
```