frasercrichton opened 12 months ago
This is not expected behavior and I have not seen something like this before. Could you share your full code? Without it, it is difficult to understand what is happening here. Make sure to be as complete as possible.
import os
import openai
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import (KeyBERTInspired, PartOfSpeech,
                                     MaximalMarginalRelevance, OpenAI)

embedding_model = SentenceTransformer('all-mpnet-base-v2')
embeddings = embedding_model.encode(docs, show_progress_bar=True)
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10,
                        metric='euclidean',
                        cluster_selection_method='eom',
                        min_samples=8,  # added to reduce outliers
                        prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english")
# KeyBERT
keybert_model = KeyBERTInspired()
# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")
# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)
# GPT-3.5
openai.api_key=os.environ['openai_api_key']
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
openai_model = OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)
from bertopic.representation import ZeroShotClassification
candidate_topics = [
'x',
# 'y',
'z',
]
zero_shot_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")
# representation_model = zero_shot_model
representation_model = {
"Main": zero_shot_model,
'KeyBERT': keybert_model,
# 'OpenAI': openai_model, # Uncomment if you will use OpenAI
'MMR': mmr_model,
# 'POS': pos_model,
# 'ZeroShot': zero_shot_model,
}
seed_topic_list = [
['x'],
]
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)
topic_model = BERTopic(
embedding_model=embedding_model, # Step 1 - Extract embeddings
umap_model=umap_model, # Step 2 - Reduce dimensionality
hdbscan_model=hdbscan_model, # Step 3 - Cluster reduced embeddings
# vectorizer_model= # Step 4 - Tokenize topics. Don't do this! It removed the entire abortion topic.
ctfidf_model=ctfidf_model, # Step 5 - Extract topic words
representation_model=representation_model, # Step 6 - (Optional) Fine-tune topic representations
seed_topic_list= seed_topic_list,
min_topic_size=10, # 10 is the default nope
nr_topics=29, # 32
verbose=True,
n_gram_range=(1,3), # allows Brothers of Italy
calculate_probabilities=True,
)
topics, probs = topic_model.fit_transform(docs)
topic_labels = topic_model.generate_topic_labels(nr_words=3, topic_prefix=False, word_length=20, separator=', ')
topic_model.set_topic_labels(topic_labels)
# 889 outliers
topic_model.get_topic_info()
yes = [-1, 11] # yes, yes yes
thanks = [-1, 14] # you thank, you
good_morning = [-1, 23] # good morning
why = [-1, 27] # why, why why
topics_to_merge = [good_morning, why, thanks, yes]
print(topics_to_merge)
topic_model.merge_topics(docs, topics_to_merge=topics_to_merge)
topic_model.get_topic_info().head()
There seems to be a lot happening in your code, along with some superfluous parts (like the OpenAI section, which you do not appear to use). Also, is this the exact code you are using? I ask because the following does not look like actual topics:
candidate_topics = [
'x',
# 'y',
'z',
]
Either way, I think the issue lies here:
topics_to_merge = [good_morning, why, thanks, yes]
I believe these should be topic identifiers, not labels. See the example here: https://maartengr.github.io/BERTopic/getting_started/topicreduction/topicreduction.html
All in all, I would highly suggest going through the best practices guide as it shows a number of helpful tips and tricks for getting the topics that you are looking for.
Another thing I've noticed is that if I set the min_topic_size hyper-parameter it does nothing even though I'm using HDBSCAN. Is that something you have seen before?
This is expected behavior if you are using a custom hdbscan_model, since min_topic_size is essentially the min_cluster_size parameter. In other words, it will be overwritten.
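The override can be pictured with a toy sketch (plain Python, not BERTopic's actual source): when a custom clustering model is supplied, its own min_cluster_size takes precedence, and min_topic_size only configures the default HDBSCAN that BERTopic would otherwise build.

```python
# Toy illustration of the parameter precedence described above.
class ToyHDBSCAN:
    def __init__(self, min_cluster_size=10):
        self.min_cluster_size = min_cluster_size

def effective_min_cluster_size(min_topic_size=10, hdbscan_model=None):
    if hdbscan_model is not None:
        # A custom model wins; min_topic_size is effectively ignored
        return hdbscan_model.min_cluster_size
    # No custom model: min_topic_size becomes min_cluster_size
    return min_topic_size

print(effective_min_cluster_size(min_topic_size=50))  # 50
print(effective_min_cluster_size(min_topic_size=50,
                                 hdbscan_model=ToyHDBSCAN(10)))  # 10
```

So to change the minimum topic size in this setup, adjust min_cluster_size on the HDBSCAN model itself rather than min_topic_size on BERTopic.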
Are you saying that Python isn't treating yes = [-1, 11] as a variable? I've been through the best practices and have spent a significant amount of time working on this now.
And just for clarity:
seed_topic_list = [
['Brothers of Italy', 'brothers of italy', 'Italy', 'Italian'],
['we are ready'], # The FDI's Campaign slogan
['immigration', 'migration', 'migrants', 'refugee', 'traffickers'],
['abortion', 'abort', '194', 'law 194'],
['election', 'government', 'vote'],
['inflation', 'bills'],
['freedom'],
['rape', 'raped'],
['women'],
['climate', 'environmental', 'ecological', 'sustainability'],
['fake', 'fake news', 'lies', 'journalism'],
['tax', 'income'],
['crime'],
['minimum wage'],
['Nazis', 'nazis'],
['pensions'],
['family', 'families'],
['pets', 'animals'], # added as pets get merge into the migrants topic
['russia']
]
candidate_topics = [
'migrants',
# 'immigration',
'abortion',
'fake news',
'Brothers of Italy',
'we are ready',
'rape',
'Nazis',
'minimum wage',
'ecological',
'green pass',
'russia',
'crime', # this is used to separate out crime from migration
'authoritarian',
'women',
# 'crime',
'inflation',
'citizenship',
'freedom',
'prices',
'pensions',
'tax',
'family',
# 'government'
]
Oops, my bad! I completely misread that one; I thought those were strings rather than identifiers.
Instead, it might be related to duplicate topics in your examples. If I am not mistaken, you intend to merge all of the following topics into the outlier topic:
yes = [-1, 11] # yes, yes yes
thanks = [-1, 14] # you thank, you
good_morning = [-1, 23] # good morning
why = [-1, 27] # why, why why
topics_to_merge = [good_morning, why, thanks, yes]
I believe you will have to do this instead:
topics_to_merge = [-1, 11, 14, 23, 27]
That way, all of these topics will be merged together into the -1 topic. If you repeat the same topic, -1, across several merging groups, BERTopic will try to perform the merges iteratively, which can result in issues.
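To make the fix concrete, here is a small sketch (plain Python, no BERTopic required) that flattens the per-group lists from the snippet above into the single list that merge_topics expects:

```python
# The four groups from the snippet above, each repeating -1
yes = [-1, 11]
thanks = [-1, 14]
good_morning = [-1, 23]
why = [-1, 27]

# Wrong: a list of lists, each containing -1, triggers several
# iterative merges into the outlier topic.
nested = [good_morning, why, thanks, yes]

# Right: one flat, de-duplicated list merges everything into -1 at once.
topics_to_merge = sorted({t for group in nested for t in group})
print(topics_to_merge)  # [-1, 11, 14, 23, 27]
```

The flat list is then passed once: topic_model.merge_topics(docs, topics_to_merge=topics_to_merge).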
Lastly, there is an open PR with upcoming features (zero-shot topic modeling instead of classification) that might be better suited to your specific use case. It can generate the specific topics (including labels) that you are looking for in candidate_topics.
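For later readers: that feature shipped in BERTopic v0.16 as zero-shot topic modeling via the zeroshot_topic_list parameter. The call below is only a sketch of the intended shape (the BERTopic call is commented out since embedding_model and docs come from the snippet above); zeroshot_min_similarity controls how close a document must be to a candidate topic before it is assigned.

```python
# Sketch of zero-shot topic modeling as released in BERTopic v0.16.
zeroshot_topic_list = ["migrants", "abortion", "fake news", "Brothers of Italy"]

# topic_model = BERTopic(
#     embedding_model=embedding_model,
#     min_topic_size=10,
#     zeroshot_topic_list=zeroshot_topic_list,
#     zeroshot_min_similarity=0.85,
# )
# topics, _ = topic_model.fit_transform(docs)
print(len(zeroshot_topic_list))  # 4
```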
Oh, now that makes sense. Cool. Will have a look later today.
That PR looks like exactly what I'm after! Thanks Maarten.
Hey Maarten,
I've further refined my topic model but now I'm noticing a couple of weird issues.
If I merge topics (even a single topic into another single topic), my topic model appears to collapse from 32 topics to about 4, including the outliers. The same thing happens if I call update_topics() with the CountVectorizer.
Another thing I've noticed is that if I set the min_topic_size hyper-parameter it does nothing even though I'm using HDBSCAN. Is that something you have seen before?
Cheers.