MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Merging topics #1588

Open frasercrichton opened 12 months ago

frasercrichton commented 12 months ago

Hey Maarten,

I've further refined my topic model but now I'm noticing a couple of weird issues.

If I merge topics (even a single topic into another single topic), my topic model appears to collapse from 32 topics to about 4, including the outliers. The same thing happens if I call update_topics() with the CountVectorizer.

Another thing I've noticed is that if I set the min_topic_size hyper-parameter, it does nothing even though I'm using HDBSCAN. Is that something you have seen before?

Cheers.

MaartenGr commented 12 months ago

This is not expected behavior and I have not seen something like this before. Could you share your full code? Without it, it is difficult to understand what is happening here. Make sure to be as complete as possible.

frasercrichton commented 12 months ago

import os

import openai
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import (KeyBERTInspired, PartOfSpeech,
                                     MaximalMarginalRelevance, OpenAI)

embedding_model = SentenceTransformer('all-mpnet-base-v2')
embeddings = embedding_model.encode(docs, show_progress_bar=True)

umap_model = UMAP(n_neighbors=15, 
                  n_components=5, 
                  min_dist=0.0, 
                  metric='cosine', 
                  random_state=42)

hdbscan_model = HDBSCAN(
    min_cluster_size=10, 
    metric='euclidean', 
    cluster_selection_method='eom', 
    min_samples=8, # added to reduce outliers
    prediction_data=True)

vectorizer_model = CountVectorizer(stop_words="english")

# KeyBERT
keybert_model = KeyBERTInspired()

# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# GPT-3.5
openai.api_key=os.environ['openai_api_key'] 
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
openai_model = OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)

from bertopic.representation import ZeroShotClassification

candidate_topics = [
    'x', 
    # 'y',
    'z',
    ]

zero_shot_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")

# representation_model = zero_shot_model

representation_model = {
    "Main": zero_shot_model,
    'KeyBERT': keybert_model,
    # 'OpenAI': openai_model,  # Uncomment if you will use OpenAI
    'MMR': mmr_model,
    # 'POS': pos_model,
    # 'ZeroShot': zero_shot_model,
}

seed_topic_list = [
    ['x'],
  ]

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)

topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  # vectorizer_model=                       # Step 4 - Tokenize topics. Don't do this! It removed the entire abortion topic.
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model, # Step 6 - (Optional) Fine-tune topic representations
  seed_topic_list=seed_topic_list,
  min_topic_size=10, # 10 is the default, but this appears to have no effect
  nr_topics=29, # 32
  verbose=True,
  n_gram_range=(1,3), # allows Brothers of Italy
  calculate_probabilities=True,
)

topics, probs = topic_model.fit_transform(docs)
topic_labels = topic_model.generate_topic_labels(nr_words=3, topic_prefix=False, word_length=20, separator=', ')
topic_model.set_topic_labels(topic_labels)
# 889 outliers
topic_model.get_topic_info()

yes             = [-1, 11]    # yes, yes yes
thanks          = [-1, 14]    # you thank, you
good_morning    = [-1, 23]    # good morning
why             = [-1, 27]    # why, why why

topics_to_merge = [good_morning, why, thanks, yes]
print(topics_to_merge)
topic_model.merge_topics(docs, topics_to_merge=topics_to_merge)     
topic_model.get_topic_info().head()

MaartenGr commented 12 months ago

There seems to be a lot happening in your code, along with some superfluous parts (like the OpenAI section, which you do not appear to use). Also, is this the exact code you are using? I ask because the following does not look like actual topics:

candidate_topics = [
    'x', 
    # 'y',
    'z',
    ]

Either way, I think the issue lies here:

topics_to_merge = [good_morning, why, thanks, yes]

These should be topic identifiers, I believe, and not labels. See the example here: https://maartengr.github.io/BERTopic/getting_started/topicreduction/topicreduction.html
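
For reference, the pattern on that page looks roughly like this (a sketch, assuming a fitted topic_model and the original docs): a flat list merges everything in it into one topic, while a list of lists merges each group separately.

# Merge topics 1 and 2 into a single topic
topics_to_merge = [1, 2]
topic_model.merge_topics(docs, topics_to_merge)

# Merge several independent groups in one call
topics_to_merge = [[1, 2], [3, 4]]
topic_model.merge_topics(docs, topics_to_merge)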

All in all, I would highly suggest going through the best practices guide as it shows a number of helpful tips and tricks for getting the topics that you are looking for.

> Another thing I've noticed is that if I set the min_topic_size hyper-parameter, it does nothing even though I'm using HDBSCAN. Is that something you have seen before?

This is expected behavior if you are using hdbscan_model, since min_topic_size is essentially the min_cluster_size parameter. In other words, it will be overwritten.
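
To make this concrete: when no hdbscan_model is passed, BERTopic constructs one internally and feeds min_topic_size into min_cluster_size. A minimal sketch of that equivalence (an approximation, not the exact internal code):

from hdbscan import HDBSCAN
from bertopic import BERTopic

# This...
topic_model = BERTopic(min_topic_size=10)

# ...behaves roughly like this:
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)

# Once you supply hdbscan_model yourself, min_topic_size is ignored:
topic_model = BERTopic(hdbscan_model=hdbscan_model, min_topic_size=50)  # 50 has no effect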

frasercrichton commented 12 months ago

Are you saying that Python isn't treating this as a variable: yes = [-1, 11]? I've been through the best practices and spent a significant amount of time working on this now.

frasercrichton commented 12 months ago

And just for clarity:

seed_topic_list = [  
    ['Brothers of Italy', 'brothers of italy', 'Italy', 'Italian'],
    ['we are ready'], # The FDI's Campaign slogan     
    ['immigration', 'migration', 'migrants', 'refugee', 'traffickers'],
    ['abortion', 'abort', '194', 'law 194'],
    ['election', 'government', 'vote'],
    ['inflation', 'bills'],
    ['freedom'],
    ['rape', 'raped'],
    ['women'],

    ['climate', 'environmental', 'ecological', 'sustainability'],
    ['fake', 'fake news', 'lies', 'journalism'],
    ['tax', 'income'],
    ['crime'],
    ['minimum wage'],
    ['Nazis', 'nazis'],
    ['pensions'],
    ['family', 'families'],
    ['pets', 'animals'], # added as pets get merged into the migrants topic
    ['russia']       
    ]
candidate_topics = [
    'migrants', 
    # 'immigration',
    'abortion', 
    'fake news', 
    'Brothers of Italy', 
    'we are ready',
    'rape',
    'Nazis',
    'minimum wage',
    'ecological',
    'green pass',
    'russia',
    'crime', # this is used to separate out crime from migration
    'authoritarian',

    'women',
    # 'crime', 

    'inflation', 
    'citizenship', 
    'freedom',
    'prices',
    'pensions',
    'tax',
    'family',
    # 'government'    
    ]

MaartenGr commented 12 months ago

Oops, my bad! I completely misread that one. I thought they were strings instead of identifiers.

Instead, it might be related to duplicate topics in your examples. If I am not mistaken, you intend to merge all of the following topics into the outlier topic:

yes             = [-1, 11]    # yes, yes yes
thanks          = [-1, 14]    # you thank, you
good_morning    = [-1, 23]    # good morning
why             = [-1, 27]    # why, why why

topics_to_merge = [good_morning, why, thanks, yes]

I believe you will have to do this instead:

topics_to_merge = [-1, 11, 14, 23, 27]

That way, all of these topics will be merged into the -1 topic in a single pass. By repeating the same topic, -1, across several merge groups, the merging is attempted iteratively, which can result in issues.
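
Applied to your variables, a minimal sketch of the corrected call (assuming the same fitted topic_model and docs as above):

# Merge topics 11, 14, 23, and 27 into the outlier topic in a single pass
topics_to_merge = [-1, 11, 14, 23, 27]
topic_model.merge_topics(docs, topics_to_merge=topics_to_merge)
topic_model.get_topic_info().head()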

Lastly, there is an open PR with upcoming features (zero-shot topic modeling instead of classification) that might be better suited to your specific use case. It can generate the specific topics (including labels) that you are looking for in candidate_topics.

frasercrichton commented 12 months ago

Oh, now that makes sense. Cool. Will have a look later today.

That PR looks like exactly what I'm after! Thanks Maarten.