MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Chosen represented Topic #2048

Open mahmawad opened 3 months ago

mahmawad commented 3 months ago

First I would like to thank you for your great tool.

I have a question. This is one of the topic representations in my documents:

"President's son trial in Manhattan"

However, most of the documents under this topic aren't related to Hunter Biden; they mostly talk about politics in general.

Is there a way to make the representation more general?

MaartenGr commented 3 months ago

> First I would like to thank you for your great tool.

Thank you for the kind words!

> Is there a way to make the representation more general?

It's difficult to say without seeing the full code, versions, the output of `.get_topic_info()`, etc. For instance, it's not clear to me which topic representation model you are using. Could you provide a bit more information? I would need your full training code, your version of BERTopic, and the output of running `.get_topic_info()`.
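For reference, a minimal sketch of how to gather that information, assuming a fitted model named `topic_model`:

```python
import bertopic

# The installed BERTopic version
print(bertopic.__version__)

# Overview of all topics: their ids, sizes, and representations
print(topic_model.get_topic_info())
```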

> This is one of the topic representations in my documents: "President's son trial in Manhattan"

Is this a representative document or the topic representation?
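For context, the two can be inspected separately; a minimal sketch, where the topic id `2` is just a placeholder:

```python
topic_id = 2  # placeholder topic id

# The topic representation: the top words and their c-TF-IDF weights
print(topic_model.get_topic(topic_id))

# The documents BERTopic considers most representative of that topic
print(topic_model.get_representative_docs(topic_id))
```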

mahmawad commented 3 months ago

This is the full code:

```python
def get_topic_modeling(df, prompt, model, tokenizer):
    """
    Generates a topic model for a given DataFrame using various NLP and clustering techniques.

    Args:
        df (pd.DataFrame): DataFrame containing the preprocessed text data.
        prompt (str): The prompt for the text generation model.
        model (str): The name or path of the model to be used for text generation.
        tokenizer (str): The tokenizer to be used with the model.

    Returns:
        topic_model: The trained BERTopic model.
        topics: The topics identified by the model.
        probs: The probabilities of the topics.
    """
    from sentence_transformers import SentenceTransformer
    from torch import bfloat16
    import transformers
    from torch import cuda
    import pandas as pd

    # Initialize text generation pipeline
    generator = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        task='text-generation',
        temperature=0.1,
        max_new_tokens=20,
        repetition_penalty=1.1
    )

    # Pre-calculate embeddings using SentenceTransformer
    embedding_model = SentenceTransformer("all-mpnet-base-v2")
    embeddings = embedding_model.encode(df['PreprocessedText'].tolist(), show_progress_bar=True)

    from umap import UMAP
    # Initialize UMAP model for dimensionality reduction
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

    # Reduce embeddings to two dimensions for visualization
    # (note: not used further in this function)
    reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

    from hdbscan import HDBSCAN
    # Initialize HDBSCAN model for clustering
    hdbscan_model = HDBSCAN(metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_cluster_size=10)

    from sklearn.cluster import KMeans
    # Initialize KMeans clustering model
    # (note: defined here but not passed to BERTopic below)
    cluster_model = KMeans(random_state=42, n_clusters=11)

    from sklearn.feature_extraction.text import CountVectorizer
    # Create a CountVectorizer with custom stop words
    custom_exclude_words = ["world", "automotive", "post", 'first', 'new', 'car', 'cars', 'vehicle', 'vehicles', 'say', 'hello', 'welcome']
    vectorizer_model = CountVectorizer(stop_words=custom_exclude_words, min_df=3)

    from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration
    # Initialize text generation model with Llama 2
    llama2 = TextGeneration(generator, prompt)

    # Dictionary of representation models
    representation_model = {"Llama2": llama2}

    from bertopic import BERTopic
    # Initialize and train BERTopic model
    topic_model = BERTopic(
        embedding_model=embedding_model,
        vectorizer_model=vectorizer_model,
        umap_model=umap_model,
        calculate_probabilities=True,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        top_n_words=10,
        verbose=True,
    )

    # Fit the topic model and transform the data
    topics, probs = topic_model.fit_transform(df['PreprocessedText'].values, embeddings)

    return topic_model, topics, probs
```
mahmawad commented 3 months ago

The prompt:

example_prompt = """ I have a topic that contains the following documents:

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more. Make sure not to mention any Companies or Cities names.

[/INST] Environmental impacts of eating meat """ main_prompt = """ [INST] I have a topic that contains the following documents: [DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more. Make sure not to mention any Companies or Cities names. [/INST] """ prompt = example_prompt + main_prompt
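With the two parts combined into `prompt`, the function above would presumably be called along these lines (the checkpoint id is a placeholder, not taken from this thread):

```python
# Placeholder Llama 2 checkpoint; any chat model compatible with the
# [INST] ... [/INST] prompt format would fit here
model_id = "meta-llama/Llama-2-13b-chat-hf"

topic_model, topics, probs = get_topic_modeling(
    df,
    prompt,
    model=model_id,
    tokenizer=model_id,
)
```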

mahmawad commented 3 months ago

It's a topic representation. Here are some output examples: 'AI and Data Industry Trends' and 'President's son trial in Manhattan'.

As you can see, the first one is good since it's a general topic label, but the second one isn't representative. Do you think the problem is with the prompt?

MaartenGr commented 3 months ago

> But the second one isn't representative. Do you think the problem is with the prompt?

It might be, but it depends on the LLM that you are using. It's not in the code explicitly, but it seems you are using Llama 2 (I can't see which version). You could also use Llama 3, which is quite a bit better, or other newer models like Mistral, Phi-3, Command R+, Qwen2, etc.
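For example, swapping in a newer model only requires changing the pipeline passed to `TextGeneration`; a minimal sketch, where the model id is an assumption and the `[INST]` tags in the prompt would need to be adapted to that model's chat template:

```python
import transformers

# Placeholder id for a newer instruction-tuned model
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

generator = transformers.pipeline(
    model=model_id,
    tokenizer=model_id,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=20,
    repetition_penalty=1.1,
)
```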

Note that you can also track the prompts with topic_model.representation_model["Llama2"].prompts_ (the key matches the dictionary passed during training). You might find something of interest there.
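For example, assuming `prompts_` holds one prompt per generated label:

```python
# Inspect the exact prompts that were sent to the LLM; the "Llama2" key
# matches the representation_model dictionary used during training
for p in topic_model.representation_model["Llama2"].prompts_:
    print(p)
```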