Open mahmawad opened 5 months ago
First I would like to thank you for your great tool.
Thank you for the kind words!
Is there a way to make the representation more general?
It's difficult to say without seeing the full code, versions, output of .topic_info, etc. For instance, it's not clear to me which topic representation model you use. Could you provide a bit more information? I need your full training code, your version of BERTopic, and the output of running .get_topic_info.
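The details requested above can be collected in a few lines. This is only a minimal sketch: `collect_versions` and `summarize` are hypothetical helper names, the package list is an assumption about what is relevant here, and `topic_model` stands for an already fitted BERTopic model.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages=("bertopic", "transformers", "sentence-transformers",
                               "umap-learn", "hdbscan")):
    """Return {package: installed version}, or "not installed" when missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


def summarize(topic_model):
    """Print the topic overview of a fitted BERTopic model (hypothetical helper)."""
    # get_topic_info() returns a DataFrame with one row per topic.
    print(topic_model.get_topic_info())


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name}: {ver}")
```

Pasting the printed versions together with the `summarize(topic_model)` output into the thread should cover most of what was asked for.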
This is one of the topic representations in my documents: "President's son trial in Manhattan"
Is this a representative document or the topic representation?
This is the full code:
def get_topic_modeling(df, prompt, model, tokenizer):
    """
    Generates a topic model for a given DataFrame using various NLP and clustering techniques.

    Args:
        df (pd.DataFrame): DataFrame containing the preprocessed text data.
        prompt (str): The prompt for the text generation model.
        model (str): The name or path of the model to be used for text generation.
        tokenizer (str): The tokenizer to be used with the model.

    Returns:
        topic_model: The trained BERTopic model.
        topics: The topics identified by the model.
        probs: The probabilities of the topics.
    """
    from sentence_transformers import SentenceTransformer
    from torch import bfloat16
    import transformers
    from torch import cuda
    import pandas as pd

    # Initialize text generation pipeline
    generator = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        task='text-generation',
        temperature=0.1,
        max_new_tokens=20,
        repetition_penalty=1.1
    )

    # Pre-calculate embeddings using SentenceTransformer
    embedding_model = SentenceTransformer("all-mpnet-base-v2")
    embeddings = embedding_model.encode(df['PreprocessedText'].tolist(), show_progress_bar=True)

    from umap import UMAP
    # Initialize UMAP model for dimensionality reduction
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
    # Reduce embeddings dimensions for visualization
    reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

    from hdbscan import HDBSCAN
    # Initialize HDBSCAN model for clustering
    hdbscan_model = HDBSCAN(metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_cluster_size=10)

    from sklearn.cluster import KMeans
    # Initialize KMeans clustering model
    cluster_model = KMeans(random_state=42, n_clusters=11)

    from sklearn.feature_extraction.text import CountVectorizer
    # Create a CountVectorizer with custom stop words
    custom_exclude_words = ["world", "automotive", "post", 'first', 'new', 'car', 'cars', 'vehicle', 'vehicles', 'say', 'hello', 'welcome']
    vectorizer_model = CountVectorizer(stop_words=custom_exclude_words, min_df=3)

    from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration
    # Initialize text generation model with Llama 2
    llama2 = TextGeneration(generator, prompt)
    # Dictionary of representation models
    representation_model = {"Llama2": llama2}

    from bertopic import BERTopic
    # Initialize and train BERTopic model
    topic_model = BERTopic(
        embedding_model=embedding_model,
        vectorizer_model=vectorizer_model,
        umap_model=umap_model,
        calculate_probabilities=True,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        top_n_words=10,
        verbose=True,
    )

    # Fit the topic model and transform the data
    topics, probs = topic_model.fit_transform(df['PreprocessedText'].values, embeddings)
    return topic_model, topics, probs
The prompt:
example_prompt = """
I have a topic that contains the following documents:

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more. Make sure not to mention any Companies or Cities names.
[/INST] Environmental impacts of eating meat
"""

main_prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more. Make sure not to mention any Companies or Cities names.
[/INST]
"""

prompt = example_prompt + main_prompt
It's a topic representation, and here are some output examples: 'AI and Data Industry Trends' and 'President's son trial in Manhattan'.
As you can see, the first one is good, since it's a general topic label,
but the second one isn't representative. Do you think the problem is with the prompt?
It might be, but it also depends on the LLM that you are using. The code doesn't say so explicitly, but it seems you are using Llama 2 (I can't see which version). You could also try Llama 3, which is quite a bit better, or other newer models like Mistral, Phi-3, Command R+, Qwen2, etc.
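Note that you can also inspect the prompts that were sent to the model with: topic_model.representation_model["Llama2"].prompts_ (the key has to match the one used in your representation_model dictionary). You might find something of interest there.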
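To make clearer what those tracked prompts are likely to contain, here is a rough sketch of how a template with `[DOCUMENTS]` and `[KEYWORDS]` placeholders ends up as a concrete prompt. `fill_prompt` is a hypothetical helper written for illustration and only approximates the substitution; it is not BERTopic's own code, and the documents and keywords are made up.

```python
def fill_prompt(template, documents, keywords):
    """Substitute [DOCUMENTS] and [KEYWORDS] in a prompt template (illustrative only)."""
    # Each document becomes one bullet line.
    doc_block = "".join(f"- {doc}\n" for doc in documents)
    return (
        template
        .replace("[DOCUMENTS]", doc_block)
        .replace("[KEYWORDS]", ", ".join(keywords))
    )


template = (
    "[INST] I have a topic that contains the following documents:\n"
    "[DOCUMENTS]\n"
    "The topic is described by the following keywords: '[KEYWORDS]'.\n"
    "[/INST]"
)

# Made-up inputs, purely for illustration.
filled = fill_prompt(
    template,
    documents=["Doc about a court case.", "Doc about an election."],
    keywords=["trial", "election", "senate"],
)
print(filled)
```

The point of the sketch is that the LLM only sees the handful of documents and keywords substituted into the template. If those happen to revolve around a single event, a very specific label is a plausible outcome, which is the kind of thing the tracked prompts can reveal.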
First, I would like to thank you for your great tool.
I have a question. This is one of the topic representations in my documents:
"President's son trial in Manhattan"
However, most of the documents under this topic aren't related to Hunter Biden, although they do mostly talk about politics.
Is there a way to make the representation more general?
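If the prompt does turn out to be part of the cause, one thing that could be tried is asking for a broad theme explicitly. The sketch below is an untested assumption rather than advice from the thread: `GENERALITY_HINT` and its wording are hypothetical, and whether it helps will depend on the LLM being used.

```python
# Hypothetical extra instruction nudging the model toward broad labels.
GENERALITY_HINT = (
    "Prefer a broad theme that covers all of the documents over a single "
    "event, person, or place."
)

base_instruction = (
    "Based on the information about the topic above, please create a short "
    "label of this topic. Make sure you only return the label and nothing more."
)

# Same placeholders as the original main_prompt, plus the extra hint.
main_prompt = f"""
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

{base_instruction} {GENERALITY_HINT}
[/INST]
"""
print(main_prompt)
```

A matching example prompt with a deliberately general label would probably be needed as well, since the few-shot example also shapes what the model returns.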