JessDataNLP closed this issue 8 months ago.
Could you share your full code and demonstrate which exact labels (i.e., which column in topics_description) are not being updated? Also, which version of BERTopic are you using? It would help me get a better idea of what exactly is happening. Thanks!
Good morning Maarten, thanks for your reply. I am using version 0.16.0, running in Visual Studio Code via WSL. Here is the whole code:
# imports
import pandas as pd
import torch

# import df
df = pd.read_csv('df_v1.csv', index_col=False)
# Access the column containing the documents from the DataFrame
docs = df["clean_text"].tolist()
#check GPU availability
if torch.cuda.is_available():
device = "cuda:0"
else:
device = "cpu"
device = torch.device(device)
#define model embeddings
from sentence_transformers import SentenceTransformer
embedding_model_name = "paraphrase-multilingual-MiniLM-L12-v2"
embedding_model = SentenceTransformer(embedding_model_name, device=device)  # use the device selected above
# Integrate BERTopic with GPT-3.5 for topic labels
import openai
from bertopic.representation import OpenAI
k = "sk..."
client = openai.OpenAI(api_key=k)
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short topic label in the following format:
topic: <topic label>"""
representation_model = {'keywords': '[KEYWORDS]', 'LLM_description': OpenAI(client, model="gpt-3.5-turbo", chat=True, prompt=prompt)}
from sentence_transformers import SentenceTransformer
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
# Define sub-models
umap_model = UMAP(n_neighbors= 10, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=90, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
from bertopic import BERTopic
topic_model = BERTopic(
# Sub-models
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model,
representation_model=representation_model,
# Hyperparameters
top_n_words=10,
verbose=True
)
# Train model
topics, probs = topic_model.fit_transform(docs)
# Show topics in df
topic_description = topic_model.get_topic_info()
# topic tree
hierarchical_topics = topic_model.hierarchical_topics(docs)
tree = topic_model.get_topic_tree(hierarchical_topics)
print(tree)
# I will merge some topics according to the results of the topic tree
to_merge = [[180,90],[71,10],[114,145],[110,139],[91,97],[104,162],[62,116],[89,161],[171,38],[124,65],[83,99],[160,102],[79,6],[26,56],[98,142]]
topic_model.merge_topics(docs, to_merge)
#update the topic description
topics_description_new = topic_model.get_topic_info()
Basically, the problem is with the new df containing the topics, topics_description_new. After merging the similar topics, the new df has the updated number of topics (170) and the following columns: Topic | Count | Name | Representation | LLM_description | Representative_Docs. The problem is that the LLM_description column is not updated, so it contains the wrong labels: it still holds the old labels from the first output of the model, topic_description. Is there a recommended way to update the LLM_description column for merged topics in BERTopic, or should this process be handled manually?
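To make the mismatch concrete, this is roughly how I compare the two outputs (just a sketch, using the column names listed above):
# side-by-side check (sketch): the LLM_description values in the merged output still
# come from the pre-merge run, while the other columns reflect the merged topics
print(topic_description[["Topic", "Name", "LLM_description"]].head(10))
print(topics_description_new[["Topic", "Name", "LLM_description"]].head(10))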
I am not entirely sure, but the following is not accepted by BERTopic:
representation_model = {'keywords': '[KEYWORDS]', 'LLM_description': OpenAI(client, model="gpt-3.5-turbo", chat=True, prompt=prompt)}
You should use the following instead:
representation_model = {'LLM_description': OpenAI(client, model="gpt-3.5-turbo", chat=True, prompt=prompt)}
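With a dictionary like that, the aspect shows up as its own LLM_description column in get_topic_info() (and in topic_model.topic_aspects_). If you also want a keyword-style aspect, you would pass an actual representation model as the value rather than the '[KEYWORDS]' placeholder string, for example something along these lines (a sketch, not tested here):
# sketch of a multi-aspect setup: the default c-TF-IDF stays the main representation,
# "keywords" becomes a KeyBERT-inspired aspect, and "LLM_description" the GPT label
from bertopic.representation import KeyBERTInspired, OpenAI

representation_model = {
    "keywords": KeyBERTInspired(),
    "LLM_description": OpenAI(client, model="gpt-3.5-turbo", chat=True, prompt=prompt),
}

# after fitting, both aspects appear as columns:
# topic_model.get_topic_info()[["Topic", "keywords", "LLM_description"]]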
It is working now, thanks! That was an easy fix.
Good morning, I am facing a problem with the topic labels after merging.
Specifically, after merging and updating the topics, the df that contains the list of topics does not update the labels created with the representation model (I use the ChatGPT API), which means they are mismatched vis-à-vis the updated topics (even though the number of labels equals the number of new topics after merging). Is there a way to update these labels?
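For what it is worth, the workaround I am considering is to re-run the representation step after merging, although I am not sure whether this is the intended approach (it does call the OpenAI API again):
# possible workaround (untested): regenerate the representations, including the
# LLM_description aspect, after merging so the labels line up with the merged topics
topic_model.merge_topics(docs, to_merge)
topic_model.update_topics(docs, representation_model=representation_model)
topics_description_new = topic_model.get_topic_info()  # LLM_description recomputed here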