JessDataNLP closed this issue 8 months ago.
Could you share your full code and demonstrate which exact labels (i.e., which column in topics_description) are not being updated? Also, which version of BERTopic are you using? It would help me get a better idea of what exactly is happening. Thanks!
Good morning Maarten, thanks for your reply. I am using version 0.16.0, running in Visual Studio Code via WSL. Here is the whole code:
# imports
import pandas as pd
import torch

# import df
df = pd.read_csv('df_v1.csv', index_col=False)
# Access the column containing the documents from the DataFrame
docs = df["clean_text"].tolist()
#check GPU availability
if torch.cuda.is_available():
device = "cuda:0"
else:
device = "cpu"
device = torch.device(device)
#define model embeddings
from sentence_transformers import SentenceTransformer
embedding_model_name = "paraphrase-multilingual-MiniLM-L12-v2"
embedding_model = SentenceTransformer(embedding_model_name, device=device)  # use the device selected above
# Integrate BERTopic with GPT-3.5 for topic labels
import openai
from bertopic.representation import OpenAI
k = "sk..."
client = openai.OpenAI(api_key=k)
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short topic label in the following format:
topic: <topic label>"""
representation_model = {'keywords': '[KEYWORDS]', 'LLM_description': OpenAI(client, model="gpt-3.5-turbo", chat=True, prompt=prompt)}
from sentence_transformers import SentenceTransformer
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
# Define sub-models
umap_model = UMAP(n_neighbors= 10, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=90, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
from bertopic import BERTopic
topic_model = BERTopic(
# Sub-models
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model,
representation_model=representation_model,
# Hyperparameters
top_n_words=10,
verbose=True
)
# Train model
topics, probs = topic_model.fit_transform(docs)
# Show topics in df
topic_description = topic_model.get_topic_info()
# topic tree
hierarchical_topics = topic_model.hierarchical_topics(docs)
tree = topic_model.get_topic_tree(hierarchical_topics)
print(tree)
# I will merge some topics according to the results of the topic tree
to_merge = [[180,90],[71,10],[114,145],[110,139],[91,97],[104,162],[62,116],[89,161],[171,38],[124,65],[83,99],[160,102],[79,6],[26,56],[98,142]]
topic_model.merge_topics(docs, to_merge)
#update the topic description
topics_description_new = topic_model.get_topic_info()
Basically, the problem is with the new df containing the topics, topics_description_new. After merging the similar topics, the new df has the updated number of topics (170) and the following columns: Topic | Count | Name | Representation | LLM_description | Representative_Docs. The problem is that the LLM_description column is not updated, so it contains the wrong labels: it still holds the old labels from the first output of the model, topic_description. Is there a recommended way to update the LLM_description column for merged topics in BERTopic, or should this process be handled manually?
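To make the mismatch concrete, this is roughly how I compare the two outputs (just a sketch, using the column names listed above):
# side-by-side check (sketch): the LLM_description values in the merged output still
# come from the pre-merge run, while the other columns reflect the merged topics
print(topic_description[["Topic", "Name", "LLM_description"]].head(10))
print(topics_description_new[["Topic", "Name", "LLM_description"]].head(10))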
I am not entirely sure, but the following is not accepted by BERTopic:
representation_model = {'keywords': '[KEYWORDS]', 'LLM_description': OpenAI(client, model="gpt-3.5-turbo", chat=True, prompt=prompt)}
You should use the following instead:
representation_model = {'LLM_description': OpenAI(client, model="gpt-3.5-turbo", chat=True, prompt=prompt)}
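With a dictionary like that, the aspect shows up as its own LLM_description column in get_topic_info() (and in topic_model.topic_aspects_). If you also want a keyword-style aspect, you would pass an actual representation model as the value rather than the '[KEYWORDS]' placeholder string, for example something along these lines (a sketch, not tested here):
# sketch of a multi-aspect setup: the default c-TF-IDF stays the main representation,
# "keywords" becomes a KeyBERT-inspired aspect, and "LLM_description" the GPT label
from bertopic.representation import KeyBERTInspired, OpenAI

representation_model = {
    "keywords": KeyBERTInspired(),
    "LLM_description": OpenAI(client, model="gpt-3.5-turbo", chat=True, prompt=prompt),
}

# after fitting, both aspects appear as columns:
# topic_model.get_topic_info()[["Topic", "keywords", "LLM_description"]]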
It is working now, thanks! That was an easy fix.
Good morning, I am facing a problem with the topic labels after merging.
Specifically, after merging and updating the topics, the df that contains the list of topics does not update the labels created with the representation model (I use the ChatGPT API), which means they are mismatched vis-à-vis the updated topics (even though the number of labels equals the number of new topics after merging). Is there a way to update these labels?
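For what it is worth, the workaround I am considering is to re-run the representation step after merging, although I am not sure whether this is the intended approach (it does call the OpenAI API again):
# possible workaround (untested): regenerate the representations, including the
# LLM_description aspect, after merging so the labels line up with the merged topics
topic_model.merge_topics(docs, to_merge)
topic_model.update_topics(docs, representation_model=representation_model)
topics_description_new = topic_model.get_topic_info()  # LLM_description recomputed here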