Open mahmawad opened 5 months ago
Thanks for sharing but I am not familiar with your .py script. I will need a bit more information to understand what is happening here. Could you share your full code along with the version of BERTopic you are using?
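One way to collect the version information, as a minimal sketch: the helper name `installed_version` is hypothetical, and the package list is only an assumption based on the libraries imported in the script. It relies on the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or 'not installed'."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


# PyPI distribution names of the libraries the script imports (assumed list)
for name in ("bertopic", "sentence-transformers", "umap-learn", "hdbscan", "plotly"):
    print(f"{name}: {installed_version(name)}")
```

Running this with the same interpreter that runs the .py script shows which versions that script actually uses.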
Thank you for replying.
It is a normal import of Llama 2, and then I save the visualization using the write_html function:
from sentence_transformers import SentenceTransformer
# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedding_model.encode(df_articles['PreprocessedText'].tolist(), show_progress_bar=True)
# ft = api.load('fasttext-wiki-news-subwords-300')
#
# In[18]:
from umap import UMAP
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
# In[19]:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
# In[20]:
from hdbscan import HDBSCAN
hdbscan_model = HDBSCAN(metric='euclidean', cluster_selection_method='eom', prediction_data=True,min_cluster_size=15)
# In[20]:
from sklearn.cluster import KMeans
#cluster_model = KMeans(n_clusters=6, random_state=42)
cluster_model = KMeans(random_state=42,n_clusters=11)
# In[21]:
from sklearn.feature_extraction.text import CountVectorizer
# Custom list of words to exclude
custom_exclude_words = ["world", "automotive", "post",'first','new','car','cars','vehicle','vehicles','say']
# Use the custom words as the stop-word list (the standard stop words are not merged in here)
vectorizer_model = CountVectorizer(stop_words=custom_exclude_words, min_df=3)
# In[22]:
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration
# KeyBERT
keybert = KeyBERTInspired()
# MMR
mmr = MaximalMarginalRelevance(diversity=0.3)
# Text generation with Llama 2
llama2 = TextGeneration(generator, prompt=prompt)
# All representation models
representation_model = {
"KeyBERT": keybert,
"Llama2": llama2,
"MMR": mmr,
}
# In[2]:
"""
import torch
print(torch.cuda.memory_summary(device=None, abbreviated=False))
torch.cuda.empty_cache()
"""
# In[23]:
topics_inp=df_articles['PreprocessedText'].tolist()
# In[24]:
from bertopic import BERTopic
topic_model = BERTopic(
# Pipeline models
embedding_model=embedding_model,
vectorizer_model=vectorizer_model,
umap_model=umap_model,
hdbscan_model=cluster_model,
representation_model=representation_model,
#ctfidf_model=ctfidf_model,
# Hyperparameters
top_n_words=10,
verbose=True,
)
topics, probs = topic_model.fit_transform(topics_inp,embeddings)
# In[27]:
#topic_model.merge_topics(df_articles['PreprocessedText'].tolist(),[5,2])
# In[ ]:
# use one of the other topic representations, like KeyBERTInspired
#keybert_topic_labels = {topic: " | ".join(list(zip(*values))[0][:4]) for topic, values in topic_model.topic_aspects_["Llama2"].items()}
#topic_model.set_topic_labels(keybert_topic_labels)
# In[28]:
llama2_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["Llama2"].values()]
topic_model.set_topic_labels(llama2_labels)
# In[29]:
topic_model.get_topic_info()
# In[30]:
# Visualize the documents in 2-dimensional space and show the titles on hover instead of the abstracts
# NOTE: You can hide the hover with `hide_document_hover=True` which is especially helpful if you have a large dataset
viss=topic_model.visualize_documents(topics_inp, custom_labels=True,hide_annotations=False,hide_document_hover=False)
path_file = r"/home/amahmoud/workspace/vis_two_week_visul.html"
viss.write_html(path_file)
Could you check what labels you set in llama2_labels? It might be that Llama 2 did not create all of the labels.
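A minimal sketch of such a check, assuming llama2_labels is the plain list of strings built in the script above. The helper name `find_suspect_labels` and the sample labels are hypothetical and not part of BERTopic.

```python
from collections import Counter


def find_suspect_labels(labels):
    """Return (empty, duplicated): positions of blank labels and labels used more than once."""
    cleaned = [str(label).strip() for label in labels]
    empty = [i for i, label in enumerate(cleaned) if not label]
    counts = Counter(label for label in cleaned if label)
    duplicated = sorted(label for label, n in counts.items() if n > 1)
    return empty, duplicated


# Hypothetical labels, standing in for the real llama2_labels list
sample_labels = ["Electric vehicle sales", "", "Chip shortage", "Chip shortage"]
empty, duplicated = find_suspect_labels(sample_labels)
print("empty label positions:", empty)
print("duplicated labels:", duplicated)
```

Calling `find_suspect_labels(llama2_labels)` right before `set_topic_labels` in both the notebook and the .py run would show whether the two environments produce the same labels.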
I checked them, but I think the problem only occurs when I run it as a .py script. It works well when I run it in a Jupyter Notebook, but I need it in a .py file so it can be automated.
That's strange, as I believe the output is plain HTML and should not render differently when generated from a Jupyter Notebook compared to a .py script.
When I run the topic modeling in a .py script, I get this issue.