Open rashigupta8496 opened 4 weeks ago
At the moment it is not possible to combine hierarchical topic modeling with zero-shot topic modeling. Instead, you will have to run .update_topics
in order to get the c-TF-IDF representations that hierarchical topic modeling needs.
There is currently some work being done on this, but I can't say when it will be released.
Thanks a lot for your response! I tried running the following update, but it failed ("topics" is the output of topic_model.fit_transform(docs)):
topic_model.update_topics(docs, topics=topics)
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[98], line 1
----> 1 topic_model.update_topics(docs, topics=topics)
File /mnt/xarfuse/uid-564347/e4f2f620-seed-nspid4026531836_cgpid15010019-ns-4026531841/bertopic/_bertopic.py:1429, in BERTopic.update_topics(self, docs, images, topics, top_n_words, n_gram_range, vectorizer_model, ctfidf_model, representation_model)
1427 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
1428 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 1429 if set(topics) != self.topics_:
1430 self._create_topic_vectors()
1431 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
1432 for key, values in
1433 self.topic_representations_.items()}
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any insights on how to run the update in this case would be helpful, thanks.
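As an aside, the error message itself can be reproduced in isolation: comparing a Python set against a NumPy array yields an element-wise boolean array, and using that result in an `if` raises exactly this ValueError. The snippet below is a standalone illustration of that mechanism, not BERTopic code; it assumes (based on the traceback) that `self.topics_` holds a NumPy array on the zero-shot path:

```python
import numpy as np

# Stand-in for self.topics_ being a NumPy array (an assumption based on the traceback).
topics_ = np.array([0, 1, 2])

try:
    # Mirrors the failing line `if set(topics) != self.topics_:` from the traceback.
    # set.__ne__ returns NotImplemented for an ndarray, so NumPy performs an
    # element-wise comparison, and `if` on the resulting array is ambiguous.
    if set([0, 1, 2]) != topics_:
        pass
except ValueError as err:
    print(err)
```

The printed message matches the one in the traceback above.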
Could you share the full code? It is not entirely clear to me which variables/attributes you used.
Thanks for your response. I am running the following basic version for now:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from data.docs import FetchDataToCluster
from data.seed import FetchSeedForCluster
docs = FetchDataToCluster()
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=10,
    zeroshot_topic_list=FetchSeedForCluster(),
    zeroshot_min_similarity=0.85,
    representation_model=KeyBERTInspired(),
)
topics, _ = topic_model.fit_transform(docs)
topic_model.update_topics(docs, topics=topics)
hierarchical_topics = topic_model.hierarchical_topics(docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
And I get the following error:
ValueError Traceback (most recent call last)
Cell In[10], line 1
----> 1 topic_model.update_topics(docs, topics=topics)
File /mnt/xarfuse/uid-564347/6641696b-seed-nspid4026531836_cgpid20335105-ns-4026531841/bertopic/_bertopic.py:1429, in BERTopic.update_topics(self, docs, images, topics, top_n_words, n_gram_range, vectorizer_model, ctfidf_model, representation_model)
1427 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
1428 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 1429 if set(topics) != self.topics_:
1430 self._create_topic_vectors()
1431 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
1432 for key, values in
1433 self.topic_representations_.items()}
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
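For what it's worth, the comparison in the failing line becomes unambiguous if both sides are plain Python sets. The sketch below shows that pattern in isolation; it is a hypothetical guard, not the actual BERTopic patch:

```python
import numpy as np

topics = [0, 1, 2, 0]          # per-document topic assignments, as returned by fit_transform
topics_ = np.array([0, 1, 2])  # stand-in for self.topics_ as an ndarray (assumption)

# Casting both sides to set compares the unique topic ids directly,
# avoiding the element-wise ndarray comparison that raised the ValueError.
changed = set(topics) != set(topics_.tolist())
print(changed)  # False: the unique topic ids match
```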
Aside from the potential issue you are facing: with the PR that was just pushed to the main branch, you can now remove the .update_topics
line and instead do it as follows:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from data.docs import FetchDataToCluster
from data.seed import FetchSeedForCluster
docs = FetchDataToCluster()
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=10,
    zeroshot_topic_list=FetchSeedForCluster(),
    zeroshot_min_similarity=0.85,
    representation_model=KeyBERTInspired(),
)
topics, _ = topic_model.fit_transform(docs)
hierarchical_topics = topic_model.hierarchical_topics(docs, use_ctfidf=False)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics, use_ctfidf=False)
Having said that, it is strange that you get this issue since I cannot seem to reproduce it with this:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
# Extract abstracts to train on and corresponding titles
abstracts = dataset["abstract"][:10_000]
titles = dataset["title"][:10_000]
# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
# Train model
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]
topic_model = BERTopic(
    embedding_model=embedding_model,
    min_topic_size=10,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.85,
    representation_model=KeyBERTInspired(),
)
topics, _ = topic_model.fit_transform(abstracts, embeddings)
# Update topics
topic_model.update_topics(abstracts, topics=topics)
Note that I'm using BERTopic v0.16.2 for this piece of code.
Hi, I am trying zero-shot topic modeling with BERTopic. The following fit_transform ran successfully:
While running hierarchical_topics = topic_model.hierarchical_topics(docs), I get the following error:
zeroshot_topic_list contains 251 topics and docs holds about 15k documents. Similar code without the zero-shot list works fine. Please let me know if you have any insights, thanks!