MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
5.76k stars, 716 forks

TypeError: 'NoneType' object is not subscriptable while calling topic_model.hierarchical_topics #2028

Open rashigupta8496 opened 4 weeks ago

rashigupta8496 commented 4 weeks ago

Hi, I am trying zero-shot topic modelling with BERTopic. The following fit_transform ran successfully:

    topic_model = BERTopic(
        embedding_model="thenlper/gte-small", 
        min_topic_size=50,
        zeroshot_topic_list=zeroshot_topic_list,
        zeroshot_min_similarity=.85,
        representation_model=KeyBERTInspired()
    )
    topics, _ = topic_model.fit_transform(docs)

topic_model.get_topic_info()

Running hierarchical_topics = topic_model.hierarchical_topics(docs) then raises the following error:

TypeError:'NoneType' object is not subscriptable
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[87], line 1
----> 1 hierarchical_topics = topic_model.hierarchical_topics(docs)
File /mnt/xarfuse/uid-564347/e4f2f620-seed-nspid4026531836_cgpid15010019-ns-4026531841/bertopic/_bertopic.py:975, in BERTopic.hierarchical_topics(self, docs, linkage_function, distance_function)
    972     linkage_function = lambda x: sch.linkage(x, 'ward', optimal_ordering=True)
    974 # Calculate distance
--> 975 embeddings = self.c_tf_idf_[self._outliers:]
    976 X = distance_function(embeddings)
    977 X = validate_distance_matrix(X, embeddings.shape[0])
TypeError: 'NoneType' object is not subscriptable

zeroshot_topic_list contains 251 topics and docs contains 15k documents. Similar code without zero-shot works fine. Please let me know if you have any insights, thanks!

MaartenGr commented 4 weeks ago

At the moment it is not possible to use hierarchical topic modeling with zero-shot topic modeling. Instead, you will have to run .update_topics in order to get the c-TF-IDF representations that are needed for hierarchical topic modeling.

Some work is currently being done on this, but I can't be sure when it will be released.
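
For readers following along: the traceback above fails at self.c_tf_idf_[self._outliers:] because c_tf_idf_ is None. The snippet below is a minimal sketch of a pre-check, assuming only that the model exposes a c_tf_idf_ attribute as the traceback shows. The helper has_ctfidf is hypothetical (it is not part of BERTopic), and the SimpleNamespace object merely stands in for a model.

```python
from types import SimpleNamespace


def has_ctfidf(model):
    """Return True if the model has a populated c_tf_idf_ attribute."""
    # Hypothetical helper for illustration; not a BERTopic function.
    return getattr(model, "c_tf_idf_", None) is not None


# Stand-in for a model whose c-TF-IDF matrix was never computed.
# With a real model, pass topic_model instead.
without_ctfidf = SimpleNamespace(c_tf_idf_=None)
print(has_ctfidf(without_ctfidf))
```

If the check is false, calling hierarchical_topics with the default settings would be expected to hit the same TypeError shown above.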

rashigupta8496 commented 3 weeks ago

Thanks a lot for your response! I tried running the following update, but it failed. Here, topics is the output of topic_model.fit_transform(docs):

topic_model.update_topics(docs, topics=topics)

Error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[98], line 1
----> 1 topic_model.update_topics(docs, topics=topics)
File /mnt/xarfuse/uid-564347/e4f2f620-seed-nspid4026531836_cgpid15010019-ns-4026531841/bertopic/_bertopic.py:1429, in BERTopic.update_topics(self, docs, images, topics, top_n_words, n_gram_range, vectorizer_model, ctfidf_model, representation_model)
   1427 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   1428 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 1429 if set(topics) != self.topics_:
   1430     self._create_topic_vectors()
   1431 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
   1432                       for key, values in
   1433                       self.topic_representations_.items()}
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Any insight into how to run the update in this case would be helpful, thanks.
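
For readers hitting the same ValueError: the message is the one NumPy raises when an array with several elements is used as a condition. The line set(topics) != self.topics_ would produce such an array if self.topics_ were a NumPy array rather than a list. Whether that is what happens inside BERTopic for zero-shot models is an assumption here; the sketch below only reproduces the Python/NumPy behavior with stand-in variables.

```python
import numpy as np

topics = [0, 1, 1, 2]        # plain list, like the value passed as topics=
stored = np.array(topics)    # assumption: stands in for topic_model.topics_

# set != ndarray is evaluated element-wise by NumPy, so the result is an array
comparison = set(topics) != stored

try:
    bool(comparison)         # what an `if` statement does implicitly
    message = None
except ValueError as err:
    message = str(err)
print(message)

# Comparing a set with a set yields a single, unambiguous boolean
same_labels = set(topics) == set(stored.tolist())
print(same_labels)
```

Checking type(topic_model.topics_) after fit_transform would show whether this explanation applies to a given setup.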

MaartenGr commented 3 weeks ago

> Thanks a lot for your response! I tried running the following update but it failed, "topics" are output of

Could you share the full code? It is not entirely clear to me which variables/attributes you used.

rashigupta8496 commented 3 weeks ago

Thanks for your response. I am running the following basic version for now:

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from data.docs import FetchDataToCluster
from data.seed import FetchSeedForCluster

docs = FetchDataToCluster()
topic_model = BERTopic(
    embedding_model="thenlper/gte-small", 
    min_topic_size=10,
    zeroshot_topic_list=FetchSeedForCluster(),
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(docs)

topic_model.update_topics(docs, topics=topics)

hierarchical_topics = topic_model.hierarchical_topics(docs)

topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

And I get the following error:

ValueError                                Traceback (most recent call last)
Cell In[10], line 1
----> 1 topic_model.update_topics(docs, topics=topics)
File /mnt/xarfuse/uid-564347/6641696b-seed-nspid4026531836_cgpid20335105-ns-4026531841/bertopic/_bertopic.py:1429, in BERTopic.update_topics(self, docs, images, topics, top_n_words, n_gram_range, vectorizer_model, ctfidf_model, representation_model)
   1427 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   1428 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 1429 if set(topics) != self.topics_:
   1430     self._create_topic_vectors()
   1431 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
   1432                       for key, values in
   1433                       self.topic_representations_.items()}
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

MaartenGr commented 3 weeks ago

Aside from the potential issue that you are facing: with the PR that was just pushed to the main branch, you can now remove the .update_topics line and do the following instead:

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from data.docs import FetchDataToCluster
from data.seed import FetchSeedForCluster

docs = FetchDataToCluster()
topic_model = BERTopic(
    embedding_model="thenlper/gte-small", 
    min_topic_size=10,
    zeroshot_topic_list=FetchSeedForCluster(),
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(docs)

hierarchical_topics = topic_model.hierarchical_topics(docs, use_ctfidf=False)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics, use_ctfidf=False)

Having said that, it is strange that you get this issue since I cannot seem to reproduce it with this:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on and corresponding titles
abstracts = dataset["abstract"][:10_000]
titles = dataset["title"][:10_000]

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

# Train model
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]
topic_model = BERTopic(
    embedding_model=embedding_model,
    min_topic_size=10,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(abstracts, embeddings)

# Update topics
topic_model.update_topics(abstracts, topics=topics)

Note that I'm using BERTopic v0.16.2 for this piece of code.
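
When comparing results across machines as in this thread, it helps to report the installed versions alongside the traceback. The snippet below is a small sketch that uses only the standard library; installed_version is a hypothetical helper name, and the package names listed are just the ones relevant here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions that matter for reproducing the issue
for name in ("bertopic", "numpy"):
    print(name, installed_version(name))
```

A None result means the package is not installed in the active environment.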