MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Merge Multiple Fitted Models - Error on Viewing Hierarchical Topics #1682

Open dhammo2 opened 7 months ago

dhammo2 commented 7 months ago

Problem Overview

When a model is created with merged_model = BERTopic.merge_models([topic_model_1, topic_model_2]), the merged topic model can no longer be visualised as a hierarchical topic model, even though each constituent model can be.

Error Code

hierarchical_topics = merged_model.hierarchical_topics(docs, linkage_function = linkage_function)
Traceback (most recent call last):

  Cell In[14], line 1
    hierarchical_topics = merged_model.hierarchical_topics(docs, linkage_function = linkage_function)

  File ~/anaconda3/envs/tf/lib/python3.9/site-packages/bertopic/_bertopic.py:975 in hierarchical_topics
    embeddings = self.c_tf_idf_[self._outliers:]

TypeError: 'NoneType' object is not subscriptable

Minimum Working Example

from umap import UMAP
from bertopic import BERTopic
from scipy.cluster import hierarchy as sch  # needed for sch.linkage below
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(docs[0:1000])
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(docs[1000:2000])

# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2])

# Linkage function used to build the topic hierarchy
linkage_function = lambda x: sch.linkage(x, 'ward', optimal_ordering=True)

# Use the merged model to extract hierarchies
hierarchical_topics = merged_model.hierarchical_topics(docs, linkage_function = linkage_function)

# Visualise hierarchies
fig = merged_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html("merged_model.html")
MaartenGr commented 7 months ago

That's correct. The c-TF-IDF representations cannot be merged because two models often have entirely different vocabularies. As a workaround, it might be worth checking whether the embeddings can be used instead. I haven't tried it yet, but you could try this: topic_model.c_tf_idf_ = topic_model.topic_embeddings_. It's not the cleanest solution, but it might work for now.
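
To illustrate the vocabulary point, here is a minimal sketch in plain Python. It does not use BERTopic's own vectorizer, and the helper name bag_of_words and the toy documents are made up for illustration. It only shows that two corpora produce count matrices whose columns refer to different words, so their rows cannot simply be stacked.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (sorted vocabulary, one row of word counts per document)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[word] for word in vocab] for c in counts]
    return vocab, rows


# Two toy corpora standing in for the data behind two separately fitted models
vocab_1, rows_1 = bag_of_words(["the cat sat", "the dog barked"])
vocab_2, rows_2 = bag_of_words(["stocks fell sharply", "the market rallied"])

# The matrices have different widths, and column j means a different word in each
shared = sorted(set(vocab_1) & set(vocab_2))
print(len(vocab_1), len(vocab_2), shared)
```

Topic embeddings do not have this problem as long as both models use the same embedding model, since every topic vector then lives in the same space.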

rashigupta8496 commented 1 month ago

I am getting a similar error when running merged_model.hierarchical_topics(docs), where merged_model is a merged model as above. I applied the following suggestion, but it now throws a different error.

merged_model.c_tf_idf_ = merged_model.topic_embeddings_

hierarchical_topics = merged_model.hierarchical_topics(docs)

Error:

ValueError                                Traceback (most recent call last)
Cell In[31], line 1
----> 1 hierarchical_topics = merged_model.hierarchical_topics(docs)
File /mnt/xarfuse/uid-564347/9d128abc-seed-nspid4026531836_cgpid20335105-ns-4026531841/bertopic/_bertopic.py:983, in BERTopic.hierarchical_topics(self, docs, linkage_function, distance_function)
    980 Z = linkage_function(X)
    982 # Calculate basic bag-of-words to be iteratively merged later
--> 983 documents = pd.DataFrame({"Document": docs,
    984                           "ID": range(len(docs)),
    985                           "Topic": self.topics_})
    986 documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
    987 documents_per_topic = documents_per_topic.loc[documents_per_topic.Topic != -1, :]
File /mnt/xarfuse/uid-564347/9d128abc-seed-nspid4026531836_cgpid20335105-ns-4026531841/pandas/core/frame.py:709, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    703     mgr = self._init_mgr(
    704         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    705     )
    707 elif isinstance(data, dict):
    708     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 709     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    710 elif isinstance(data, ma.MaskedArray):
    711     from numpy.ma import mrecords
File /mnt/xarfuse/uid-564347/9d128abc-seed-nspid4026531836_cgpid20335105-ns-4026531841/pandas/core/internals/construction.py:481, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    477     else:
    478         # dtype check to exclude e.g. range objects, scalars
    479         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 481 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File /mnt/xarfuse/uid-564347/9d128abc-seed-nspid4026531836_cgpid20335105-ns-4026531841/pandas/core/internals/construction.py:115, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    112 if verify_integrity:
    113     # figure out the index, if necessary
    114     if index is None:
--> 115         index = _extract_index(arrays)
    116     else:
    117         index = ensure_index(index)
File /mnt/xarfuse/uid-564347/9d128abc-seed-nspid4026531836_cgpid20335105-ns-4026531841/pandas/core/internals/construction.py:655, in _extract_index(data)
    653 lengths = list(set(raw_lengths))
    654 if len(lengths) > 1:
--> 655     raise ValueError("All arrays must be of the same length")
    657 if have_dicts:
    658     raise ValueError(
    659         "Mixing dicts with non-Series may lead to ambiguous ordering."
    660     )
ValueError: All arrays must be of the same length
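
The frames above show pandas rejecting a DataFrame built from docs, range(len(docs)) and self.topics_, which happens when docs and self.topics_ differ in length. As a rough diagnostic, a check along these lines could be run before calling hierarchical_topics. The helper name check_docs_match_topics is hypothetical and not part of BERTopic.

```python
def check_docs_match_topics(docs, topics):
    """Raise early, with a clearer message, if docs and topics differ in length."""
    if len(docs) != len(topics):
        raise ValueError(
            f"Got {len(docs)} documents but {len(topics)} topic assignments; "
            "pass the same documents the topic assignments refer to."
        )
    return True


# Toy data: three documents but only two topic assignments
docs_example = ["doc a", "doc b", "doc c"]
topics_example = [0, 1]

try:
    check_docs_match_topics(docs_example, topics_example)
except ValueError as err:
    print(err)
```

With a fitted model, the call would be check_docs_match_topics(docs, merged_model.topics_).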
MaartenGr commented 1 month ago

@rashigupta8496 I believe you get that error for the same reason. Instead, you can use the use_ctfidf parameter from a PR that was just pushed to the main branch, setting it to False so that the topic embeddings are used instead of the c-TF-IDF representations. You could use it as follows:

hierarchical_topics = merged_model.hierarchical_topics(docs, use_ctfidf=False)

Also, note that running the following is not advised: c_tf_idf_ is expected to be a sparse matrix and topic_embeddings_ is not, which can cause a range of issues:

merged_model.c_tf_idf_ = merged_model.topic_embeddings_
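
A small sketch of the sparse-versus-dense mismatch, using stand-in arrays rather than a real fitted model. The values and shapes are invented, and it assumes c_tf_idf_ is a SciPy sparse matrix and topic_embeddings_ a NumPy array.

```python
import numpy as np
from scipy import sparse

# Stand-ins for the two attributes (made-up values, not from a real model)
c_tf_idf = sparse.csr_matrix(np.array([[0.0, 1.2, 0.0], [0.7, 0.0, 0.0]]))
topic_embeddings = np.random.default_rng(0).normal(size=(2, 4))

# Code written for a sparse matrix may rely on sparse-only methods
print(sparse.issparse(c_tf_idf), sparse.issparse(topic_embeddings))
print(hasattr(c_tf_idf, "toarray"), hasattr(topic_embeddings, "toarray"))
```

Any code path that calls a sparse-only method such as toarray() on c_tf_idf_ would fail after the assignment, which is one way the swap can go wrong.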