Open dhammo2 opened 7 months ago
That's correct. The c-TF-IDF representations cannot be merged, since two models often have entirely different vocabularies. As a solution, it might be worthwhile to check whether we can use the embeddings instead. I haven't tried it yet, but you could try this:
topic_model.c_tf_idf_ = topic_model.topic_embeddings_
It's not the cleanest solution, but it might work for now.
I am also getting a similar error when running merged_model.hierarchical_topics(docs), where merged_model is the merged model as above. I added the suggestion below, but it now throws a different error.
merged_model.c_tf_idf_ = merged_model.topic_embeddings_
hierarchical_topics = merged_model.hierarchical_topics(docs)
Error:
ValueError Traceback (most recent call last)
Cell In[31], line 1
----> 1 hierarchical_topics = merged_model.hierarchical_topics(docs)
File /mnt/xarfuse/uid-564347/9d128abc-seed-nspid4026531836_cgpid20335105-ns-4026531841/bertopic/_bertopic.py:983, in BERTopic.hierarchical_topics(self, docs, linkage_function, distance_function)
980 Z = linkage_function(X)
982 # Calculate basic bag-of-words to be iteratively merged later
--> 983 documents = pd.DataFrame({"Document": docs,
984 "ID": range(len(docs)),
985 "Topic": self.topics_})
986 documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
987 documents_per_topic = documents_per_topic.loc[documents_per_topic.Topic != -1, :]
File /mnt/xarfuse/uid-564347/9d128abc-seed-nspid4026531836_cgpid20335105-ns-4026531841/pandas/core/frame.py:709, in DataFrame.__init__(self, data, index, columns, dtype, copy)
703 mgr = self._init_mgr(
704 data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
705 )
707 elif isinstance(data, dict):
708 # GH#38939 de facto copy defaults to False only in non-dict cases
--> 709 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
710 elif isinstance(data, ma.MaskedArray):
711 from numpy.ma import mrecords
File /mnt/xarfuse/uid-564347/9d128abc-seed-nspid4026531836_cgpid20335105-ns-4026531841/pandas/core/internals/construction.py:481, in dict_to_mgr(data, index, columns, dtype, typ, copy)
477 else:
478 # dtype check to exclude e.g. range objects, scalars
479 arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 481 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File /mnt/xarfuse/uid-564347/9d128abc-seed-nspid4026531836_cgpid20335105-ns-4026531841/pandas/core/internals/construction.py:115, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
112 if verify_integrity:
113 # figure out the index, if necessary
114 if index is None:
--> 115 index = _extract_index(arrays)
116 else:
117 index = ensure_index(index)
File /mnt/xarfuse/uid-564347/9d128abc-seed-nspid4026531836_cgpid20335105-ns-4026531841/pandas/core/internals/construction.py:655, in _extract_index(data)
653 lengths = list(set(raw_lengths))
654 if len(lengths) > 1:
--> 655 raise ValueError("All arrays must be of the same length")
657 if have_dicts:
658 raise ValueError(
659 "Mixing dicts with non-Series may lead to ambiguous ordering."
660 )
ValueError: All arrays must be of the same length
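The traceback shows pandas failing while building a DataFrame from docs, range(len(docs)) and self.topics_, which means docs and the model's topics_ had different lengths at that point. A minimal sketch of a check one could run before calling hierarchical_topics follows; the helper name describe_length_mismatch is hypothetical, the lists are toy stand-ins for docs and merged_model.topics_, and the check only reports the mismatch, it does not explain why the merged model ended up with it.

```python
def describe_length_mismatch(docs, topics):
    """Return None when the lengths agree, otherwise a short description.

    `docs` stands in for the documents passed to hierarchical_topics and
    `topics` for the model's topics_ attribute (assumed to be list-like).
    """
    n_docs, n_topics = len(docs), len(topics)
    if n_docs == n_topics:
        return None
    return f"len(docs)={n_docs} but len(topics_)={n_topics}"


# Toy stand-ins; with a real model one would pass merged_model.topics_.
docs = ["first document", "second document", "third document"]
topics = [0, 1]
print(describe_length_mismatch(docs, topics))
```

If the helper returns a message, the DataFrame construction in the traceback above would be expected to raise the same ValueError.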
@rashigupta8496 I believe you get that error for the same reason. Instead, you can use the use_ctfidf=False parameter from a PR that was just pushed to the main branch. You could use it as follows:
hierarchical_topics = merged_model.hierarchical_topics(docs, use_ctfidf=False)
Also, note that running the following is not advised: one attribute is expected to be a sparse matrix and the other is not, which can result in a bunch of issues:
merged_model.c_tf_idf_ = merged_model.topic_embeddings_
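As a rough sketch of the advice above, the hierarchy can be requested with use_ctfidf=False while leaving c_tf_idf_ untouched. The wrapper name hierarchy_without_ctfidf is hypothetical, and the code assumes an installed BERTopic build whose hierarchical_topics accepts the use_ctfidf flag mentioned in this thread.

```python
def hierarchy_without_ctfidf(model, docs):
    """Hypothetical helper: build the topic hierarchy without c-TF-IDF.

    `model` is assumed to expose hierarchical_topics(docs, use_ctfidf=...),
    as in the call quoted in this thread. Builds that predate the flag would
    raise a TypeError here.
    """
    # Note that model.c_tf_idf_ is deliberately not reassigned.
    return model.hierarchical_topics(docs, use_ctfidf=False)


# Intended usage with a merged model (not executed here):
# hierarchical_topics = hierarchy_without_ctfidf(merged_model, docs)
```

Keeping the sparse c_tf_idf_ attribute as it is avoids mixing sparse and dense matrices in later calls that still rely on it.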
Problem Overview
When using the
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2])
command, the produced merged topic model cannot be visualised as a hierarchical topic model anymore, even if the constituent models can be.
Error Code
Minimum Working Example