MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Zero-Shot Topic Modelling and Topics Over Time #1999

Open LopezBanos opened 4 months ago

LopezBanos commented 4 months ago

I created a zero-shot model with some topics that I specified in advance and some that the model found on its own.

# BERTopic Model
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

topic_model = BERTopic(
    embedding_model="thenlper/gte-small", # https://huggingface.co/thenlper/gte-small
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.80,
    representation_model=KeyBERTInspired()
)
# Results
topics, probs = topic_model.fit_transform(docs)

When I try to plot the topics over time, I get an error:

# Topics Over Time (docs was a pd.Series that I convert to a list; both docs.to_list() and timestamps have length 161)
topics_over_time = topic_model.topics_over_time(docs.to_list(), timestamps) # Error Happens in this line
topic_model.visualize_topics_over_time(topics_over_time, topics=[0,1,2,3,4,5,6,7,8,9])

The error I get is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 2
      1 # Topics Over Time
----> 2 topics_over_time = topic_model.topics_over_time(docs.to_list(), timestamps)
      3 model.visualize_topics_over_time(topics_over_time, topics=[0,1,2,3,4,5,6,7,8,9])

File ~/.conda/envs/BerTopicOctis/lib/python3.10/site-packages/bertopic/_bertopic.py:768, in BERTopic.topics_over_time(self, docs, timestamps, topics, nr_bins, datetime_format, evolution_tuning, global_tuning)
    766 selected_topics = topics if topics else self.topics_
    767 documents = pd.DataFrame({"Document": docs, "Topic": selected_topics, "Timestamps": timestamps})
--> 768 global_c_tf_idf = normalize(self.c_tf_idf_, axis=1, norm='l1', copy=False)
    770 all_topics = sorted(list(documents.Topic.unique()))
    771 all_topics_indices = {topic: index for index, topic in enumerate(all_topics)}

File ~/.conda/envs/BerTopicOctis/lib/python3.10/site-packages/sklearn/preprocessing/_data.py:1786, in normalize(X, norm, axis, copy, return_norm)
   1783 else:
   1784     raise ValueError("'%d' is not a supported axis" % axis)
-> 1786 X = check_array(
   1787     X,
   1788     accept_sparse=sparse_format,
   1789     copy=copy,
   1790     estimator="the normalize function",
   1791     dtype=FLOAT_DTYPES,
   1792 )
   1793 if axis == 0:
   1794     X = X.T

File ~/.conda/envs/BerTopicOctis/lib/python3.10/site-packages/sklearn/utils/validation.py:867, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    864 if ensure_2d:
    865     # If input is scalar raise error
    866     if array.ndim == 0:
--> 867         raise ValueError(
    868             "Expected 2D array, got scalar array instead:\narray={}.\n"
    869             "Reshape your data either using array.reshape(-1, 1) if "
    870             "your data has a single feature or array.reshape(1, -1) "
    871             "if it contains a single sample.".format(array)
    872         )
    873     # If input is 1D raise error
    874     if array.ndim == 1:

ValueError: Expected 2D array, got scalar array instead:
array=nan.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

MaartenGr commented 4 months ago

Sorry for this! Zero-shot topic modeling cannot currently be combined with topics over time because the c-TF-IDF matrix is missing. Instead, you can call .update_topics first so that the underlying c-TF-IDF matrices are created. After that, it should work.
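
The workaround described above can be sketched roughly as follows. This is a minimal sketch, not an official recipe: the helper name `topics_over_time_after_update` and its `update_kwargs` parameter are hypothetical, and it assumes `topic_model` is the fitted zero-shot model from the original post, with `docs` and `timestamps` of equal length. It only relies on `update_topics` and `topics_over_time`, the two methods named in this thread.

```python
from typing import Any, Dict, Optional, Sequence


def topics_over_time_after_update(
    topic_model: Any,
    docs: Sequence[str],
    timestamps: Sequence[Any],
    update_kwargs: Optional[Dict[str, Any]] = None,
    **tot_kwargs: Any,
):
    """Rebuild c-TF-IDF via update_topics, then compute topics over time.

    Hypothetical helper: `topic_model` is assumed to be an already fitted
    BERTopic model (for example one fitted with zeroshot_topic_list).
    """
    docs = list(docs)              # accepts a pd.Series as well as a list
    timestamps = list(timestamps)
    if len(docs) != len(timestamps):
        raise ValueError("docs and timestamps must have the same length")

    # Recreate the underlying c-TF-IDF matrices that the zero-shot fit left empty.
    # Extra arguments (e.g. representation_model=...) can be passed via update_kwargs.
    topic_model.update_topics(docs, **(update_kwargs or {}))

    # With c-TF-IDF in place, this call should no longer hit normalize(nan).
    return topic_model.topics_over_time(docs, timestamps, **tot_kwargs)


# Usage, with the variables from the original post:
# topics_over_time = topics_over_time_after_update(topic_model, docs, timestamps)
# topic_model.visualize_topics_over_time(topics_over_time, topics=[0, 1, 2, 3, 4])
```

Note that `update_topics` recomputes the topic representations, so it may be worth passing the same `representation_model` (here `KeyBERTInspired()`) through `update_kwargs` if the original keywords should be kept in the same style.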