MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Issues with visualizations on loaded models. #2032

Open andreia-sandata opened 4 weeks ago

andreia-sandata commented 4 weeks ago

I am storing my model in this manner:

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("data/topic_model", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

Then, I load it like this:

topic_model = BERTopic.load("/tmp/topic_model")

Then I want to use the original set of documents that the model was fitted on to visualize these topics over time.

topics_over_time = main_model.topics_over_time(
    ticket_training_set.description,
    ticket_training_set.created_at,
    nr_bins=30
)
main_model.visualize_topics_over_time(topics_over_time)

I get the following error:

ValueError                                Traceback (most recent call last)
Cell In[35], line 1
----> 1 topics_over_time = main_model.topics_over_time(
      2     ticket_training_set.description,
      3     ticket_training_set.created_at,
      4     nr_bins=30
      5 )
      6 main_model.visualize_topics_over_time(topics_over_time)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/bertopic/_bertopic.py:820, in BERTopic.topics_over_time(self, docs, timestamps, topics, nr_bins, datetime_format, evolution_tuning, global_tuning)
    818 if global_tuning:
    819     selected_topics = [all_topics_indices[topic] for topic in documents_per_topic.Topic.values]
--> 820     c_tf_idf = (global_c_tf_idf[selected_topics] + c_tf_idf) / 2.0
    822 # Extract the words per topic
    823 words_per_topic = self._extract_words_per_topic(words, selection, c_tf_idf, calculate_aspects=False)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/scipy/sparse/_index.py:77, in IndexMixin.__getitem__(self, key)
     75         return self._get_arrayXint(row, col)
     76     elif isinstance(col, slice):
---> 77         return self._get_arrayXslice(row, col)
     78 else:  # row.ndim == 2
     79     if isinstance(col, INT_TYPES):

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/scipy/sparse/_csr.py:216, in _csr_base._get_arrayXslice(self, row, col)
    214     col = np.arange(*col.indices(self.shape[1]))
    215     return self._get_arrayXarray(row, col)
--> 216 return self._major_index_fancy(row)._get_submatrix(minor=col)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/scipy/sparse/_compressed.py:711, in _cs_matrix._major_index_fancy(self, idx)
    708 np.cumsum(row_nnz, out=res_indptr[1:])
    710 nnz = res_indptr[-1]
--> 711 res_indices = np.empty(nnz, dtype=idx_dtype)
    712 res_data = np.empty(nnz, dtype=self.dtype)
    713 csr_row_index(M, indices, self.indptr, self.indices, self.data,
    714               res_indices, res_data)

ValueError: negative dimensions are not allowed

The same error happens on other visualizations, such as topic_model.visualize_hierarchy().

What am I missing here? Thank you.

MaartenGr commented 3 weeks ago

Which version of BERTopic are you using? Could you also share your training code?
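
A minimal sketch of one way to collect the version information requested above. It uses only the standard library and assumes the packages were installed under the distribution names `bertopic`, `scipy`, `numpy`, and `safetensors`. The helper `installed_version` is a hypothetical name introduced here for illustration; it is not part of BERTopic.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or a placeholder if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "not installed"


# Assumed distribution names; adjust if the environment uses different ones.
for name in ("bertopic", "scipy", "numpy", "safetensors"):
    print(f"{name}: {installed_version(name)}")
```

Run this in the same environment (here, the same notebook kernel) that raised the error, since the traceback paths point at a specific conda environment.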