MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

IndexError on `visualize_heatmap()` when all outliers are removed by `reduce_outliers()` and `update_topics()` #1455

Closed: Aratako closed this issue 1 year ago

Aratako commented 1 year ago

I encountered an IndexError when visualizing a heatmap. Here's the sequence of steps I took and the resulting error. I instantiated a BERTopic model and ran fit_transform() as follows:

topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer,
    calculate_probabilities=True,
    verbose=True,
    nr_topics=20,  # reduce the number of topics to 20 after fitting
)
topics, probs = topic_model.fit_transform(text_data, embeddings=embeddings)

Subsequently, I executed the following code:

new_topics = topic_model.reduce_outliers(text_data, topics, strategy="embeddings", embeddings=embeddings)
topic_model.update_topics(text_data, topics=new_topics, top_n_words=10)

After that, when I used topic_model.visualize_heatmap(), I encountered this IndexError:

File "<my directory>\bertopic\plotting\_heatmap.py", line 94, in visualize_heatmap
    embeddings = embeddings[indices]
IndexError: index 18 is out of bounds for axis 0 with size 18

I suspect the problem originates from topic_model._outliers. Before reducing outliers, the index was off by one compared to the topic number, as shown below (where the first column represents the index and the second column represents the topic number):

0   -1
1    0
2    1
3    2
... ...
19   18

After outlier reduction, it appears that all outliers were removed, so the index now matches the topic number, as shown below:

0   0
1   1
2   2
3   3
... ...
18   18
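
(For reference, a mapping like the two tables above can be inspected with something along these lines; the exact snippet used for this report is not shown:)

topic_model.get_topic_info()["Topic"]  # DataFrame index vs. topic number, as tabulated above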

However, topic_model._outliers returned 1 in both cases. In the visualize_heatmap() method of _heatmap.py, the embeddings array is sliced using topic_model._outliers, like this: embeddings = np.array(topic_model.topic_embeddings_)[topic_model._outliers:]

Both cases produce an indices array like [ 0 1 2 3 ... 17 18], but the shape of embeddings depends on the slice. With outliers present, embeddings.shape is (19, 768); without them, the slicing leaves (18, 768). This discrepancy causes the IndexError in embeddings = embeddings[indices].
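
A minimal NumPy sketch of that mismatch, using the shapes reported above (illustrative only):

import numpy as np

topic_embeddings = np.zeros((19, 768))    # 19 topics remain, none of them outliers
outliers = 1                              # stale value; should be 0 after full reduction
embeddings = topic_embeddings[outliers:]  # shape (18, 768): the first topic row is wrongly dropped
indices = np.arange(19)                   # topics 0..18 map to indices 0..18
embeddings[indices]                       # IndexError: index 18 is out of bounds for axis 0 with size 18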

I believe topic_model._outliers should be updated to 0 when all outliers are removed. A potential fix could involve adding the following line to the update_topics() method in _bertopic.py:

self._outliers = 1 if -1 in set(topics) else 0

I've tested this fix for my specific error, and it seemed to resolve the issue. However, I haven't tested other parts of the library, so I'm not sure it is the right solution. If it's appropriate, I'm willing to submit a PR.
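
For what it's worth, here is a rough sketch of the end-to-end check I have in mind (variable names follow the snippets above; the assertion mirrors the proposed line):

new_topics = topic_model.reduce_outliers(text_data, topics, strategy="embeddings", embeddings=embeddings)
topic_model.update_topics(text_data, topics=new_topics, top_n_words=10)
assert topic_model._outliers == (1 if -1 in set(new_topics) else 0)  # 0 once all outliers are gone
topic_model.visualize_heatmap()  # should no longer raise an IndexError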

MaartenGr commented 1 year ago

Although I think the solution you propose makes sense, I am not entirely sure how that would affect other parts of the topic model. There are quite a few places where self._outliers is being used and we would have to be very confident that no mention of outliers whatsoever is found anywhere after .update_topics.

I think this needs to be extensively tested across a couple of use cases where self._outliers is being used in order to make sure that this update does not negatively influence other functions. What do you think?

Aratako commented 1 year ago

Yes, I agree with you. However, self._outliers is originally set to 0 when there are no outliers (for instance, when using k-means for clustering), so I believe the possibility of errors arising from this code change is relatively low. Of course, testing is still necessary. In line 3234 of _bertopic.py, in the _cluster_embeddings() method, the same code is implemented with a comment: https://github.com/MaartenGr/BERTopic/blob/37064e231defceb998ef2a778cef46fc769dbf00/bertopic/_bertopic.py#L3231-L3234
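
The referenced lines derive the flag from the clusterer's labels, roughly like this (paraphrased from the linked source, not an exact quote):

# Some algorithms never produce -1 labels, so track whether outliers exist
self._outliers = 1 if -1 in set(labels) else 0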

MaartenGr commented 1 year ago

> Yes, I agree with you. However, self._outliers is originally set to 0 when there are no outliers (for instance, when using k-means for clustering), so I believe the possibility of errors arising from this code change is relatively low. Of course, testing is still necessary.

I am not entirely sure about this. It is not so much about whether self._outliers is 0, but whether there are other places where outliers are found. For example, self.topics_ might contain outliers, or self.topic_embeddings_ might still include the outlier topic. The same applies to self.c_tf_idf_ and a bunch of other attributes.

> In line 3234 of _bertopic.py, in the _cluster_embeddings() method, the same code is implemented with a comment:

Here, it is safe to do this because it is the very first place where outliers can actually be generated. After that, attributes are updated based on whether outliers exist. Those updated attributes should be checked if we are going to set self._outliers = 0 manually afterwards. That is why it is so important to check all attributes and see whether any of them still contain some mention of an outlier topic after outliers have been fully reduced.
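
To make that concrete, a rough sketch of the kind of attribute audit meant here (the attribute names come from this thread; which attributes actually need checking is exactly the open question):

import numpy as np

assert -1 not in set(topic_model.topics_)      # no document should still be labeled -1
n_topics = len(topic_model.get_topic_info())   # includes a row for -1 only if it still exists
assert np.array(topic_model.topic_embeddings_).shape[0] == n_topics
assert topic_model.c_tf_idf_.shape[0] == n_topics  # one c-TF-IDF row per remaining topic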

Aratako commented 1 year ago

I understand your concerns. Indeed, it seems necessary to carefully check the impact the modification would have. I ran the unit tests after making this change, and no errors were found. However, I am also unsure what internal effects this fix might have. Unfortunately, since I am not familiar with this library's internals, further checks are difficult for me.

By the way, is it unusual for all outliers to be eliminated by reduce_outliers()? I reduced outliers for two datasets using strategy="embeddings", and in both cases all the outliers were gone. If this is intended behavior and occurs frequently, addressing this issue could be crucial.

MaartenGr commented 1 year ago

> I ran the unit tests after making this change, and no errors were found.

Unfortunately, I do not think the unit tests cover that specific case, so it can easily be missed. Feel free to open the PR, and if I can find some time in the upcoming weeks, I will manually check what needs to be checked.

> By the way, is it unusual for all outliers to be eliminated by reduce_outliers()? I reduced outliers for two datasets using strategy="embeddings", and in both cases all the outliers were gone. If this is intended behavior and occurs frequently, addressing this issue could be crucial.

Kind of a poor interface on my part. I figured that users would manually tweak the threshold parameter to reduce some outliers but not all of them. The idea was that if users wanted to remove all outliers, they would not use HDBSCAN at all, since there are other clustering algorithms out there that do not assume outliers and might perform better.
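
For illustration, both options look roughly like this (the threshold value is made up; sensible ranges differ per strategy, which is exactly the problem noted below):

# Keep some outliers by raising the similarity threshold
new_topics = topic_model.reduce_outliers(
    text_data, topics, strategy="embeddings", embeddings=embeddings, threshold=0.3
)

# Or sidestep outliers entirely with a clusterer that does not produce them
from sklearn.cluster import KMeans
topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=20))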

So yes, this should definitely be addressed, since setting a default threshold parameter is not feasible due to the different ranges of similarity scores across strategies. If you want, a PR that sets self._outliers to 0 after all outliers have been removed through .update_topics would be appreciated. I will then go through some use cases manually and check some variables to see if everything works out. It can take a while though 😅

Aratako commented 1 year ago

Thank you. I have created a Pull Request for this fix. (#1466)