Closed. Aratako closed this issue 1 year ago.
Although I think the solution you propose makes sense, I am not entirely sure how it would affect other parts of the topic model. There are quite a few places where self._outliers is being used, and we would have to be very confident that no mention of outliers whatsoever is found anywhere after .update_topics.

I think this needs to be tested extensively across a couple of use cases where self._outliers is being used, in order to make sure that this update does not negatively influence other functions. What do you think?
Yes, I agree with you. However, self._outliers is originally set to 0 when there are no outliers (for instance, when using k-means for clustering), so I believe the possibility of errors arising from this code change is relatively low. Of course, testing is still necessary.
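For context, an outlier-free setup like that can be configured roughly as follows (the clusterer choice and parameters here are only an illustration):

```python
from sklearn.cluster import KMeans
from bertopic import BERTopic

# k-means never produces the -1 outlier label, so after fitting,
# the model's internal _outliers offset is 0 from the start
topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=20))
```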
In line 3234 of _bertopic.py, in the _cluster_embeddings() method, the same code is implemented with a comment:
https://github.com/MaartenGr/BERTopic/blob/37064e231defceb998ef2a778cef46fc769dbf00/bertopic/_bertopic.py#L3231-L3234
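That spot boils down to a pattern like this (paraphrased for illustration, not an exact copy of the linked lines):

```python
# _outliers acts as a row/index offset: 1 if the clusterer assigned the
# -1 outlier label to any document, 0 otherwise (e.g. with k-means)
self._outliers = 1 if -1 in set(predictions) else 0
```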
> Yes, I agree with you. However, self._outliers is originally set to 0 when there are no outliers (for instance, when using k-means for clustering), so I believe the possibility of errors arising from this code change is relatively low. Of course, testing is still necessary.
I am not entirely sure about this. It is not so much about whether self._outliers is 0, but whether there are other places where outliers are found. For example, self.topics_ might contain outliers, or self.topic_embeddings_ might also include the outlier topic. The same applies to self.c_tf_idf_ and a bunch of other attributes.
> In line 3234 of _bertopic.py, in the _cluster_embeddings() method, the same code is implemented with a comment:
Here, it is safe to do this because this is the very first place where you will actually find outliers being generated. After that, attributes are updated based on whether outliers exist or not. These updated attributes should be checked if we are going to set self._outliers = 0 manually afterwards. That is why it is so important to check all attributes and see whether any of them still contain some mention of an outlier topic after having fully reduced outliers.
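To make that concrete, a manual check after fully reducing outliers could look something like this (a sketch only; it assumes the new assignments were already pushed back via update_topics):

```python
# none of these should still refer to an outlier topic once
# self._outliers has been reset to 0
assert -1 not in set(topic_model.topics_)
assert -1 not in topic_model.topic_sizes_
assert len(topic_model.topic_embeddings_) == len(topic_model.get_topic_info())
assert topic_model.c_tf_idf_.shape[0] == len(topic_model.get_topic_info())
```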
I understand your concerns. Indeed, it seems necessary to carefully check the impact the modification would have. I have performed unit testing after making this update, and no errors were found. However, I am also unaware of the internal effects this fix might have. Unfortunately, since I am not familiar with this library, further checks seem difficult for me.

By the way, is it unusual behavior for all outliers to be eliminated by reduce_outliers()? I have reduced outliers for two datasets using the strategy="embeddings", and in both cases, all the outliers were gone. If this is intended behavior and occurs frequently, addressing this issue could be crucial.
> I have performed unit testing after making this update, and no errors were found.
Unfortunately, I do not think the unit tests cover that specific case, so it can easily be missed. Feel free to open the PR, and if I can find some time in the upcoming weeks, I can manually check what needs to be checked.
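If it helps, the kind of regression test that would cover it could look roughly like this (fixture names and the fitting setup are assumptions):

```python
def test_outliers_flag_reset_after_full_reduction(topic_model, docs):
    # reduce every outlier and push the new assignments back into the model
    new_topics = topic_model.reduce_outliers(docs, topic_model.topics_, strategy="embeddings")
    topic_model.update_topics(docs, topics=new_topics)

    # once no document is labelled -1 anymore, the internal offset should be 0
    if -1 not in set(new_topics):
        assert topic_model._outliers == 0
        topic_model.visualize_heatmap()  # previously raised an IndexError
```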
> By the way, is it unusual behavior for all outliers to be eliminated by reduce_outliers()? I have reduced outliers for two datasets using the strategy="embeddings", and in both cases, all the outliers were gone. If this is intended behavior and occurs frequently, addressing this issue could be crucial.
Kind of a poor interface on my part. I figured that users would manually tweak the threshold parameter to reduce some outliers but not all. The idea was that if users wanted to remove all outliers, they would not use HDBSCAN at all, since there are other clustering algorithms out there that do not assume outliers and might perform better.

So yes, this should definitely be addressed, since setting a default threshold parameter is not feasible due to the different ranges of similarity scores across the strategies. If you want, a PR with the fix of setting self._outliers to 0 after removing all outliers through .update_topics would be appreciated. I will then go through some use cases manually and check some variables to see if everything works out. It can take a while though 😅
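For reference, the intended partial reduction looks roughly like this (the threshold value is just an illustration; useful ranges differ per strategy):

```python
# only reassign outlier documents whose similarity to the closest topic
# exceeds the threshold; the rest stay labelled -1
new_topics = topic_model.reduce_outliers(
    docs, topics, strategy="embeddings", threshold=0.3
)
```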
Thank you. I have created a Pull Request for this fix. (#1466)
I encountered an IndexError with the heatmap visualization. Here is the sequence of steps I took and the resulting error. I instantiated a BERTopic model and ran the fit_transform() method as follows:
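(A minimal sketch of this step; the documents and settings below are placeholders, not the original snippet.)

```python
from bertopic import BERTopic

docs = load_documents()  # hypothetical helper returning a list of strings

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```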
Subsequently, I executed the following code:
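(Again a sketch of the documented reduce-then-update pattern rather than the exact original cell.)

```python
# reassign every outlier document to its closest topic based on embeddings
new_topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings")

# push the new assignments back into the model; after this call there may be
# no -1 labels left at all, yet topic_model._outliers still stays at 1
topic_model.update_topics(docs, topics=new_topics)
```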
After that, when I used topic_model.visualize_heatmap(), I encountered this IndexError:

I suspect the problem originates from topic_model._outliers. Before reducing outliers, the index was off by one compared to the topic number, as shown below (where the first column represents the index and the second column represents the topic number):

After outlier reduction, it appears that all outliers were removed, resulting in the same index and topic number, like below:
However, topic_model._outliers returned 1 for both patterns. In the visualize_heatmap() method of _heatmap.py, the embeddings array is sliced using topic_model._outliers, like embeddings = np.array(topic_model.topic_embeddings_)[topic_model._outliers:]. Both patterns produce an indices array like [ 0 1 2 3 ... 17 18]. The shape of embeddings varies depending on the slice. With outliers, embeddings.shape is (19, 768), but without outliers it is (18, 768) because of slicing. This discrepancy causes the IndexError in embeddings = embeddings[indices].
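The mismatch can be reproduced in isolation with plain NumPy, using the shapes from the description above:

```python
import numpy as np

# after full outlier reduction there are 19 topics (0..18),
# so topic_embeddings_ has 19 rows and indices covers all of them
topic_embeddings = np.random.rand(19, 768)
indices = np.arange(19)                # [0, 1, ..., 18]

# ...but _outliers is still 1, so visualize_heatmap slices off the first row
embeddings = topic_embeddings[1:]      # shape (18, 768)

embeddings[indices]                    # IndexError: index 18 is out of bounds for axis 0 with size 18
```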
I believe topic_model._outliers should be updated to 0 when all outliers are removed. A potential fix could involve adding the following line to the update_topics() method in _bertopic.py:
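A guard along these lines captures the idea (the exact name of the variable holding the updated assignments inside update_topics() is an assumption):

```python
# recompute the offset from the updated assignments instead of keeping the
# value determined during the initial clustering
self._outliers = 1 if -1 in set(topics) else 0
```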
I've tested this fix for my specific error and it seemed to resolve the issue. However, I haven't tested other parts of the library, so I'm not sure this is the right solution. If it's appropriate, I'm willing to submit a PR.