MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Suggestion: Clip default hierarchy distance function cosine similarity #1573

Open sdehm opened 1 year ago

sdehm commented 1 year ago

It appears that the sklearn cosine_similarity function can sometimes output values slightly above 1. This has been reported and is under discussion in scikit-learn (https://github.com/scikit-learn/scikit-learn/issues/18122), and it has led to several issues on this repo (for example https://github.com/MaartenGr/BERTopic/issues/1137, https://github.com/MaartenGr/BERTopic/issues/1418, and https://github.com/MaartenGr/BERTopic/issues/1319), with the suggested workaround of taking the absolute value of the default distance function's output.

Based on the direction the scikit-learn issue seems to be heading, I am proposing a small change to the default distance function: clip the output of cosine_similarity to its expected range of -1 to 1. This would eliminate the problem and likely prevent further issues from being reported.

    if distance_function is None:
        distance_function = lambda x: 1 - cosine_similarity(x)

would become

    if distance_function is None:
        # Clip to [-1, 1] to guard against floating-point overshoot
        # before converting similarity to distance
        distance_function = lambda x: 1 - np.clip(cosine_similarity(x), -1, 1)
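
To illustrate the effect, here is a minimal, self-contained sketch; the 1 + 1e-9 entry is a made-up value that mimics the overshoot reported upstream:

    import numpy as np

    # Simulated similarity matrix; the off-diagonal 1 + 1e-9 mimics the
    # floating-point overshoot described in scikit-learn issue 18122.
    sim = np.array([[1.0, 1.0 + 1e-9],
                    [1.0 + 1e-9, 1.0]])

    print((1 - sim).min())                  # roughly -1e-09: a small negative "distance"
    print((1 - np.clip(sim, -1, 1)).min())  # 0.0: clipping restores a valid distance

A negative entry in what downstream code expects to be a distance matrix is exactly the kind of input that can trip up hierarchical clustering.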

I know that if https://github.com/scikit-learn/scikit-learn/issues/18122 is fixed this workaround will no longer be necessary, and a custom distance function can always be supplied in the meantime. I therefore completely understand if adding this workaround isn't considered necessary right now, but I wanted to propose it anyway to hear opinions on the approach. If it is something that is desired, I'm happy to create the PR.

My understanding of cosine similarity is that the values have no reason to fall outside the -1 to 1 range other than floating-point rounding error, so clipping the output carries no risk of changing legitimate behavior. Is this correct?
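
(For reference: for nonzero vectors a and b, cosine similarity is ⟨a, b⟩ / (‖a‖ ‖b‖), and the Cauchy–Schwarz inequality gives |⟨a, b⟩| ≤ ‖a‖ ‖b‖, so in exact arithmetic the value always lies in [-1, 1]; only floating-point evaluation can push it outside.)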

Thank you!

MaartenGr commented 1 year ago

Thanks for the extensive description! It is indeed a tricky problem, and one that I would hope gets fixed in scikit-learn itself.

> My understanding of cosine similarity is that the values have no reason to fall outside the -1 to 1 range other than floating-point rounding error, so clipping the output carries no risk of changing legitimate behavior. Is this correct?

The one thing that holds me back a bit is that the values might still carry meaning even when they fall outside the range. For example, in one of the issues you mentioned I saw a user with a similarity of 1.1. If another similarity were 1.05, clipping both would give them the same value. If the underlying bug produces values where 1.1 genuinely indicates a higher similarity than 1.05, then clipping would discard that information.
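
A toy example of that information loss, using the 1.1 and 1.05 from above:

    import numpy as np

    sims = np.array([1.1, 1.05])
    print(1 - np.clip(sims, -1, 1))  # [0. 0.]: the ordering between the scores is gone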

I am also not sure where the issue stems from; a value of 1.1 seems far too large for a mere rounding error, which makes me hesitant to clip by default.
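
For a rough sense of scale (assuming float64 throughout):

    import numpy as np

    # Machine epsilon for float64: the size of a single rounding error near 1.0
    print(np.finfo(np.float64).eps)  # ~2.22e-16

Even with error accumulating over a long dot product, pure rounding should stay many orders of magnitude below an overshoot of 0.1.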