Open sdehm opened 1 year ago
Thanks for the extensive description! It is indeed a tricky problem one I would hope were to be fixed in scikit-learn.
My understanding of cosine similarity is that there is no reason for the values to be outside of the -1 to 1 range besides floating point rounding errors so there is no risk to the behavior in clipping the output. Is this correct?
The one thing that holds me back a bit is that the distribution might still have meaning even though they are outside the range. For example, in the issue you mentioned I saw a user had a similarity measure of 1.1. If another similarity measure would be 1.05 then clipping both of them would result in both having the same value. If the underlying bug means that the values of 1.1 versus 1.05 would be actual distributions and 1.1 would indicate a higher score than 1.05, then clipping would remove that information.
I am not sure where the issue stems from seeing as getting a value of 1.1 is quite high if it were only a rounding error which makes me hesitant to clip it by default.
It appears that the sklearn
cosine_similarity
function can sometimes output values just slightly above 1. This has been reported and is under discussion within scikit-learn (https://github.com/scikit-learn/scikit-learn/issues/18122). This has led to a few submitted issues reported on this repo (https://github.com/MaartenGr/BERTopic/issues/1137, https://github.com/MaartenGr/BERTopic/issues/1418, and https://github.com/MaartenGr/BERTopic/issues/1319 for example) with a suggested workaround of using the absolute value of the default distance function.Based on the direction it looks like they may take in the scikit-learn issue to address this I am proposing a slight change to the default distance function to clip the output of the cosine similarity function to the expected value ranges of -1 to 1 which would eliminate this issue and possibly prevent more issues from being reported.
would become
I know that if https://github.com/scikit-learn/scikit-learn/issues/18122 is fixed this will no longer be necessary and we can always supply the custom distance function ourselves. Therefore I totally understand if adding this workaround isn't considered necessary right now but I wanted to propose this regardless to at least hear opinions on this approach. If this is something that is desired I'm happy to create the PR.
My understanding of cosine similarity is that there is no reason for the values to be outside of the -1 to 1 range besides floating point rounding errors so there is no risk to the behavior in clipping the output. Is this correct?
Thank you!