UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
14.87k stars 2.44k forks

Using Manhattan-Distance instead of Cosine-Similarity #1741

Open PhilipMay opened 1 year ago

PhilipMay commented 1 year ago

Hi, I am using MultipleNegativesRankingLoss to train a German Bert model (deepset/gbert-base) on German sentence pairs.

During the training I am evaluating on German stsb data. My observation is this:

2022-11-01 13:47:54 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset in epoch 0 after 33 steps:                                                                      
2022-11-01 13:47:56 - Cosine-Similarity :       Pearson: 0.6679 Spearman: 0.6748                                                                                                            
2022-11-01 13:47:56 - Manhattan-Distance:       Pearson: 0.6705 Spearman: 0.6845                                                                                                            
2022-11-01 13:47:56 - Euclidean-Distance:       Pearson: 0.6688 Spearman: 0.6824                                                                                                            
2022-11-01 13:47:56 - Dot-Product-Similarity:   Pearson: 0.3988 Spearman: 0.3831                                                                                                            
2022-11-01 13:48:13 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset in epoch 0 after 66 steps:                                                                      
2022-11-01 13:48:15 - Cosine-Similarity :       Pearson: 0.6765 Spearman: 0.6822                                                                                                            
2022-11-01 13:48:15 - Manhattan-Distance:       Pearson: 0.6795 Spearman: 0.6918                                                                                                            
2022-11-01 13:48:15 - Euclidean-Distance:       Pearson: 0.6778 Spearman: 0.6900                                                                                                            
2022-11-01 13:48:15 - Dot-Product-Similarity:   Pearson: 0.4126 Spearman: 0.3954                                                                                                            
2022-11-01 13:48:33 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset in epoch 0 after 99 steps:                                                                      
2022-11-01 13:48:35 - Cosine-Similarity :       Pearson: 0.6811 Spearman: 0.6862                                                                                                            
2022-11-01 13:48:35 - Manhattan-Distance:       Pearson: 0.6848 Spearman: 0.6949                                                                                                            
2022-11-01 13:48:35 - Euclidean-Distance:       Pearson: 0.6832 Spearman: 0.6936                                                                                                            
2022-11-01 13:48:35 - Dot-Product-Similarity:   Pearson: 0.4248 Spearman: 0.4078                                                                                                            
2022-11-01 13:48:53 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset in epoch 0 after 132 steps:                                                                     
2022-11-01 13:48:55 - Cosine-Similarity :       Pearson: 0.6880 Spearman: 0.6922                                                                                                            
2022-11-01 13:48:55 - Manhattan-Distance:       Pearson: 0.6940 Spearman: 0.7024                                                                                                            
2022-11-01 13:48:55 - Euclidean-Distance:       Pearson: 0.6928 Spearman: 0.7011                                                                                                            
2022-11-01 13:48:55 - Dot-Product-Similarity:   Pearson: 0.4372 Spearman: 0.4197  

It seems like Manhattan distance and Euclidean distance are better metrics than cosine similarity here.

For me, this result is really strange. Doesn't it indirectly mean that it would be better to use Manhattan or Euclidean distance in the loss function as well, i.e. during training? The only problem: these distances are not bounded between 0 and 1 the way cosine similarity is.

Is there a solution & explanation for this?
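For anyone wanting to reproduce this comparison outside the evaluator, here is a minimal numpy sketch of what `EmbeddingSimilarityEvaluator` reports: it scores each sentence pair under several metrics (negating the distances so that larger = more similar) and correlates those scores with the gold labels. The embeddings and gold scores below are synthetic stand-ins, not real model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for sentence-pair embeddings and gold STS scores
# (hypothetical data; in practice these come from model.encode()).
n, d = 50, 32
emb1 = rng.normal(size=(n, d))
emb2 = emb1 + rng.normal(scale=0.5, size=(n, d))       # correlated pairs
gold = -np.linalg.norm(emb1 - emb2, axis=1)            # synthetic gold labels

def cosine(a, b):
    return np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

cos_scores = cosine(emb1, emb2)
# Distances are negated so that higher score = more similar,
# making the Pearson correlations directly comparable.
neg_manhattan = -np.sum(np.abs(emb1 - emb2), axis=1)
neg_euclidean = -np.linalg.norm(emb1 - emb2, axis=1)

for name, scores in [("cosine", cos_scores),
                     ("manhattan", neg_manhattan),
                     ("euclidean", neg_euclidean)]:
    pearson = np.corrcoef(gold, scores)[0, 1]
    print(f"{name}: Pearson={pearson:.4f}")
```

The evaluator additionally reports Spearman (rank) correlation, which `scipy.stats.spearmanr` would give; the Pearson-only version above keeps the sketch dependency-free.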

PhilipMay commented 1 year ago

This is also connected to something strange I observed at SetFit here: https://github.com/huggingface/setfit/issues/135#issuecomment-1297000383

Maybe @nreimers could comment on this? :-)

TheTamT3 commented 1 year ago

I think it depends on the loss function. For example, if you train with a triplet loss using a distance metric, Manhattan or Euclidean distance will work better than cosine at evaluation time.
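To illustrate the point, here is a minimal numpy sketch of a margin-based triplet loss with a pluggable distance metric. This is not the sentence-transformers implementation (its `TripletLoss` takes a `distance_metric` argument for the same purpose); the function and the toy vectors below are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, distance="euclidean", margin=1.0):
    """Margin triplet loss: push the positive closer to the anchor than
    the negative by at least `margin`, under the chosen distance."""
    if distance == "euclidean":
        d = lambda a, b: np.linalg.norm(a - b, axis=-1)
    elif distance == "manhattan":
        d = lambda a, b: np.sum(np.abs(a - b), axis=-1)
    elif distance == "cosine":
        d = lambda a, b: 1.0 - np.sum(a * b, axis=-1) / (
            np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    else:
        raise ValueError(f"unknown distance: {distance}")
    # Loss is zero once the positive is closer than the negative by `margin`.
    return np.maximum(d(anchor, positive) - d(anchor, negative) + margin, 0.0)

a = np.array([[0.0, 0.0]])   # anchor
p = np.array([[1.0, 0.0]])   # positive, close to the anchor
n = np.array([[5.0, 5.0]])   # negative, far away
loss = triplet_loss(a, p, n, distance="manhattan", margin=1.0)
```

Training with such a loss shapes the embedding space around the chosen distance, so evaluating with the same (or a closely related) metric tends to score best.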