UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Cannot reproduce results of SBERT-STSb-base on STS benchmark #800

Closed LittletreeZou closed 3 years ago

LittletreeZou commented 3 years ago

Hi Nils,

Recently I read your papers about SBERT and found them very interesting. But I ran into some problems when trying to reproduce some of the experimental results in your paper.

In the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, the SBERT model trained on the STS benchmark dataset (SBERT-STSb-base) achieved a Spearman correlation of 84.67, slightly better than BERT trained on the same data (BERT-STSb-base). I can reproduce the BERT results, but I cannot reproduce the SBERT results on the STS benchmark. I tried your code and also implemented my own version following your idea, but both performed badly (Spearman correlation 63.61).

In the paper Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks, there is a direct comparison of BERT's and SBERT's performance on the STS benchmark. There, SBERT reaches a Spearman correlation of about 75 when using all the training data, which seems more reasonable.

[screenshot: results table from the Augmented SBERT paper]

So how can a Siamese BERT fine-tuned only on the STS benchmark dataset achieve a Spearman correlation of 84.67? Could you share your hyperparameters? Or are there any tricks in training? Thank you very much.

nreimers commented 3 years ago

You can find the code here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py
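
In short, that script builds a Siamese network (a BERT encoder followed by mean pooling) and regresses the cosine similarity of the two sentence embeddings against the gold STSb score. Below is a minimal sketch of that setup; the real script loads the full STSb CSVs, whereas the two inline examples here are just stand-ins for the training data:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Siamese encoder: BERT with mean pooling over the token embeddings
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# STSb gold scores are 0-5; normalize to [0, 1] for the cosine-similarity regression
train_examples = [
    InputExample(texts=["A plane is taking off.",
                        "An air plane is taking off."], label=5.0 / 5.0),
    InputExample(texts=["A man is playing a flute.",
                        "A man is playing a guitar."], label=1.2 / 5.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss: MSE between cos(u, v) and the normalized gold score
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=4,
          warmup_steps=100,
          output_path="output/training_stsbenchmark")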

Using bert-base-uncased as the model, it yields:

2021-03-09 08:55:08 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2021-03-09 08:55:11 - Cosine-Similarity :       Pearson: 0.8467 Spearman: 0.8403
2021-03-09 08:55:11 - Manhattan-Distance:       Pearson: 0.8252 Spearman: 0.8200
2021-03-09 08:55:11 - Euclidean-Distance:       Pearson: 0.8259 Spearman: 0.8208
2021-03-09 08:55:11 - Dot-Product-Similarity:   Pearson: 0.7491 Spearman: 0.7419
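
Those numbers are printed by EmbeddingSimilarityEvaluator on the STSb test split. A minimal sketch of producing them, assuming the model was saved to the output path from the training sketch above (the single inline test pair is a stand-in for the real test set):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("output/training_stsbenchmark")

# Parallel lists from the STSb test split; gold scores normalized to [0, 1]
sentences1 = ["A plane is taking off."]
sentences2 = ["An air plane is taking off."]
scores = [1.0]

test_evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, scores, name="sts-test")
test_evaluator(model)  # logs Pearson/Spearman for cosine, Manhattan, Euclidean and dot-product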

Scores can be a little lower than what is reported in the paper. For the paper, we used pytorch-pretrained-BERT, which was the original name of the huggingface transformers repository. With version 1 (or 2?) of that repository, performance dropped slightly for several projects for unclear reasons. When using the old sentence-transformers and pytorch-pretrained-BERT versions, the results are still reproducible.

Not sure about the results in the Augmented SBERT paper. I need to discuss this with Nandan.

LittletreeZou commented 3 years ago

I got the result now. Thanks a lot!