UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

reproducing GloVe avg. emb results on STS #1165

Open ozancaglayan opened 3 years ago

ozancaglayan commented 3 years ago

Hello,

I was reading the recent SimCSE paper, which cites your paper when reporting the average GloVe embedding results on the STS benchmarks. I originally opened an issue in their repository, but this may be a better place to ask: how can I reproduce those numbers? Using the SentEval copy in SimCSE's repository, I get much worse correlation numbers than those reported in your paper.

Thanks

nreimers commented 3 years ago

The STS tasks (like STS12) are split into several sub-datasets. In the original STS12 workshop, the main evaluation method was to combine all sub-datasets into one long list of pairs, compute the similarity scores, and then compute the correlation with the gold scores.

SentEval deviates from this: it computes the correlation on each sub-dataset individually and then averages the results. Depending on the model, this can have quite a big impact, either positive or negative.

In the SBERT paper, I followed the recommendation from the original STS shared task: combine all sub-datasets and compute one correlation score.
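The difference between the two aggregation schemes can be sketched as follows. This is a toy illustration with made-up scores and sub-dataset names, not the actual STS data or the SBERT evaluation code:

```python
def spearman(gold, pred):
    """Spearman rank correlation for tie-free score lists (enough for this sketch)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rg, rp = ranks(gold), ranks(pred)
    n = len(gold)
    d2 = sum((a - b) ** 2 for a, b in zip(rg, rp))
    return 1 - 6 * d2 / (n * (n * n - 1))

# hypothetical (gold, predicted) similarity scores per sub-dataset
subsets = {
    "MSRpar": ([5.0, 3.0, 1.0, 4.0], [0.9, 0.5, 0.1, 0.8]),
    "MSRvid": ([2.0, 4.5, 0.5], [0.2, 0.7, 0.4]),
}

# "all" scheme (original STS12 workshop, SBERT paper):
# concatenate every sub-dataset, then compute a single correlation
gold_all = [g for gold, _ in subsets.values() for g in gold]
pred_all = [p for _, pred in subsets.values() for p in pred]
rho_all = spearman(gold_all, pred_all)

# "mean" scheme (SentEval): one correlation per sub-dataset, then average
rho_mean = sum(spearman(g, p) for g, p in subsets.values()) / len(subsets)
```

Even on this toy data the two schemes disagree (`rho_all` is about 0.857 while `rho_mean` is 0.75), which is exactly why results computed one way are not comparable to results computed the other way.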

ozancaglayan commented 3 years ago

Thanks. That is what was done in the SimCSE paper, referred to as the "all" approach. Their code implements this approach, but I still can't get your numbers. Do you have the code from your paper? If so, could you open-source it?

nreimers commented 3 years ago

Have you tried the average GloVe embedding model provided in this repo? It removes stop-words, which improves the performance of avg. word embeddings quite a lot.
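As a rough sketch of the idea (not the repo's actual implementation; the stop-word list and word vectors below are made up for illustration):

```python
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and"}

# toy lookup table standing in for pre-trained GloVe vectors
glove = {
    "cat": [0.1, 0.3],
    "sat": [0.2, 0.0],
    "mat": [0.0, 0.4],
    "the": [0.9, 0.9],  # very frequent word; mostly adds noise to the mean
}

def avg_embedding(sentence, drop_stop_words=True):
    """Average the word vectors of a sentence, optionally skipping stop-words."""
    tokens = sentence.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    vecs = [glove[t] for t in tokens if t in glove]
    if not vecs:  # no known words left: fall back to a zero vector
        return [0.0, 0.0]
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
```

Dropping stop-words keeps high-frequency function words from dominating the mean, which is why the two settings produce noticeably different sentence vectors.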

ozancaglayan commented 3 years ago

Sorry, I checked every folder in the codebase but couldn't find that model.

nreimers commented 3 years ago

https://www.sbert.net/docs/pretrained_models.html#average-word-embeddings-models

336655asd commented 2 years ago

https://www.sbert.net/docs/pretrained_models.html#average-word-embeddings-models

Yes, this model reproduces results close to the ones in your paper. Thanks a lot.

Zoher15 commented 1 year ago

@nreimers I actually have the opposite issue. The GloVe results I reproduced are better than the ones reported in the paper. I get 60.82 for STSb (58.02 reported) and 55.50 for SICK-R (53.76 reported). Is this because you made the stop-word improvement after the paper?

batubb commented 2 months ago

Yeah, I get 61.54, and I don't know why it's higher than the number in the paper. I wish there were a script for every result presented in the paper.