Open ozancaglayan opened 3 years ago
The STS tasks (like STS12) are split into several sub-datasets. In the original STS12 workshop, the main evaluation method was to combine all datasets into one long list of pairs, compute the similarity scores, and then compute the correlation with the gold scores.
SentEval deviates from this: it computes the correlation on each sub-dataset individually and then averages the results. This can have quite a big impact, either positive or negative, depending on the model.
In the SBERT paper, I followed the recommendation from the original STS shared task: combine all datasets, then compute one correlation score.
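To make the difference between the two aggregation schemes concrete, here is a minimal sketch. It is not the code from either paper or from SentEval. The function names, the toy data, and the use of a hand-written Pearson correlation are all assumptions made for illustration; a rank correlation such as `scipy.stats.spearmanr` could be swapped in for `corr`. The per-subset average is taken as an unweighted mean here, which is also an assumption.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def score_concatenated(subsets, corr=pearson):
    """Pool every (prediction, gold) pair, then compute one correlation."""
    preds = [p for sub_preds, _ in subsets.values() for p in sub_preds]
    golds = [g for _, sub_golds in subsets.values() for g in sub_golds]
    return corr(preds, golds)

def score_averaged(subsets, corr=pearson):
    """Compute one correlation per sub-dataset, then take the unweighted mean."""
    scores = [corr(preds, golds) for preds, golds in subsets.values()]
    return sum(scores) / len(scores)

# Made-up toy data: (predicted similarities, gold scores) per sub-dataset.
# Each subset is perfectly correlated on its own, but the predictions
# sit on different scales, so pooling them lowers the correlation.
toy = {
    "subset_a": ([0.1, 0.2, 0.3], [1.0, 2.0, 3.0]),
    "subset_b": ([0.7, 0.8, 0.9], [1.0, 2.0, 3.0]),
}

print("averaged:    ", score_averaged(toy))
print("concatenated:", score_concatenated(toy))
```

On this toy data the averaged score is much higher than the concatenated one, because a model whose scores are calibrated differently per subset is only penalized when the pairs are pooled. With other data the gap can go the other way.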
Thanks. This is what was done in the SimCSE paper, where it is referred to as the "All" approach. Their code implements this approach, but I still can't get your numbers. Do you have the code from your paper? If so, could you open-source it?
Have you tried the average GloVe embedding model provided in this repo? It removes stop-words, which improves the performance of average word embeddings quite a lot.
Sorry, I checked every folder in the codebase but couldn't find that model here.
https://www.sbert.net/docs/pretrained_models.html#average-word-embeddings-models
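For readers unfamiliar with the stop-word filtering mentioned above, the idea can be sketched as follows. This is only an illustration under stated assumptions: `TOY_VECTORS`, `STOP_WORDS`, and `embed` are hypothetical names, the two-dimensional vectors are made up, and the whitespace tokenizer and stop-word list are not the ones shipped with the linked model.

```python
# Hypothetical toy vocabulary; real GloVe vectors have far more dimensions.
TOY_VECTORS = {
    "the": [1.0, 0.0],
    "a": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "sleeps": [0.5, 0.5],
}
STOP_WORDS = {"the", "a"}  # assumed list, for illustration only
DIM = 2

def embed(sentence, remove_stop_words=True):
    """Average the word vectors of a sentence, optionally skipping stop-words."""
    tokens = sentence.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # Out-of-vocabulary tokens are silently ignored in this sketch.
    vectors = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vectors:
        return [0.0] * DIM
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

print(embed("the cat"))                           # stop-word dropped
print(embed("the cat", remove_stop_words=False))  # stop-word averaged in
```

Without filtering, frequent function words contribute to every sentence vector and pull unrelated sentences toward each other, which is one plausible reason the filtered model scores differently.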
Yes, this model can reproduce the results close to your paper. Thanks a lot.
@nreimers I actually have the opposite issue. The GloVe results I reproduced are better than the ones reported in the paper. I get 60.82 for STSb (58.02 reported) and 55.50 for SICK-R (53.76 reported). Is this because you made the stop-word improvement after the paper?
Yeah, I get 61.54 and I don't know why it's higher than the number in the paper. I wish there was a script for every result presented in the paper.
Hello,
I was reading the recent SimCSE paper, which refers to your paper when reporting the average GloVe embedding results for the STS benchmarks. I originally created an issue in their repository to ask, but this may be a better place. How can I reproduce those numbers? Using the SentEval copy in SimCSE's repository, I get very poor correlation numbers compared to your paper.
Thanks