embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Issues with benchmarks.py #1467

Open Muennighoff opened 19 hours ago

Muennighoff commented 19 hours ago
isaac-chung commented 3 hours ago

Hmm nice catch!

  1. I agree that they should be swapped.
  2. I think MTEB_EN is the one from MMTEB. The real discrepancy seems to be that right now it only runs the subsets that include English. See STS17 below: I think in the original MTEB paper, all 11 subsets are run.

Zoomed-in:

STS
    - BIOSSES, s2s
    - SICK-R, s2s
    - STS12, s2s
    - STS13, s2s
    - STS14, s2s
    - STS15, s2s
    - STS16, s2s
    - STS17, s2s, multilingual 8 / 11 Subsets
    - STS22, p2p, multilingual 5 / 18 Subsets
    - STSBenchmark, s2s
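The "8 / 11 Subsets" count above comes from language filtering: only subsets whose language list contains English survive. This can be sketched in plain Python; the subset names below are the usual STS17 cross-lingual pairs (reconstructed here for illustration, not pulled from mteb's task config), and `filter_subsets` is a hypothetical helper, not an mteb API:

```python
# Hypothetical sketch of how language filtering trims STS17
# from 11 subsets down to the 8 that contain English.
STS17_SUBSETS = {
    "ar-ar": ["ara"], "en-ar": ["eng", "ara"], "en-de": ["eng", "deu"],
    "en-en": ["eng"], "en-tr": ["eng", "tur"], "es-en": ["spa", "eng"],
    "es-es": ["spa"], "fr-en": ["fra", "eng"], "it-en": ["ita", "eng"],
    "ko-ko": ["kor"], "nl-en": ["nld", "eng"],
}

def filter_subsets(subsets, language):
    """Keep only subsets whose language list includes `language`."""
    return {name: langs for name, langs in subsets.items() if language in langs}

english_only = filter_subsets(STS17_SUBSETS, "eng")
print(f"{len(english_only)} / {len(STS17_SUBSETS)} subsets")  # prints "8 / 11 subsets"
```

Under this reconstruction, the monolingual non-English pairs (ar-ar, es-es, ko-ko) are the three subsets dropped, which is consistent with the original paper running all 11.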
```
In [1]: import mteb
   ...: benchmark = mteb.get_benchmark("MTEB(eng, classic)")
   ...: evaluation = mteb.MTEB(tasks=benchmark)

In [2]: evaluation
Out[2]:

In [3]: benchmark
Out[3]:
Benchmark(
    name='MTEB(eng, classic)',
    tasks=MTEBTasks(
        AmazonCounterfactualClassification(name='AmazonCounterfactualClassification', languages=['eng']),
        AmazonPolarityClassification(name='AmazonPolarityClassification', languages=['eng']),
        AmazonReviewsClassification(name='AmazonReviewsClassification', languages=['eng']),
        ArguAna(name='ArguAna', languages=['eng']),
        ArxivClusteringP2P(name='ArxivClusteringP2P', languages=['eng']),
        ArxivClusteringS2S(name='ArxivClusteringS2S', languages=['eng']),
        AskUbuntuDupQuestions(name='AskUbuntuDupQuestions', languages=['eng']),
        BiossesSTS(name='BIOSSES', languages=['eng']),
        Banking77Classification(name='Banking77Classification', languages=['eng']),
        BiorxivClusteringP2P(name='BiorxivClusteringP2P', languages=['eng']),
        BiorxivClusteringS2S(name='BiorxivClusteringS2S', languages=['eng']),
        CQADupstackAndroidRetrieval(name='CQADupstackAndroidRetrieval', languages=['eng']),
        CQADupstackEnglishRetrieval(name='CQADupstackEnglishRetrieval', languages=['eng']),
        CQADupstackGamingRetrieval(name='CQADupstackGamingRetrieval', languages=['eng']),
        CQADupstackGisRetrieval(name='CQADupstackGisRetrieval', languages=['eng']),
        CQADupstackMathematicaRetrieval(name='CQADupstackMathematicaRetrieval', languages=['eng']),
        CQADupstackPhysicsRetrieval(name='CQADupstackPhysicsRetrieval', languages=['eng']),
        CQADupstackProgrammersRetrieval(name='CQADupstackProgrammersRetrieval', languages=['eng']),
        CQADupstackStatsRetrieval(name='CQADupstackStatsRetrieval', languages=['eng']),
        CQADupstackTexRetrieval(name='CQADupstackTexRetrieval', languages=['eng']),
        CQADupstackUnixRetrieval(name='CQADupstackUnixRetrieval', languages=['eng']),
        CQADupstackWebmastersRetrieval(name='CQADupstackWebmastersRetrieval', languages=['eng']),
        CQADupstackWordpressRetrieval(name='CQADupstackWordpressRetrieval', languages=['eng']),
        ClimateFEVER(name='ClimateFEVER', languages=['eng']),
        DBPedia(name='DBPedia', languages=['eng']),
        EmotionClassification(name='EmotionClassification', languages=['eng']),
        FEVER(name='FEVER', languages=['eng']),
        FiQA2018(name='FiQA2018', languages=['eng']),
        HotpotQA(name='HotpotQA', languages=['eng']),
        ImdbClassification(name='ImdbClassification', languages=['eng']),
        MSMARCO(name='MSMARCO', languages=['eng']),
        MTOPDomainClassification(name='MTOPDomainClassification', languages=['eng']),
        MTOPIntentClassification(name='MTOPIntentClassification', languages=['eng']),
        MassiveIntentClassification(name='MassiveIntentClassification', languages=['eng']),
        MassiveScenarioClassification(name='MassiveScenarioClassification', languages=['eng']),
        MedrxivClusteringP2P(name='MedrxivClusteringP2P', languages=['eng']),
        MedrxivClusteringS2S(name='MedrxivClusteringS2S', languages=['eng']),
        MindSmallReranking(name='MindSmallReranking', languages=['eng']),
        NFCorpus(name='NFCorpus', languages=['eng']),
        NQ(name='NQ', languages=['eng']),
        QuoraRetrieval(name='QuoraRetrieval', languages=['eng']),
        RedditClustering(name='RedditClustering', languages=['eng']),
        RedditClusteringP2P(name='RedditClusteringP2P', languages=['eng']),
        SCIDOCS(name='SCIDOCS', languages=['eng']),
        SickrSTS(name='SICK-R', languages=['eng']),
        STS12STS(name='STS12', languages=['eng']),
        STS13STS(name='STS13', languages=['eng']),
        STS14STS(name='STS14', languages=['eng']),
        STS15STS(name='STS15', languages=['eng']),
        STS16STS(name='STS16', languages=['eng']),
        STS17Crosslingual(name='STS17', languages=['ara', 'deu', 'eng', '...']),
        STS22CrosslingualSTS(name='STS22', languages=['cmn', 'deu', 'eng', '...']),
        STSBenchmarkSTS(name='STSBenchmark', languages=['eng']),
        SciDocsReranking(name='SciDocsRR', languages=['eng']),
        SciFact(name='SciFact', languages=['eng']),
        SprintDuplicateQuestionsPC(name='SprintDuplicateQuestions', languages=['eng']),
        StackExchangeClustering(name='StackExchangeClustering', languages=['eng']),
        StackExchangeClusteringP2P(name='StackExchangeClusteringP2P', languages=['eng']),
        StackOverflowDupQuestions(name='StackOverflowDupQuestions', languages=['eng']),
        SummEvalSummarization(name='SummEval', languages=['eng']),
        TRECCOVID(name='TRECCOVID', languages=['eng']),
        Touche2020v3Retrieval(name='Touche2020Retrieval.v3', languages=['eng']),
        ToxicConversationsClassification(name='ToxicConversationsClassification', languages=['eng']),
        TweetSentimentExtractionClassification(name='TweetSentimentExtractionClassification', languages=['eng']),
        TwentyNewsgroupsClustering(name='TwentyNewsgroupsClustering', languages=['eng']),
        TwitterSemEval2015PC(name='TwitterSemEval2015', languages=['eng']),
        TwitterURLCorpusPC(name='TwitterURLCorpus', languages=['eng'])),
    description='The original English benchmarks by Muennighoff et al., (2023).',
    reference=None,
    citation='@inproceedings{muennighoff-etal-2023-mteb,\n title = "{MTEB}: Massive Text Embedding Benchmark",\n author = "Muennighoff, Niklas and\n Tazi, Nouamane and\n Magne, Loic and\n Reimers, Nils",\n editor = "Vlachos, Andreas and\n Augenstein, Isabelle",\n booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",\n month = may,\n year = "2023",\n address = "Dubrovnik, Croatia",\n publisher = "Association for Computational Linguistics",\n url = "https://aclanthology.org/2023.eacl-main.148",\n doi = "10.18653/v1/2023.eacl-main.148",\n pages = "2014--2037",\n}\n')

In [4]: model_name = "average_word_embeddings_komninos"

In [5]: from sentence_transformers import SentenceTransformer

In [6]: model = SentenceTransformer(model_name)

In [7]: results = evaluation.run(model)
──────────────── Selected tasks ────────────────

Classification
    - AmazonCounterfactualClassification, s2s, multilingual 2 / 4 Subsets
    - AmazonPolarityClassification, p2p
    - AmazonReviewsClassification, s2s, multilingual 1 / 6 Subsets
    - Banking77Classification, s2s
    - EmotionClassification, s2s
    - ImdbClassification, p2p
    - MTOPDomainClassification, s2s, multilingual 1 / 6 Subsets
    - MTOPIntentClassification, s2s, multilingual 1 / 6 Subsets
    - MassiveIntentClassification, s2s, multilingual 1 / 51 Subsets
    - MassiveScenarioClassification, s2s, multilingual 1 / 51 Subsets
    - ToxicConversationsClassification, s2s
    - TweetSentimentExtractionClassification, s2s

Clustering
    - ArxivClusteringP2P, p2p
    - ArxivClusteringS2S, s2s
    - BiorxivClusteringP2P, p2p
    - BiorxivClusteringS2S, s2s
    - MedrxivClusteringP2P, p2p
    - MedrxivClusteringS2S, s2s
    - RedditClustering, s2s
    - RedditClusteringP2P, p2p
    - StackExchangeClustering, s2s
    - StackExchangeClusteringP2P, p2p
    - TwentyNewsgroupsClustering, s2s

PairClassification
    - SprintDuplicateQuestions, s2s
    - TwitterSemEval2015, s2s
    - TwitterURLCorpus, s2s

Reranking
    - AskUbuntuDupQuestions, s2s
    - MindSmallReranking, s2s
    - SciDocsRR, s2s
    - StackOverflowDupQuestions, s2s

Retrieval
    - ArguAna, s2p
    - CQADupstackAndroidRetrieval, s2p
    - CQADupstackEnglishRetrieval, s2p
    - CQADupstackGamingRetrieval, s2p
    - CQADupstackGisRetrieval, s2p
    - CQADupstackMathematicaRetrieval, s2p
    - CQADupstackPhysicsRetrieval, s2p
    - CQADupstackProgrammersRetrieval, s2p
    - CQADupstackStatsRetrieval, s2p
    - CQADupstackTexRetrieval, s2p
    - CQADupstackUnixRetrieval, s2p
    - CQADupstackWebmastersRetrieval, s2p
    - CQADupstackWordpressRetrieval, s2p
    - ClimateFEVER, s2p
    - DBPedia, s2p
    - FEVER, s2p
    - FiQA2018, s2p
    - HotpotQA, s2p
    - MSMARCO, s2p
    - NFCorpus, s2p
    - NQ, s2p
    - QuoraRetrieval, s2s
    - SCIDOCS, s2p
    - SciFact, s2p
    - TRECCOVID, s2p
    - Touche2020Retrieval.v3, s2p

STS
    - BIOSSES, s2s
    - SICK-R, s2s
    - STS12, s2s
    - STS13, s2s
    - STS14, s2s
    - STS15, s2s
    - STS16, s2s
    - STS17, s2s, multilingual 8 / 11 Subsets
    - STS22, p2p, multilingual 5 / 18 Subsets
    - STSBenchmark, s2s

Summarization
    - SummEval, p2p
```