embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Add `descriptive_stats` to all tasks #1475

Open Samoed opened 6 days ago

Samoed commented 6 days ago

I've added descriptive statistics to almost all tasks, but I need help running some of them. To do that, you can run the following script from the v2.0.0 branch:

import mteb
from tqdm import tqdm

for task_name in tqdm(
    [
        "FEVER",
        "HotpotQA",
        "MSMARCO",
        "MSMARCOv2",
        "TopiOCQA",
        "MIRACLRetrieval",
        "MrTidyRetrieval",
        "BrightRetrieval",
        "MultiLongDocRetrieval",
        "NeuCLIR2022Retrieval",
        "NeuCLIR2023Retrieval",
        "BibleNLPBitextMining",
        "FloresBitextMining",
        "SwissJudgementClassification",
        "MultiEURLEXMultilabelClassification",
        "MindSmallReranking",
        "WebLINXCandidatesReranking",
        "VoyageMMarcoReranking",
        "MIRACLReranking",
    ]
):
    task = mteb.get_task(task_name)
    # compute the task's descriptive statistics
    stat = task.calculate_metadata_metrics()
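
If it helps, here's a minimal sketch for persisting each result as soon as it is computed, so a crash on one task doesn't lose the others (the output directory is illustrative, and it assumes calculate_metadata_metrics() returns the stats dict rather than only writing it somewhere):

import json
from pathlib import Path

import mteb

out_dir = Path("descriptive_stats")  # illustrative output location
out_dir.mkdir(exist_ok=True)

for task_name in ["NeuCLIR2022Retrieval", "NeuCLIR2023Retrieval"]:
    task = mteb.get_task(task_name)
    stats = task.calculate_metadata_metrics()
    # dump each task's stats to its own file as soon as they are computed
    (out_dir / f"{task_name}.json").write_text(json.dumps(stats, indent=2, default=str))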

cc @imenelydiaker

imenelydiaker commented 6 days ago

Hi @Samoed, thanks for all the efforts you put in! I'll run what's missing 🙂

imenelydiaker commented 6 days ago

@Samoed question: do you know why we don't have the number of qrels for retrieval datasets?

E.g., for MIRACLRetrievalHardNegatives we have num_queries and num_documents but not the number of qrels. We could maybe infer an average from average_relevant_docs_per_query, but any idea why we don't have the exact number?

{
 'number_of_characters': 983901912,
 'num_samples': 2460458,
 'num_queries': 11076,
 'num_documents': 2449382,
 'min_document_length': 5,
 'average_document_length': 0.1694358005407078,
 'max_document_length': 176,
 'unique_documents': 2449382,
 'min_query_length': 1,
 'average_query_length': 88794.41124954858,
 'max_query_length': 48538,
 'unique_queries': 11076,
 'min_relevant_docs_per_query': 1,
 'average_relevant_docs_per_query': 2.3643011917659806,
 'max_relevant_docs_per_query': 20,
 'unique_relevant_docs': 98836,
 'num_instructions': None,
 'min_instruction_length': None,
 'average_instruction_length': None,
 'max_instruction_length': None,
 'unique_instructions': None,
 'min_top_ranked_per_query': None,
...
}
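
For the exact qrel count, a minimal sketch (assuming a loaded retrieval task exposes relevant_docs as a split -> {query_id: {doc_id: score}} mapping, as the retrieval AbsTask does; multilingual tasks may nest an extra language level):

import mteb

task = mteb.get_task("MIRACLRetrievalHardNegatives")
task.load_data()

for split, qrels in task.relevant_docs.items():
    # each query maps to a dict of judged documents, so the number of
    # qrels is the total count of judged docs across all queries
    num_qrels = sum(len(judged) for judged in qrels.values())
    print(split, num_qrels)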

Samoed commented 6 days ago

I forgot to add them 😅. Can you add the missing fields? I can recompute the rest of the tasks.

dokato commented 6 days ago

@imenelydiaker how is this one going? do you need any help with that?

imenelydiaker commented 6 days ago

> @imenelydiaker how is this one going? do you need any help with that?

I launched it an hour ago and it's running; I'm at MSMARCOv2, the datasets are quite huge. Maybe we can split the work? Do you think you can do the datasets starting from NeuCLIR*?

imenelydiaker commented 6 days ago

@dokato if you still want to run some datasets, you'll need the code I added here to get the number of qrels.

dokato commented 6 days ago

So just check out the branch above and run the script from NeuCLIR*? Sure, I can do that!

imenelydiaker commented 6 days ago

> So just check out the branch above and run the script from NeuCLIR*? Sure, I can do that!

Yes, but it's better if you create a new branch from it so we won't have conflicts (because I'm working on it too). For the PR, open it directly against the v2.0.0 branch, thank you! 🙂

dokato commented 4 days ago

@imenelydiaker @Samoed So in this PR I added 7/10 of the datasets from my part (starting from NeuCLIR2022Retrieval). Sadly, the datasets below failed repeatedly due to lack of memory:

Currently I only have access to a D8s_v3 VM with 32 GB of RAM, 8 CPUs, and a 120 GB SSD, so I'd say not too bad. Maybe some optimizations are needed, or an even beefier machine. I'll be OOO for the next 4 days, but I can take another look at those if you have some pointers.
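
For the OOM failures, one possible direction (a sketch only, not mteb's actual loading path; the dataset path and text field below are placeholders) is to stream the corpus with Hugging Face datasets and aggregate the length statistics incrementally instead of materializing everything in RAM:

from datasets import load_dataset

# Placeholder dataset path/field, for illustration only.
corpus = load_dataset("some-org/some-corpus", split="train", streaming=True)

n_docs = 0
total_chars = 0
min_len, max_len = float("inf"), 0
for doc in corpus:
    n = len(doc.get("text", ""))
    n_docs += 1
    total_chars += n
    min_len = min(min_len, n)
    max_len = max(max_len, n)

print({
    "num_documents": n_docs,
    "average_document_length": total_chars / max(n_docs, 1),
    "min_document_length": min_len,
    "max_document_length": max_len,
})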

Samoed commented 4 days ago

I can try to run them on my work machine.

imenelydiaker commented 4 days ago

@dokato thank you so much! @Samoed let me know if you can't run them 🙂

Samoed commented 2 days ago

I found a bug in the statistics calculation (I accidentally swapped docs and queries when calculating lengths) and will recalculate the incorrect stats.
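
That would explain the suspicious numbers in the MIRACLRetrievalHardNegatives stats above, where average_document_length (0.17) and average_query_length (88794) look inverted. A hypothetical illustration of that kind of swap (not the actual mteb code):

queries = {"q1": "what is mteb?", "q2": "benchmark for text embeddings"}
corpus = {"d1": "MTEB is a massive text embedding benchmark ...", "d2": "..."}

# Buggy version: each average is computed over the wrong collection.
# avg_query_length = sum(len(d) for d in corpus.values()) / len(corpus)
# avg_document_length = sum(len(q) for q in queries.values()) / len(queries)

# Fixed version: each statistic is computed over its own collection.
avg_query_length = sum(len(q) for q in queries.values()) / len(queries)
avg_document_length = sum(len(d) for d in corpus.values()) / len(corpus)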