embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Aggregating MMTEB datasets #354

Open orionw opened 5 months ago

orionw commented 5 months ago

Some MMTEB datasets include only a subset of their available languages.

For example, the multilingual MIRACL task in MMTEB contains only German and Spanish, although MIRACL covers many more languages. Others, like Korean Mr. TyDi, are included as standalone tasks even though Mr. TyDi is multilingual. The same goes for Korean MIRACL, which is included separately from the multilingual MIRACL task.

These should be consolidated into multilingual datasets covering all their available languages (18 in MIRACL and 11 in Mr. TyDi).

I can start working on this next week, unless someone beats me to it.

orionw commented 5 months ago

Assuming @KennethEnevoldsen agrees

izhx commented 5 months ago

Nandan Thakur may have already started on MIRACL: https://github.com/embeddings-benchmark/mteb/issues/198#issuecomment-2050189570

Merging Mr.TyDI seems like a good idea. I think we would prefer not to run these massive retrieval tasks too many times...

KennethEnevoldsen commented 5 months ago

Assuming @KennethEnevoldsen agrees

Completely agree, we should merge the relevant datasets. Feel free to merge Mr.TyDI

KennethEnevoldsen commented 5 months ago

I think we would prefer not to run these massive retrieval tasks too many times...

If the dataset is massive we might consider reducing the size as well.

orionw commented 5 months ago

On a closer reading of the MiRACL paper, it seems like they expanded the Mr. TyDi work with more annotations. Assuming that is true and they have the same query set (@thakur-nandan) I would suggest we remove any Mr. TyDi datasets and only include MiRACL.

thakur-nandan commented 5 months ago

Hi @orionw and others, I wasn't aware of this issue thread. Including @crystina-z in the thread as well.

Yes, my suggestion is what @orionw suggested. MIRACL was built on top of Mr. TyDi: it covers more languages and has denser annotations, and the queries present in Mr. TyDi are already available within MIRACL. So, include MIRACL instead of Mr. TyDi in MMTEB.

@orionw instead of the MIRACL arXiv preprint, you can read the latest TACL paper, which has more information: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00595/117438/MIRACL-A-Multilingual-Retrieval-Dataset-Covering

crystina-z commented 5 months ago

heyy! Glad to see both datasets are being considered! Yes, MIRACL expands on Mr. TyDi, so it makes sense to include only MIRACL for the evaluation.

Assuming that is true and they have the same query set

Just to add on this: although MIRACL includes most (>90% for most languages) of the dev queries in Mr. TyDi, they are not completely identical: some queries were skipped since we couldn't find a positive passage for them in the MIRACL corpus. Here are the stats (number of queries in the dev set):

| lang | MIRACL (dev) | Mr. TyDi (dev) | Percentage |
|------|--------------|----------------|------------|
| ar | 2,896 | 3,115 | 93.0% |
| bn | 411 | 440 | 93.4% |
| en | 799 | 878 | 91.0% |
| fi | 1,271 | 1,738 | 73.1% |
| id | 960 | 1,224 | 78.4% |
| ja | 860 | 928 | 92.7% |
| ko | 213 | 303 | 70.3% |
| ru | 1,252 | 1,375 | 91.1% |
| sw | 482 | 526 | 91.6% |
| te | 828 | 983 | 84.2% |
| th | 733 | 807 | 90.8% |
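
The Percentage column above is just the ratio of the two dev-set counts; a quick check of a few rows:

```python
# Sanity check of the Percentage column:
# MIRACL dev queries / Mr. TyDi dev queries, rounded to one decimal.
pairs = {"ar": (2896, 3115), "bn": (411, 440), "ko": (213, 303)}
for lang, (miracl, mrtydi) in pairs.items():
    print(f"{lang}: {100 * miracl / mrtydi:.1f}%")  # ar: 93.0%, bn: 93.4%, ko: 70.3%
```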
izhx commented 5 months ago

I think we would prefer not to run these massive retrieval tasks too many times...

If the dataset is massive we might consider reducing the size as well.

I can run some filtering to find how many docs are considered irrelevant to any query by most embedding models and sparse retrievers. I think we could remove those docs and create a slim version of the collection (corpus).

I'll come back with the statistics.

orionw commented 5 months ago

Thanks @izhx!

I think I missed some of the earlier conversations about MMTEB, but in my experience, removing non-relevant passages can greatly change the difficulty of the retrieval task (by removing difficult non-relevant documents). AFAIK, the tasks in MTEB still have their original corpus sizes, in the millions for some.

If we need to remove something from retrieval datasets to make them run faster, I would suggest the queries. MIRACL has ~1k or fewer queries per language; I don't know if that's too large for our usage, but it's on par with many tasks in MTEB. Did we decide on some size threshold earlier?

izhx commented 5 months ago

@orionw No, we don't have a threshold on corpus size. It was just a random thought of mine.

I apologize if I haven't made myself clear. What I intend to do is eliminate the documents that are least relevant to any query for most models (very low embedding dot score, e.g. ranked 5M out of all 8M docs). I suppose this wouldn't touch the hard-negative docs, so it might not affect the difficulty of the task (too much).

Currently, in the retrieval evaluator, we first encode a small number of queries (seconds), then encode and search through a massive corpus in chunks (the most expensive part, up to hundreds of minutes). Removing these easy negative docs could be an effective way to improve efficiency with minimal impact on everything else.

This is just my initial understanding. Welcome further discussion


I'm not sure how many of these seemingly dispensable (to my intuition) docs there are; probably not many.
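
A rough sketch of the pruning idea (not the mteb implementation): score every doc against every query, keep the top-N by best score, and treat the rest as easy negatives that could be dropped. Toy vectors stand in for real model embeddings.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def prune_corpus(doc_embs, query_embs, keep):
    """Return indices of the `keep` docs with the highest max dot score
    against any query; everything else scored low for every query."""
    best = [max(dot(d, q) for q in query_embs) for d in doc_embs]
    order = sorted(range(len(doc_embs)), key=lambda i: -best[i])
    return order[:keep]

# Toy example: doc 0 matches the query closely, doc 2 is an easy negative.
docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
queries = [[1.0, 0.1]]
print(prune_corpus(docs, queries, keep=2))  # [0, 1]
```

The hope (as stated above) is that hard negatives score highly for at least one query and therefore survive the cut, while only documents that score low for every query get removed.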

crystina-z commented 5 months ago

Hi @izhx @orionw! @thakur-nandan and I had a quick discussion, and we vote for not touching the size of the corpus or queries, as that would make the evaluation scores no longer comparable to the current MIRACL benchmark.

What do you think of an alternative solution where we only include the languages with a large corpus (e.g., en, fr, es, ru, ja, zh) in the reranking task but not the retrieval task? That way we save time on retrieval but still have full language coverage on reranking.

orionw commented 5 months ago

I would also prefer not changing them, if possible, for the same reasons. Maybe we add them all for now, and if inference is too slow once the benchmark is complete, it's easy to remove them from retrieval?

crystina-z commented 5 months ago

Yup that sounds good to us

izhx commented 5 months ago

Thanks, I agree!

taeminlee commented 5 months ago

I fully agree with the integration of Korean Miracl into MMTEB's Multilingual Miracl.


MIRACL's and Mr. TyDi's documents have a lot of duplication because they use almost the same Wikipedia snapshot. I haven't seen the full MTEB source code yet, so I don't know if something like this already exists, but if we implemented a caching mechanism for documents and queries, running the two datasets might cost almost as much as running one, rather than the sum of the two.

If implemented, I think it would be ideal to use a key-value store like Redis, where the key is the hash of the document and the value is the embedding model's vector. For simplicity of implementation, diskcache might also be worth considering.
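
A minimal sketch of that cache, assuming an `embed` stand-in for a real model; the plain dict could be swapped for a `diskcache.Cache` or a Redis client keyed the same way:

```python
import hashlib

cache: dict[str, list[float]] = {}
calls = 0  # counts how often the (expensive) encoder actually runs

def embed(text: str) -> list[float]:
    """Dummy encoder standing in for a real embedding model."""
    global calls
    calls += 1
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

def cached_embed(text: str) -> list[float]:
    """Encode `text` only if its hash hasn't been seen before."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = embed(text)
    return cache[key]

docs = ["shared wikipedia passage", "unique passage", "shared wikipedia passage"]
vecs = [cached_embed(d) for d in docs]
print(calls)  # 2: the duplicated passage hits the cache
```

With MIRACL and Mr. TyDi drawn from nearly the same Wikipedia snapshot, shared passages would be encoded once and reused across both evaluations.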

orionw commented 5 months ago

Great idea @taeminlee! I wonder if there are other datasets with similar redundancies we might be able to take advantage of as well. Probably worth making a separate issue and we can track them down there!

KennethEnevoldsen commented 5 months ago

I have created an issue over at #375

crystina-z commented 5 months ago

Brilliant idea! If the cache is implemented, we'll save over 90% of the space for Mr. TyDi and MIRACL.