embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Discussion regarding Quora and its place as a Retrieval task #1050

Closed tomaarsen closed 1 day ago

tomaarsen commented 5 days ago

Hello!

Although the Quora dataset (quora, quora-qrels) consists of queries like you'd find in retrieval, it really doesn't correspond to a normal query-passage retrieval task whatsoever. It corresponds much more closely to e.g. STS, although the dataset format doesn't match it.

It's rather hard to argue for moving one of the BEIR datasets out of the MTEB Retrieval tab (right now we can conveniently say that MTEB Retrieval == BEIR), but it's perhaps something to be mindful of.

See also this discussion about its place in MTEB, which started after the Quora dataset was used as a "retrieval benchmark": https://x.com/jobergum/status/1809157587612336402

cc @jobergum @bclavie @bwanglzu

KennethEnevoldsen commented 5 days ago

Thanks for opening this @tomaarsen.

We are currently creating MTEB lite, so this might be a reasonable time to consider removing it from the lite version (see also #837). From #837 it seems to be one of the more redundant tasks. I have also considered removing some of the common training datasets like MSMARCO to make it a few-shot benchmark (without removing all models). WDYT? (cc @vaibhavad)

@Muennighoff, I would love your opinion on this as well.

Muennighoff commented 4 days ago

Great point! I think of it in terms of asymmetric vs symmetric retrieval tasks - From SGPT:

Asymmetric Search means queries and documents are not interchangeable. Finding answers given a question is an asymmetric search problem. Commonly, documents are much longer than queries [44]. We evaluate asymmetric search experiments on BEIR [44], a recently proposed benchmark consisting of 19 asymmetric search datasets. Symmetric Search means queries and documents are interchangeable. Finding duplicate questions, where both queries and documents are questions, is a symmetric search problem. We evaluate symmetric search experiments on USEB [49], Quora from BEIR [44] and STS-B [7]. In Quora, queries are question titles and documents are question texts. They are often the same with average word lengths of 9.53 and 11.44, respectively [44]. Hence, we consider it more of a symmetric search task. We include Quora in both symmetric and asymmetric experiments.

The MTEB paper also mentions this w.r.t. Quora. I do think that Quora is a "Retrieval" task, where by "Retrieval" I mean that, given some input, I'd like to find a fitting result from some corpus, which is still the case here. Quora probably uses something like this in production: when a user enters a question, it retrieves similar questions in case it has already been answered. A QuoraSTS task would no longer represent this task well, I think. Also, as the snippet mentions, Quora is not even purely symmetric, as the documents are question texts while the queries are the titles.

I don't think we have other symmetric retrieval tasks, but if we did, it could be worth recording this in the metadata and maybe allowing for separate leaderboards.
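
To make the symmetric vs asymmetric distinction a bit more concrete, here is a rough sketch of the length comparison the SGPT snippet relies on (average query length vs average document length, in words). This is only an illustration: the function names, the toy texts, and the `max_ratio` threshold are all made up for this example and are not part of MTEB, BEIR, or SGPT. For reference, the Quora numbers quoted above (9.53 and 11.44 words) give a ratio of roughly 1.2.

```python
from statistics import mean


def avg_word_len(texts):
    """Average number of whitespace-separated words per text."""
    return mean(len(t.split()) for t in texts)


def length_ratio(queries, documents):
    """Average document length divided by average query length (in words)."""
    return avg_word_len(documents) / avg_word_len(queries)


def looks_symmetric(queries, documents, max_ratio=2.0):
    """Crude heuristic: queries and documents of similar length.

    max_ratio is an arbitrary, illustrative threshold (an assumption),
    not a value taken from SGPT, BEIR, or MTEB.
    """
    return length_ratio(queries, documents) <= max_ratio


# Toy data (hypothetical, not taken from any real dataset).
quora_like_queries = [
    "How do I learn Python quickly?",
    "What is the capital of France?",
]
quora_like_docs = [
    "What is the fastest way to learn Python?",
    "Which city is the capital of France?",
]

qa_like_queries = ["capital of France"]
qa_like_docs = [
    "Paris is the capital and most populous city of France, "
    "located on the river Seine in the north of the country.",
]

print(looks_symmetric(quora_like_queries, quora_like_docs))
print(looks_symmetric(qa_like_queries, qa_like_docs))
```

Length alone obviously doesn't settle whether a task is symmetric (interchangeability of queries and documents is the actual criterion), so a check like this could at most be a hint when deciding what to put in the metadata.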

jobergum commented 2 days ago

I think it's fine. It is a retrieval task.

You cannot prevent organizations or people from using subsets of BEIR/MTEB in marketing without explaining what the dataset actually is.

bwanglzu commented 2 days ago

Thanks @tomaarsen for creating the issue!

In general, I personally believe it's closer to an STS task, but I get your point @Muennighoff! I suggest we leave it as it is and see if the discussion comes back in the future.

bclavie commented 2 days ago

Thanks @tomaarsen!

While I did somewhat spark this discussion, I kind of agree with everyone. It's not a super representative retrieval task, but I think it's valid to interpret it as both STS and retrieval.

As part of a varied benchmark, it doesn't bother me, and I even think it has great value there. Sadly, like @jobergum said, there's no way to stop anyone from cherry picking results for marketing purposes, so it shouldn't impact benchmark design decisions.

tomaarsen commented 2 days ago

I think we're all mostly in agreement that, although it's perhaps not ideal, we should keep the dataset as-is. I also think there's not much benefit in a separate leaderboard for symmetric retrieval. For MTEB lite we can certainly remove it, though. It's one of the less interesting retrieval tasks, after all.

I'm happy to close this now.

KennethEnevoldsen commented 1 day ago

Thanks for bringing up the discussion @tomaarsen. I will remove it from the pool of tasks that MTEB lite selects from. I'll close this issue for now, but do feel free to add any additional comments you might have.