Closed tomaarsen closed 4 months ago
Thanks for opening this @tomaarsen.
We are currently creating MTEB lite; this might be a reasonable time to consider removing it for the lite version (see also #837). From #837 it seems like one of the more redundant tasks. I have also considered removing some of the common training datasets like MSMARCO to make it a few-shot benchmark (without removing all models). WDYT? (cc @vaibhavad)
@Muennighoff would love your opinion on this as well?
Great point! I think of it in terms of asymmetric vs symmetric retrieval tasks - From SGPT:
Asymmetric Search means queries and documents are not interchangeable. Finding answers given a question is an asymmetric search problem. Commonly, documents are much longer than queries [44]. We evaluate asymmetric search experiments on BEIR [44], a recently proposed benchmark consisting of 19 asymmetric search datasets. Symmetric Search means queries and documents are interchangeable. Finding duplicate questions, where both queries and documents are questions, is a symmetric search problem. We evaluate symmetric search experiments on USEB [49], Quora from BEIR [44] and STS-B [7]. In Quora, queries are question titles and documents are question texts. They are often the same with average word lengths of 9.53 and 11.44, respectively [44]. Hence, we consider it more of a symmetric search task. We include Quora in both symmetric and asymmetric experiments.
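The length heuristic from the quote can be made concrete with a small sketch. Note this is my own illustration: the function names and the 1.5 cutoff are assumptions for demonstration, not anything from SGPT or MTEB:

```python
# Illustrative heuristic (not from SGPT/MTEB): call a retrieval dataset
# "symmetric" when queries and documents have comparable average lengths,
# as the SGPT quote argues for Quora (9.53 vs 11.44 average words).

def avg_words(texts):
    """Average word count over a list of strings."""
    return sum(len(t.split()) for t in texts) / len(texts)

def looks_symmetric(queries, documents, max_ratio=1.5):
    """Treat the task as symmetric if the longer side's average
    length is at most max_ratio times the shorter side's."""
    q_len, d_len = avg_words(queries), avg_words(documents)
    longer, shorter = max(q_len, d_len), min(q_len, d_len)
    return (longer / shorter) <= max_ratio

# Quora-style: queries are question titles, documents are question
# texts, and both sides are short questions of similar length.
symmetric = looks_symmetric(
    ["How do I learn Python fast?"],
    ["What is the fastest way to learn Python?"],
)
```

A question-title query against a long passage would fail this check, which is the typical asymmetric BEIR setup the quote contrasts Quora with.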
The MTEB paper also mentions this w.r.t. Quora. I do think that Quora is a "Retrieval" task, where by "Retrieval" I mean that given some input, I'd like to find a fitting result from some corpus, which is still the case. Quora probably uses something like this in production: when a user enters a question, it retrieves similar questions in case one has already been answered. A QuoraSTS task would no longer represent this use case well, I think. Also, as the snippet mentions, Quora is not even purely symmetric, as the documents are question texts while the queries are the titles.
I don't think we have other symmetric retrieval tasks, but if we had, it could be worth having this in the metadata & maybe allow for separate leaderboards.
I think it's fine. It is a retrieval task.
You cannot prevent that organizations or people use subsets of BEIR/MTEB in marketing, without explaining what the dataset actually is.
thanks @tomaarsen for creating the issue!
In general I personally believe it's closer to an STS task, but I get your point @Muennighoff! I suggest we leave it as it is and see if the discussion comes back in the future.
Thanks @tomaarsen!
While I did somewhat spark this discussion, I kind of agree with everyone. It's not a super representative retrieval task, but I think it's valid to interpret it as both STS and retrieval.
As part of a varied benchmark, it doesn't bother me, and I even think it has great value there. Sadly, like @jobergum said, there's no way to stop anyone from cherry-picking results for marketing purposes, so it shouldn't impact benchmark design decisions.
I think we're all mostly in agreement that, although it's perhaps not ideal, we should keep the dataset as-is. I also think there's not much benefit in a separate leaderboard for symmetric retrieval. For MTEB lite we can certainly remove it, though. It's one of the less interesting retrieval tasks, after all.
I'm happy to close this now.
Thanks for bringing up the discussion @tomaarsen, I will remove it from the MTEB lite pool of tasks to select from. Will close this issue for now, but do feel free to add any additional comments you might have.
Hello!
Although the Quora dataset (quora, quora-qrels) consists of queries like you'd find in retrieval, it really doesn't correspond to a normal query-passage retrieval task. It corresponds much more closely to e.g. STS, although the dataset format doesn't match.
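To make the format mismatch concrete, here is a hypothetical sketch of the two layouts: a BEIR-style retrieval setup (corpus, queries, and qrels mapping query ids to relevant document ids) versus STS-style scored sentence pairs. The ids, texts, and helper name below are illustrative assumptions, not the contents of the actual quora/quora-qrels files:

```python
# Hypothetical sketch: the same Quora relationship expressed in a
# BEIR-style retrieval layout vs an STS-style layout. All ids and
# texts are made up for illustration.

corpus = {
    "d1": "What is the best way to learn to program?",
    "d2": "How can I learn programming effectively?",
}
queries = {"q1": "Best way to learn programming?"}
qrels = {"q1": {"d2": 1}}  # retrieval: query id -> {relevant doc id: relevance}

def qrels_to_sts_pairs(queries, corpus, qrels, positive_score=1.0):
    """Flatten retrieval-style qrels into STS-style
    (sentence, sentence, score) tuples."""
    pairs = []
    for qid, rels in qrels.items():
        for did, rel in rels.items():
            score = positive_score if rel else 0.0
            pairs.append((queries[qid], corpus[did], score))
    return pairs

pairs = qrels_to_sts_pairs(queries, corpus, qrels)
```

The point being: the data can be flattened into pairs, but the shipped format is the retrieval one, so evaluating it as STS would require a conversion like this rather than using the dataset as published.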
It's rather hard to argue for moving one of the BEIR datasets out of the MTEB Retrieval tab (as right now we can nicely say that MTEB Retrieval == BEIR), but it's perhaps something to be mindful of.
See also a discussion here after the Quora dataset was used as a "retrieval benchmark" & its place in MTEB: https://x.com/jobergum/status/1809157587612336402
cc @jobergum @bclavie @bwanglzu