embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Confusion re: Retrieval w/Instructions #1013

Open Muennighoff opened 3 months ago

Muennighoff commented 3 months ago

We're adding RAR-b with & without instructions as two leaderboard tabs under Retrieval with @gowitheflow-1998. Naming-wise it is confusing to have these alongside the Retrieval w/ Instructions tab. In general, that tab name is a bit confusing as many models use instructions for retrieval tasks in the regular retrieval tab. I don't have a good idea yet for how to make it better, but if someone does, let us know! 😁

KennethEnevoldsen commented 3 months ago

Will just add @orionw here as well.

I think I would call it Instruction Retrieval instead of Retrieval with Instructions – "retrieval with instructions" sounds like just retrieval with a (general) prompt.

orionw commented 3 months ago

I think I would call it Instruction Retrieval instead of Retrieval with Instructions – "retrieval with instructions" sounds like just retrieval with a (general) prompt.

I didn't choose this name because there is already a category of retrieval tasks where you retrieve examples/instructions for your prompts -- I was worried people would assume it was literally retrieving instructions for your prompt to GPT, not using instructions to retrieve.

If I'm wrong and people don't associate instruction retrieval with retrieving instructions we can take that name over, but that was my concern.

In general, that tab name is a bit confusing as many models use instructions for retrieval tasks in the regular retrieval tab.

I think this is a general confusion as the field is changing and I think the lines are not well-defined :)

I think the main difference IMO between these two types of tasks is the importance of the instructions to the task. Most models from the last six months use instructions, mostly model-creator written (E5, etc.), but the instructions are either dataset-level or apply to all retrieval tasks (for some models like BGE) and are thus vague and mostly just given to the retrieval model as an extra boost in performance. For example, you don't need instructions for SciFact to do well on SciFact -- it's just extra information you're giving to the model in the hopes it helps.
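The dataset-level pattern described here can be sketched as follows -- a minimal illustration, not mteb's actual API (the instruction strings and the `build_query_text` helper are hypothetical):

```python
# Sketch: dataset-level instructions (E5/BGE style). One instruction is
# shared by every query in a dataset and simply prepended before encoding.
# The task still works without it; it's just an extra hint to the model.

DATASET_INSTRUCTIONS = {
    # Vague, dataset-level hints -- illustrative wording, not real prompts.
    "SciFact": "Retrieve scientific papers that support or refute the claim.",
    "NFCorpus": "Retrieve medical documents relevant to the query.",
}

def build_query_text(dataset: str, query: str) -> str:
    """Prepend the (optional) dataset-level instruction to a query."""
    instruction = DATASET_INSTRUCTIONS.get(dataset, "")
    return f"{instruction} {query}".strip() if instruction else query
```

Every SciFact query would get the same prefix here, which is why dropping it only costs a bit of performance rather than breaking the task.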

FollowIR, InstructIR, RAR-b and others have instructions that are crucial for the model. If you take the instructions away, the task pretty much falls apart: for FollowIR/InstructIR the instructions literally define document relevance. For RAR-b/BIRCO -- given the datasets involved -- the query-document relationship is very different from what standard models are trained on, so it's nearly impossible to expect the models to handle that mapping without instructions.

I don't have a great alternative name offhand; it seems kinda clunky to name it something like Retrieval w/ Crucial Instructions or Retrieval that Needs Instructions.

Muennighoff commented 3 months ago

Great explanation!

FollowIR, InstructIR, RAR-b and others have instructions that are crucial for the model. If you take the instructions away, the task pretty much falls apart: for FollowIR/InstructIR the instructions literally define document relevance. For RAR-b/BIRCO -- given the datasets involved -- the query-document relationship is very different from what standard models are trained on, so it's nearly impossible to expect the models to handle that mapping without instructions.

Agree - hence it's maybe a bit confusing that RAR-b ends up under the regular Retrieval tab while FollowIR ends up under the Retrieval w/ Instructions tab 🤔 But we can just leave this open for now & see if someone has a better idea. Maybe once we revamp the leaderboard for the filtering changes & co., this will get solved together with that.

orionw commented 3 months ago

Agree - hence it's maybe a bit confusing that RAR-b ends up under the regular Retrieval tab while FollowIR ends up under the Retrieval w/ Instructions tab 🤔

Yeah, this stemmed from the fact that FollowIR and InstructIR have query-specific instructions (i.e. not just one instruction for the full dataset -- each instruction is only usable with its given query), so the class setup needed additional data for the instance-level, query-specific instructions. Hence the new AbstractTask. RAR-b uses dataset-level annotations and thus fits in the existing retrieval AbstractTask.
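The structural difference could be sketched like this (class and field names are hypothetical and only illustrate why query-specific instructions need an extra per-instance field; mteb's real abstractions differ):

```python
from dataclasses import dataclass

@dataclass
class DatasetLevelRetrievalTask:
    # RAR-b style: one instruction for the whole dataset, so it fits the
    # existing retrieval schema with a single extra string.
    instruction: str
    queries: dict[str, str]            # query_id -> query text
    corpus: dict[str, str]             # doc_id -> document text
    qrels: dict[str, dict[str, int]]   # query_id -> {doc_id: relevance}

@dataclass
class QueryLevelInstructionTask:
    # FollowIR/InstructIR style: each query carries its own instruction,
    # and that instruction defines relevance -- hence per-instance data.
    queries: dict[str, str]
    instructions: dict[str, str]       # query_id -> its own instruction
    corpus: dict[str, str]
    qrels: dict[str, dict[str, int]]
```

Because the second shape has a field the first lacks, the two ended up as separate abstract classes, which is what currently prevents grouping them in one leaderboard tab.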

Maybe once we revamp the leaderboard for the filtering changes & co, this will get solved with that together.

I think this is the main issue and is mostly a leaderboard thing. We can't group tasks together across abstract classes currently. Once we fix the leaderboard we should combine them.

KennethEnevoldsen commented 3 months ago

I think this is the main issue and is mostly a leaderboard thing. We can't group tasks together across abstract classes currently. Once we fix the leaderboard we should combine them.

Yeah, hopefully the new leaderboard format will make this clearer.

gowitheflow-1998 commented 3 months ago

FollowIR, InstructIR, RAR-b and others have instructions that are crucial for the model. If you take the instructions away, the task pretty much falls apart: for FollowIR/InstructIR the instructions literally define document relevance. For RAR-b/BIRCO -- given the datasets involved -- the query-document relationship is very different from what standard models are trained on, so it's nearly impossible to expect the models to handle that mapping without instructions.

Very much agree with classifying whether something should be a retrieval w/ instructions task by looking at how crucial the instructions are to the task - I made a similar argument throughout the RAR-b paper. Great explanation @orionw! A similar idea has been helping us conceptualize the latest multimodal retrieval tasks, in terms of task difficulty and how much an image-text retrieval task relies on instructions to actually make sense. This is relevant to the vision project as well. @Muennighoff @KennethEnevoldsen

It makes sense to group by AbsTask for now (I'm happy with RAR-b falling under either) until someone has better ideas!