embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Integrating FollowIR #321

Closed Muennighoff closed 5 months ago

Muennighoff commented 6 months ago

Moving our discussion on integrating the cool FollowIR work here @orionw so others can chime in if they have thoughts :) Especially cc'ing @KennethEnevoldsen, who knows MTEB internals better than me at this point!

@orionw I'll also add a PR for the instructions abstract task of MTEB - part of which involves adding rerankers to MTEB, which could warrant a more technical discussion about design (but perhaps best done when the PR is there so it's easy to reference).

@Muennighoff Had a quick look and it seems to me that AbsTaskInstructionRetrieval.py may be better handled on the user side, similar to what's done in gritlm, where instructions are defined here https://github.com/ContextualAI/gritlm/blob/main/evaluation/eval_mteb.py#L10 & simply added to the text in the encode function https://github.com/ContextualAI/gritlm/blob/28d932b8703576736d6dd64d7cc7130895b2c49f/evaluation/eval_mteb.py#L1199 / https://github.com/ContextualAI/gritlm/blob/28d932b8703576736d6dd64d7cc7130895b2c49f/gritlm/gritlm.py#L117
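For context, the model-side pattern in gritlm looks roughly like this (a minimal sketch, not the actual gritlm code; the mapping and wrapper names are made up):

from sentence_transformers import SentenceTransformer

# hypothetical dataset-level instruction lookup: one fixed prefix per task
TASK_INSTRUCTIONS = {
    "ArguAna": "Given a claim, find documents that refute the claim.",
}

class PrefixedEncoder:
    def __init__(self, model_name: str, task_name: str):
        self.model = SentenceTransformer(model_name)
        self.instruction = TASK_INSTRUCTIONS.get(task_name, "")

    def encode(self, sentences: list[str], **kwargs):
        # the instruction is a per-task constant prepended to every input
        sentences = [f"{self.instruction} {s}".strip() for s in sentences]
        return self.model.encode(sentences, **kwargs)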

But I may be missing something, feel free to either open a PR / issue / discussion for more technical discussion, whichever you prefer!

@orionw: The prefixes (which E5 and others refer to as "instructions") should definitely be in the model section, like you're suggesting. However, the instructions in FollowIR are query-specific and so they come from the data - it's comparable to instructions for normal LLMs, where they are user-given and specific to the instance, as opposed to the E5-like instructions which are dataset-level model prefixes applied to all inputs of that task/dataset. Because of that, they have to come from the data side (the abstract task) rather than from the model side.
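To make the difference concrete, a FollowIR-style instance carries its own instruction alongside the query (the field names below are only illustrative, not the actual dataset schema):

queries = {
    "q1": {
        "text": "impact of remote work on productivity",
        "instruction": "Only studies with quantitative results published after 2020 are relevant.",
    },
    "q2": {
        "text": "impact of remote work on productivity",
        "instruction": "Exclude fully remote companies; focus on hybrid setups.",
    },
}
# same query text, different instructions -> different relevance judgments,
# so the instruction has to travel with the data rather than sit on the model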

Two overarching questions actually, before I make the PR:

  1. Given that the FollowIR datasets use a different evaluation metric (p-MRR), require new data features (an instruction alongside each query), and require evaluating over the docs twice to compare the original instruction against the changed one, I made this a separate abstract task (allowing for these and also implementing caching and reranking). But that might interfere with the leaderboard, as it would add a new column to MTEB that most models haven't evaluated on. What are your thoughts on that? If we don't want to add a new abstract task, I'm fine not making a PR and leaving the FollowIR evaluation in a separate repo.
  2. I added an option for rerankers instead of DRESModel for this new abstract task. However, given computational constraints, we can't run a reranker over the full BEIR datasets, so we'd either need to keep rerankers limited to specific abstract tasks in MTEB or change MTEB to allow rerankers to rerank first-stage model results for some/all abstract tasks (a rough sketch of the interface I have in mind is below). Does that make sense? I lean towards rerankers taking first-stage results as input, but that would require some significant alterations, which I'm happy to add but wanted to discuss first.
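Roughly, the interface I have in mind (names are placeholders, not the PR's actual API): a reranker scores (query, document) pairs over a candidate set instead of embedding the whole corpus.

from typing import Protocol

class RerankerModel(Protocol):
    def predict(self, pairs: list[tuple[str, str]], batch_size: int = 32) -> list[float]:
        """Return one relevance score per (query, document) pair."""
        ...

def rerank(model: RerankerModel, query: str, candidates: dict[str, str], top_k: int = 100) -> dict[str, float]:
    # score only the candidate documents (e.g. a first-stage top-k), not the full corpus
    doc_ids = list(candidates)[:top_k]
    scores = model.predict([(query, candidates[doc_id]) for doc_id in doc_ids])
    return dict(zip(doc_ids, scores))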

@Muennighoff Yes you're right, sorry! And we also can't just put them in the dataset and reuse everything as is, since some models treat instructions differently, e.g. they don't include the instruction in the embedding. In that case, your implementation makes the most sense. While it'd be great to add instructions in a general way that extends to any embedding task without needing AbsTaskInstructionClustering, AbsTaskInstructionClassification, etc., I think the highest priority for MTEB should be making it as easy to use & extend as possible. So, similar to how models are duplicated in hf/transformers, it's fine to duplicate code for ease of use/extension!

  1. I think those are all okay! I would put the metric here; the data features make sense too and make a separate AbsTask like you have it reasonable; the double evaluation is fine as well, and in that case a separate evaluator like you have it makes a lot of sense. If you want, we can also add a new leaderboard tab for FollowIR! It would likely go under the Retrieval section: https://huggingface.co/spaces/mteb/leaderboard, where we also just added "Law". I think this will give your great work more visibility, as the leaderboard is viewed >100K times per month.
  2. That makes sense & I agree that rerankers should be able to do first-stage reranking. We also did a hacky addition of Reranking to MTEB for GritLM here - it's definitely something we should support as a first-class citizen but in a clean way!
  3. Another thought I had: it's striking how all "embedding" models fail at following instructions in Table 3 of FollowIR. I think once you run GritLM with its generative capabilities, it will have a p-MRR similar to the other Instruct LMs at ~12, while its embedding p-MRR is still close to zero at 1.1. That would show that even though it's the same model, it is much worse at instruction-following when embedding than when generating! This would further support your point that embedding instruction models really just rely on basic keywords & don't really "understand" their instruction. It reminds me of this issue & also how prompts in Instructor appear to be slightly tuned in weird ways. I think it'd be an interesting project to train an embedding IR model to be better at instruction-following by explicitly including such difficult/nuanced samples, similar to your process of training FollowIR, which is just a generative model afaict. Could also do both in one, i.e. a GRIT FollowIR model.
KennethEnevoldsen commented 6 months ago

Thanks for adding me to this. I haven't looked much at the reranking code, but generally, we have discussed allowing task-dependent encode functions using additional arguments (some more discussion on this in #215). Something like:

from typing import Literal

def encode(
    sentences: list[str],
    encode_type: Literal["queries", "corpus"] | None,  # currently formalized in encode_corpus and encode_queries
    language: list[str] | None,  # e.g. required for bitext mining. None can e.g. handle cases where the language is unknown (e.g. lang clf)
    task: str,  # or some kind of metadata object - allows embedding differently depending on clustering, retrieval etc., and also allows prompt-based encoding
):
    ...
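For illustration, a task could then call it along these lines (just a sketch; the argument names are still up for discussion):

embeddings = model.encode(
    sentences=["What is the capital of France?"],
    encode_type="queries",
    language=["eng"],
    task="Retrieval",  # or a richer metadata object carrying e.g. a prompt
)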

Generally, this addition seems to require multiple PRs, so I would make sure to keep them small, adding one aspect at a time.

orionw commented 6 months ago

Thanks @KennethEnevoldsen and @Muennighoff!

If you want, we can also add a new leaderboard tab for FollowIR!

I was planning to make a leaderboard, so integrating into MTEB works great for me, less maintenance work. Do you have any pointers on how to do that - does the legal one have a PR somewhere?

That makes sense & I agree that rerankers should be able to do first-stage reranking. We also did a hacky addition of Reranking to MTEB for GritLM here - it's definitely something we should support as a first-class citizen but in a clean way!

Do you have suggestions for the best way to implement this (both @KennethEnevoldsen and @Muennighoff)? I made it an alternative to DRESModel; does that seem reasonable to you, or did you have another suggestion? It seems like we just need one encode function, so I suppose we might actually be able to just re-use the existing class. We'd just need some flag to tell the encode function which portions to rerank.

Re: which results to rerank, I'm thinking we either allow that as an option to be provided and/or make the default a BM25 run on them (we could host a BM25 top-ranking dataset on Huggingface, like the InPars people did with BEIR). We can either make this an option to the encode function or require them to use the BM25 ones and not allow an option.
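For example, loading such a hosted run could look roughly like this (the dataset name and column layout are hypothetical, just to show the shape of the data):

from datasets import load_dataset

# hypothetical repo/config for a pre-computed BM25 top-k run
run = load_dataset("mteb/bm25-top100-runs", "scifact", split="test")

first_stage: dict[str, dict[str, float]] = {}
for row in run:  # e.g. rows like {"query_id": ..., "doc_id": ..., "score": ...}
    first_stage.setdefault(row["query_id"], {})[row["doc_id"]] = row["score"]
# first_stage can then be handed to the evaluator so the reranker only scores these candidates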

It reminds me of this https://github.com/ContextualAI/gritlm/issues/22 & also how prompts in Instructor appear to be slightly tuned in weird ways. I think it'd be an interesting project to train an embedding IR model to be better at instruction-following by explicitly including such difficult/nuanced samples similar to your process of training FollowIR which is just a generative model afaict. Could also do both in one, i.e. a GRIT FollowIR model.

These are interesting issues, thanks for pointing them out! I agree they support the "bi-encoder models do matching only" hypothesis!

And definitely agree about making a bi-encoder model that can follow instructions, that was at the top of my list of next projects! I had some thoughts on both the data and the architectures that could work (definitely inspired by GritLM). Maybe that's something for a separate thread.

Muennighoff commented 6 months ago

I was planning to make a leaderboard, so integrating into MTEB works great for me, less maintenance work. Do you have any pointers on how to do that - does the legal one have a PR somewhere?

Yes, this was the leaderboard PR https://huggingface.co/spaces/mteb/leaderboard/discussions/90, and this was the code PR https://github.com/embeddings-benchmark/mteb/pull/311. We also need a results PR, e.g. like this one: https://huggingface.co/datasets/mteb/results/discussions/27/files. Happy to help, e.g. I can do the LB one if you find it too complex.

Do you have suggestions for the best way to implement this (both @KennethEnevoldsen and @Muennighoff)? I made it an alternative to DRESModel; does that seem reasonable to you, or did you have another suggestion? It seems like we just need one encode function, so I suppose we might actually be able to just re-use the existing class. We'd just need some flag to tell the encode function which portions to rerank.

Sounds reasonable to me but curious what @KennethEnevoldsen thinks, too

Re: which results to rerank, I'm thinking we either allow that as an option to be provided and/or make the default a BM25 run on them (we could host a BM25 top-ranking dataset on Huggingface, like the InPars people did with BEIR). We can either make this an option to the encode function or require them to use the BM25 ones and not allow an option.

I'd allow it as an option. The snippet here https://github.com/embeddings-benchmark/mteb/blob/5bd11fb1ee6c92931bef932692f3c61692ca21aa/mteb/abstasks/AbsTaskRetrieval.py#L282 already allows users to save the full results up to an optionally specified top_k. I think we could just allow feeding these saved results back in for reranking 🧐 If we want, we can also host the BM25 ones later, but that's optional for a v1 I think. Let me know if you disagree / if I'm missing sth!
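Something like this (a sketch, assuming the qid -> {doc_id: score} layout that the linked snippet saves; the path is illustrative):

import json

with open("results/stage1/SciFact_predictions.json") as f:  # hypothetical path to a saved first-stage run
    saved = json.load(f)

top_k = 100
previous_results = {
    qid: dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
    for qid, scores in saved.items()
}
# previous_results would then be passed to the reranking evaluator as the candidate pool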

These are interesting issues, thanks for pointing them out! I agree they support the "bi-encoder models do matching only" hypothesis! And definitely agree about making a bi-encoder model that can follow instructions, that was at the top of my list of next projects! I had some thoughts on both the data and the architectures that could work (definitely inspired by GritLM). Maybe that's something for a separate thread.

Cool! Happy to help if I can be useful - maybe I can help with compute if you plan to scale up, as a bi-encoder may require more compute to train.

KennethEnevoldsen commented 6 months ago

Sounds reasonable to me but curious what @KennethEnevoldsen thinks, too

Without knowing the exact implementation, I would want to keep the model interface as clean and minimal as possible.

Atm. I believe we only require models to have an encode method and optionally allow encode_queries or encode_corpus. Instead of replacing DRESModel, it might be better to consider an extension of the model, if that makes sense: e.g. either we use method X if a model has it, or we pass an argument to one of the related functions, which the model can use if it wishes (see the outline above).
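For example, the "method X" route could be a simple duck-typing check at evaluation time (the method names are illustrative, not a settled API):

def score_candidates(model, query: str, candidates: dict[str, str]) -> dict[str, float]:
    if hasattr(model, "predict"):  # cross-encoder / reranker path
        doc_ids = list(candidates)
        scores = model.predict([(query, candidates[doc_id]) for doc_id in doc_ids])
        return dict(zip(doc_ids, scores))
    # default bi-encoder path: fall back to encode (or encode_queries / encode_corpus)
    # assumes encode returns numpy arrays, as e.g. sentence-transformers does
    query_emb = model.encode([query])
    doc_embs = model.encode(list(candidates.values()))
    sims = (query_emb @ doc_embs.T)[0]
    return dict(zip(candidates, sims.tolist()))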

orionw commented 6 months ago

Thanks so much for the links @Muennighoff! Will add a PR for the leaderboard and ask questions about it in a separate thread.

I think we could just allow feeding these saved results back in for reranking 🧐 If we want, we can also host the BM25 ones later, but that's optional for a v1 I think. Let me know if you disagree / if I'm missing sth!

Works for me! The only downside I see is that BM25 is not currently implemented in MTEB and it's a common first-stage model, but that's a fixable problem :)

Atm. I believe we only require models to have an encode method and optionally allow encode_queries or encode_corpus. Instead of replacing DRESModel, it might be better to consider an extension of the model, if that makes sense: e.g. either we use method X if a model has it, or we pass an argument to one of the related functions, which the model can use if it wishes (see the outline above).

I like that approach, thanks @KennethEnevoldsen! I'll start with a PR on FollowIR, then do a separate one for rerankers.

orionw commented 6 months ago

Thanks for both of your help! It's almost all integrated now!

Two questions about the last step, the leaderboard:

  1. Is there a GitHub repo or some other way to make a PR for the leaderboard? I'm having trouble finding anything other than the HF Space files, and I'm not sure whether those let you open a PR. I'd like to add the tab for instruction following.
  2. Is there a way to get a link to that specific tab? It looks like Gradio enabled it for some components, but I'm not sure if it's already implemented in MTEB.
Muennighoff commented 6 months ago
  1. These are the steps:
     a) Open a PR in https://huggingface.co/datasets/mteb/results with:
        a.1) All results added
        a.2) paths.json updated (see the snippet in results.py)
        a.3) If there are new models, their names added to results.py
     b) Open a PR at https://huggingface.co/spaces/mteb/leaderboard modifying app.py to add your tab:
        b.1) Add new models & their specs to the global lists
        b.2) Add your tab, credits etc. where the other tabs are defined
        b.3) If you're adding new results to existing models, remove those models from EXTERNAL_RESULTS...json so they are reloaded with the new results instead of being cached
        b.4) You may also have to activate force_redownload at the point in the code where it is commented out to avoid caches
        b.5) Test that it runs & works locally as you desire with python app.py, and add screenshots to the PR

@KennethEnevoldsen what do you think about adding these instructions to MTEB? Maybe something like adding_a_leaderboard_tab.md

  2. Yes, e.g. https://huggingface.co/spaces/mteb/leaderboard?task=retrieval&language=law
KennethEnevoldsen commented 6 months ago

Sounds great @Muennighoff, I would just add a link to the readme as well (similar to adding a dataset)

Muennighoff commented 5 months ago

Closing this one as done via https://github.com/embeddings-benchmark/mteb/pull/408 https://github.com/embeddings-benchmark/mteb/pull/555 https://huggingface.co/spaces/mteb/leaderboard/discussions/102

Feel free to reopen @orionw if this is premature