Thanks for adding me to this. I haven't looked much at the reranking code, but generally, we have discussed allowing task-dependent encode functions using additional arguments (some more discussion on this in #215). Something like:
from typing import Literal

def encode(
    sentences: list[str],
    encode_type: Literal["queries", "corpus"] | None,  # currently formalized in encode_corpus and encode_queries
    language: list[str] | None,  # e.g. required for bitext mining; None can e.g. cover cases where the language is unknown (e.g. lang clf)
    task: str,  # or some kind of metadata object - this allows e.g. embedding differently depending on clustering, retrieval etc. It also allows for prompt-based encoding.
): ...
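For concreteness, a call under this proposed signature could look something like the following (purely illustrative, nothing here is implemented yet):

```python
embeddings = model.encode(
    ["what is dense retrieval?"],
    encode_type="queries",
    language=["eng"],
    task="Retrieval",  # lets the model pick e.g. a task-specific prompt
)
```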
Generally, this addition seems to require multiple PRs, so I would make sure to keep them small, adding one aspect at a time.
Thanks @KennethEnevoldsen and @Muennighoff!
> If you want, we can also add a new leaderboard tab for FollowIR!
I was planning to make a leaderboard, so integrating into MTEB works great for me, less maintenance work. Do you have any pointers on how to do that - does the legal one have a PR somewhere?
> That makes sense & I agree that rerankers should be able to do first-stage reranking. We also did a hacky addition of Reranking to MTEB for GritLM here - it's definitely something we should support as a first-class citizen but in a clean way!
Do you have suggestions for the best way to implement this (both @KennethEnevoldsen and @Muennighoff)? I made it an alternative to DRESModel, does that seem reasonable to you or did you have another suggestion? It seems like we just need one encode function, so I suppose we might be able to just re-use the existing class actually. We'd just need some flag to give the encode function which portions to rerank.
Re: which results to rerank, I'm thinking we either allow that as an option to be provided and/or make the default a BM25 run on them (we could host a BM25 top-ranking dataset on Huggingface, like the InPars people did with BEIR). We can either make this an option to the encode function or require them to use the BM25 ones and not allow an option.
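As a rough sketch of that option (all names here are hypothetical, not MTEB's actual API), the evaluator could hand the model the first-stage candidates and have it score only those pairs:

```python
import numpy as np

def rerank_candidates(
    model,
    queries: dict[str, str],           # query_id -> text
    corpus: dict[str, str],            # doc_id -> text
    candidates: dict[str, list[str]],  # query_id -> doc_ids from e.g. BM25
) -> dict[str, dict[str, float]]:
    scores: dict[str, dict[str, float]] = {}
    for qid, doc_ids in candidates.items():
        q_emb = np.asarray(model.encode([queries[qid]]))[0]
        d_embs = np.asarray(model.encode([corpus[d] for d in doc_ids]))
        # Dot-product scoring; equals cosine if embeddings are normalized.
        scores[qid] = {d: float(q_emb @ d_embs[i]) for i, d in enumerate(doc_ids)}
    return scores
```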
> It reminds me of this https://github.com/ContextualAI/gritlm/issues/22 & also how prompts in Instructor appear to be slightly tuned in weird ways. I think it'd be an interesting project to train an embedding IR model to be better at instruction-following by explicitly including such difficult/nuanced samples similar to your process of training FollowIR which is just a generative model afaict. Could also do both in one, i.e. a GRIT FollowIR model.
These are interesting issues, thanks for pointing them out! I agree they support the "bi-encoder models do matching only" hypothesis!
And definitely agree about making a bi-encoder model that can follow instructions - that was one of my top next todo projects! I had some thoughts on both data and architectures that could work (definitely inspired by GritLM). Maybe a separate thread where we can talk about it.
> I was planning to make a leaderboard, so integrating into MTEB works great for me, less maintenance work. Do you have any pointers on how to do that - does the legal one have a PR somewhere?
Yes, this was the leaderboard PR https://huggingface.co/spaces/mteb/leaderboard/discussions/90, and this is the code PR https://github.com/embeddings-benchmark/mteb/pull/311. We also need a results PR, e.g. like this https://huggingface.co/datasets/mteb/results/discussions/27/files. Happy to help, e.g. I can do the LB one if you find it too complex.
> Do you have suggestions for the best way to implement this (both @KennethEnevoldsen and @Muennighoff)? I made it an alternative to DRESModel, does that seem reasonable to you or did you have another suggestion? It seems like we just need one encode function, so I suppose we might be able to just re-use the existing class actually. We'd just need some flag to give the encode function which portions to rerank.
Sounds reasonable to me but curious what @KennethEnevoldsen thinks, too
> Re: which results to rerank, I'm thinking we either allow that as an option to be provided and/or make the default a BM25 run on them (we could host a BM25 top-ranking dataset on Huggingface, like the InPars people did with BEIR). We can either make this an option to the encode function or require them to use the BM25 ones and not allow an option.
I'd allow it as an option. The whole snippet here https://github.com/embeddings-benchmark/mteb/blob/5bd11fb1ee6c92931bef932692f3c61692ca21aa/mteb/abstasks/AbsTaskRetrieval.py#L282 already allows users to save the full results up to an optionally specified top_k. I think we could just allow feeding these saved results back in for reranking 🧐 If we want, we can also later host the BM25 ones, but that's optional for a v1 I think. Let me know if you disagree / if I'm missing sth!
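A minimal sketch of that feed-back step, assuming the saved run maps query_id -> {doc_id: score} (the exact on-disk format is an assumption, not MTEB's documented one):

```python
import json

def load_first_stage(path: str, top_k: int = 100) -> dict[str, list[str]]:
    with open(path) as f:
        run = json.load(f)
    # Keep only the top_k highest-scoring doc_ids per query for reranking.
    return {
        qid: [d for d, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]
        for qid, scores in run.items()
    }
```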
> These are interesting issues, thanks for pointing them out! I agree they support the "bi-encoder models do matching only" hypothesis! And definitely agree about making a bi-encoder model that can follow instructions - that was one of my top next todo projects! I had some thoughts on both data and architectures that could work (definitely inspired by GritLM). Maybe a separate thread where we can talk about it.
Cool! Happy to help if I can be useful - maybe I can help with compute if you plan to scale up, as a bi-encoder may require more compute to train.
> Sounds reasonable to me but curious what @KennethEnevoldsen thinks, too
Without knowing the exact implementation, we would want to keep our models as clear as possible.
Atm. I believe we only require models to have an encode method and optionally allow for encode_queries or encode_corpus. Instead of replacing DRESModel, it might be better to consider an extension to the model if that makes sense: e.g. if a model has method X we will use that instead, or we pass an argument to one of the related functions which the model can use if it wishes (see outline above).
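A minimal sketch of that dispatch, with hook names that are illustrative rather than MTEB's actual internals:

```python
def get_query_embeddings(model, queries: list[str], **kwargs):
    # Prefer a dedicated method if the model provides one ...
    if hasattr(model, "encode_queries"):
        return model.encode_queries(queries, **kwargs)
    # ... otherwise fall back to the generic encode, passing metadata
    # the model is free to ignore.
    return model.encode(queries, encode_type="queries", **kwargs)
```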
Thanks so much for the links @Muennighoff! Will add a PR for the leaderboard and ask questions about it in a separate thread.
> I think we could just allow feeding these saved results back in for reranking 🧐 If we want, we can also later host the BM25 ones, but that's optional for a v1 I think. Let me know if you disagree / if I'm missing sth!
Works for me! The only downside I see is that BM25 is not currently implemented in MTEB and it's a common first-stage model, but that's a fixable problem :)
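For reference, one possible stopgap is the rank_bm25 package (not part of MTEB, shown purely to illustrate a BM25 first stage):

```python
from rank_bm25 import BM25Okapi

corpus = {
    "d1": "BM25 is a classic lexical ranking function.",
    "d2": "Dense retrievers embed queries and documents.",
}
doc_ids = list(corpus)
bm25 = BM25Okapi([corpus[d].lower().split() for d in doc_ids])

scores = bm25.get_scores("lexical ranking with bm25".split())
# The highest-scoring doc ids become the candidates handed to the reranker.
candidates = [d for d, _ in sorted(zip(doc_ids, scores), key=lambda x: -x[1])]
```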
> Atm. I believe we only require models to have an encode method and optionally allow for encode_queries or encode_corpus. Instead of replacing DRESModel, it might be better to consider an extension to the model if that makes sense: e.g. if a model has method X we will use that instead, or we pass an argument to one of the related functions which the model can use if it wishes (see outline above).
I like that approach, thanks @KennethEnevoldsen! I'll start with a PR on FollowIR, then do a separate one for rerankers.
Thanks to both of you for your help! It's almost all integrated now!
Two questions about the last step, the leaderboard:
EXTERNAL_RESULTS...json such that they can be reloaded with the new results and are not cached.
b.4) You may also have to activate force_redownload at the point in the code where it is commented out to avoid caches
b.5) Test that it runs & works locally as you desire with python app.py
- Add screenshots to the PR

@KennethEnevoldsen what do you think about adding these instructions to MTEB? Maybe something like adding_a_leaderboard_tab.md
Sounds great @Muennighoff, I would just add a link to the readme as well (similar to adding a dataset).
Closing this one as done via https://github.com/embeddings-benchmark/mteb/pull/408 https://github.com/embeddings-benchmark/mteb/pull/555 https://huggingface.co/spaces/mteb/leaderboard/discussions/102
Feel free to reopen @orionw if this is premature
Moving our discussion on integrating the cool FollowIR work here @orionw so others can chime in if they have thoughts :) Especially cc'ing @KennethEnevoldsen who knows MTEB internals better than me at this point!
@orionw: I'll also add a PR for the instructions abstract task of MTEB - part of which involves adding rerankers to MTEB, which could warrant a more technical discussion about design (but perhaps best done when the PR is there so it's easy to reference).
@Muennighoff: Had a quick look, and it seems to me that AbsTaskInstructionRetrieval.py may be better handled on the user side, similar to what's done in gritlm, where instructions are defined here https://github.com/ContextualAI/gritlm/blob/main/evaluation/eval_mteb.py#L10 & simply added to the text in the encode function https://github.com/ContextualAI/gritlm/blob/28d932b8703576736d6dd64d7cc7130895b2c49f/evaluation/eval_mteb.py#L1199 / https://github.com/ContextualAI/gritlm/blob/28d932b8703576736d6dd64d7cc7130895b2c49f/gritlm/gritlm.py#L117
But I may be missing something, feel free to either open a PR / issue / discussion for more technical discussion, whichever you prefer!
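For reference, a rough sketch of that user-side approach: a per-task instruction map prepended to the text inside encode (names and template are illustrative, not gritlm's or MTEB's actual code):

```python
TASK_INSTRUCTIONS = {
    "ArguAna": "Given a claim, find documents that refute the claim.",
}

class InstructionWrapper:
    def __init__(self, base_model, task_name: str):
        self.base_model = base_model
        self.instruction = TASK_INSTRUCTIONS.get(task_name, "")

    def encode(self, sentences: list[str], **kwargs):
        # Simply prepend the task-level instruction to every input text.
        texts = [f"{self.instruction} {s}".strip() for s in sentences]
        return self.base_model.encode(texts, **kwargs)
```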
@orionw: The prefixes (which E5 and others refer to as "instructions") should definitely be in the model section, like you're suggesting. However, the instructions in FollowIR are query-specific and so they come from the data - it's comparable to instructions for normal LLMs, where they are user-given and specific to the instance, as opposed to the E5-like instructions which are dataset-level model prefixes applied to all inputs of that task/dataset. Because of that, they have to come from the data side (the abstract task) rather than from the model side.
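To illustrate the distinction (field names hypothetical): a dataset-level prefix is identical for every query, whereas FollowIR instructions vary per instance and can even flip relevance for the same query, so they must travel with the data:

```python
e5_style_prefix = "query: "  # one prefix applied to every query in the dataset
followir_samples = [
    {"query": "remote work productivity",
     "instruction": "Relevant documents must report quantitative results."},
    {"query": "remote work productivity",
     "instruction": "Only opinion pieces by managers are relevant."},
]
e5_texts = [e5_style_prefix + s["query"] for s in followir_samples]
# Same query, different instructions -> different relevance judgments,
# so the abstract task (data side) has to supply the instruction text.
followir_texts = [f'{s["instruction"]} {s["query"]}' for s in followir_samples]
```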
Two overarching questions, actually, before I make the PR:
@Muennighoff: Yes you're right, sorry! And we also can't just put them in the dataset and reuse everything as is, since some models treat instructions differently, e.g. they don't use the embedding over the instruction. In that case, your implementation makes the most sense. While it'd be great to add instructions in a general way that extends to any embedding task without needing AbsTaskInstructionClustering, AbsTaskInstructionClassification, etc., I think that the highest priority for MTEB should be making it as easy to use & extend as possible. Thus, similar to how models are duplicated in hf/transformers, it's fine to duplicate code for ease of use/extension!