embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Add support for saving embeddings in evals #824

Closed · dhruvbpai closed this 3 weeks ago

dhruvbpai commented 4 months ago

@gmittal Currently, the save_predictions flag allows saving query similarity predictions to a JSON file. However, I would like a separate flag to save the computed embeddings as well, e.g. as a torch tensor, pickle, or JSON file. I believe this feature would allow for greater flexibility.
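
Roughly, what I have in mind (a sketch; save_embeddings is the hypothetical new flag, everything else follows the existing API, though the exact invocation may differ):

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["NFCorpus"])
evaluation = mteb.MTEB(tasks=tasks)

# today: saves query similarity predictions to JSON
evaluation.run(model, save_predictions=True, output_folder="results")

# proposed (hypothetical): additionally persist the computed embeddings
# evaluation.run(model, save_embeddings=True, output_folder="results")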

https://github.com/embeddings-benchmark/mteb/blob/main/mteb/abstasks/AbsTaskRetrieval.py#L303-L306

I can work on this issue myself if needed, but I wanted to verify that such a feature is within scope for mteb.

(My work focuses on retrieval, so I am somewhat less familiar with the AbsTask setup for other tasks, but from what I can tell they are similar enough that saving embeddings should be possible there as well.)

KennethEnevoldsen commented 4 months ago

@dhruvbpai a simple solution might be to wrap the encoder:

my_encoder = ...

class EncoderWrapper:
    def __init__(self, encoder):
        self.encoder = encoder

    def encode(self, sentences, **kwargs):
        # potentially load from disk here
        emb = self.encoder.encode(sentences, **kwargs)
        # save to disk here
        return emb

A setup like this is fully supported within MTEB.
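
For instance, wrapping a sentence-transformers model and passing the wrapper to an evaluation run might look like this (a sketch; the model and task names are just examples):

import mteb
from sentence_transformers import SentenceTransformer

model = EncoderWrapper(SentenceTransformer("all-MiniLM-L6-v2"))
evaluation = mteb.MTEB(tasks=mteb.get_tasks(tasks=["NFCorpus"]))
evaluation.run(model)

MTEB only requires that the object passed to run exposes an encode method, so the wrapper is free to persist the embeddings however it likes.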

dhruvbpai commented 4 months ago

@KennethEnevoldsen Thank you for your response. My aim is to save corpus and query embeddings separately, with the split and task in the naming convention. Unfortunately, the encoder doesn't get access to the name of the task it is being evaluated on, which I need in order to save the files correctly. Alternatively, a simple fix would be to make the task name available to the encoder somehow.
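
Concretely, the layout I have in mind is something like this (hypothetical naming scheme):

embeddings/
    NFCorpus_test_queries.pt
    NFCorpus_test_corpus.pt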

KennethEnevoldsen commented 4 months ago

This has actually been discussed as part of #216, which was pretty close to a merge but sadly never got finished.

I would suggest re-using the sentence-transformers prompt_name syntax (see the sentence-transformers encode args).

def encode(
    self, sentences: list[str], prompt_name: str | None = None, **kwargs: Any
):
    """
    ...
    prompt_name: Optional argument. MTEB will provide the task name during
        encode. This allows for task-specific prompts or other kinds of
        task-dependent encoding, e.g. encoding differently for clustering
        vs. retrieval.
    """
    ...
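
With that in place, an encoder can key its behaviour off the task name. For example (a sketch; the prompt strings and task names are made up for illustration, and self.model is assumed to be a sentence-transformers model):

TASK_PROMPTS = {
    "NFCorpus": "Represent this query for retrieving medical documents: ",
}

class PromptedEncoder:
    def __init__(self, model):
        self.model = model

    def encode(self, sentences, prompt_name=None, **kwargs):
        # look up a task-specific prefix; fall back to no prefix
        prefix = TASK_PROMPTS.get(prompt_name, "")
        return self.model.encode([prefix + s for s in sentences], **kwargs)

The same hook would cover the original request: prompt_name gives a wrapper a task-specific key for the files it writes to disk.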

I would be very happy to see a PR on this.

dhruvbpai commented 4 months ago

I'm working on this PR now, and it occurred to me that it may be better to pass the task metadata dict directly instead of a string, since this would maximize flexibility and reduce complexity. What are your thoughts @KennethEnevoldsen?

KennethEnevoldsen commented 4 months ago

Sorry for missing this @dhruvbpai

I would just use the task_name, and then you can fetch the task using:

task = mteb.get_task("name")
meta = task.metadata

This is to keep it consistent with sentence transformers.
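
For example (a sketch; the task name and printed values are illustrative):

import mteb

task = mteb.get_task("NFCorpus")
meta = task.metadata
print(meta.type)       # e.g. "Retrieval"
print(meta.languages)  # e.g. ["eng"]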

avidale commented 3 months ago

> ...as a part of https://github.com/embeddings-benchmark/mteb/pull/216 where it was pretty close to a merge but sadly never got finished.

Actually, I would be happy to revive #216, but I would need you, the MTEB maintainers, to agree on the interface before I start re-implementing it.

KennethEnevoldsen commented 3 months ago

Hi @avidale. We have actually added task-conditional encoding in #888, which allows for the encoding described above. This makes it possible to create prompts based on the task (e.g. for the instruct E5 models). However, you might just as well use it for something like:

def encode(self, sentences, prompt_name, **kwargs):
    task = mteb.get_task(prompt_name)
    langs = task.metadata.languages
    # encode the text based on the task's languages
    ...

The one problem here is multilingual tasks (e.g. dan, eng, fra), where a task can have multiple languages (at the moment the model can't know whether we are currently encoding eng, fra, or dan). We could still add this.
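
One purely hypothetical way to close that gap (this is not part of MTEB) would be an extra keyword argument carrying the languages of the current batch:

# hypothetical interface, not implemented in MTEB
def encode(self, sentences, prompt_name=None, languages=None, **kwargs):
    if languages is not None:
        ...  # e.g. pick a language-specific prompt or tokenizer
    return self.model.encode(sentences, **kwargs)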

KennethEnevoldsen commented 3 weeks ago

Will close this issue for now - feel free to reopen if required.