@dhruvbpai a simple solution might be to wrap the encoder:
```python
my_encoder = ...

class EncoderWrapper:
    def __init__(self, encoder):
        self.encoder = encoder

    def encode(self, sentences, **kwargs):
        # potentially load from disk here
        emb = self.encoder.encode(sentences, **kwargs)
        # save to disk here
        return emb
```
A setup like this is fully supported within MTEB.
@KennethEnevoldsen Thank you for your response. My aim was to save the corpus and query embeddings separately, with the split and task in the naming convention. Unfortunately, the encoder doesn't get access to the name of the task it is being evaluated on, which I need in order to save correctly. A simple fix would be to make the task name available to the encoder somehow.
This has actually been discussed as part of #216, which came pretty close to being merged but sadly never got finished.
I would suggest re-using the sentence-transformers prompt_name syntax (see sentence transformer encode args).
```python
def encode(
    self, sentences: list[str], prompt_name: str | None = None, **kwargs: Any
):
    """
    ...
    prompt_name: Optional argument. MTEB will provide the task name during
        encode. This allows for task-specific prompts or other kinds of
        task-dependent encoding, e.g. encoding differently for clustering
        vs. retrieval.
    """
    ...
```
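To make the proposed interface concrete, here is a minimal, self-contained sketch of how a model could consume the `prompt_name` argument to select a task-specific prompt. The task name and prompt string in the mapping are hypothetical examples, not values from MTEB, and the method returns the prefixed strings instead of real embeddings to keep the sketch runnable without a model:

```python
from typing import Any, Optional

# Hypothetical task-to-prompt mapping; both the task name and the prompt
# string are illustrative examples.
TASK_PROMPTS = {
    "SciFact": "Represent this sentence for retrieving scientific claims: ",
}

class PromptedModel:
    def encode(
        self, sentences: list[str], prompt_name: Optional[str] = None, **kwargs: Any
    ) -> list[str]:
        # Fall back to an empty prompt when no task-specific prompt exists.
        prompt = TASK_PROMPTS.get(prompt_name, "")
        prefixed = [prompt + s for s in sentences]
        # A real model would now embed `prefixed`; returning the strings
        # here keeps the sketch self-contained.
        return prefixed
```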
I would be very happy to see a PR on this.
I'm working on this PR now and it occurred to me it may be better to pass directly the task metadata dict instead of a string, since this would maximize flexibility and reduce complexity. What are your thoughts @KennethEnevoldsen?
Sorry for missing this @dhruvbpai
I would just use the task_name, and then you can fetch the task using:

```python
task = mteb.get_task("name")
meta = task.metadata
```
This is to keep it consistent with sentence transformers.
> as a part of https://github.com/embeddings-benchmark/mteb/pull/216 where it was pretty close to a merge but sadly never got finished.
Actually, I would be happy to revive #216, but I would need you, the MTEB maintainers, to agree on the interface to do so before I start re-implementing it.
Hi @avidale. We have actually added task-conditional encoding in #888, which allows for the encoding as stated above. This makes it possible to create prompts based on tasks (e.g. for the instruct E5 models). However, you might just as well use it to fetch task metadata:

```python
def encode(self, sentences, prompt_name, **kwargs):
    task = mteb.get_task(prompt_name)
    langs = task.metadata.languages
    # encode text based on languages
    ...
```
The one problem here is multilingual tasks (e.g. dan, eng, fra), where a task can have multiple languages (at the moment the model can't know whether it is currently encoding eng, fra, or dan). We could still add this.
Will close this issue for now - feel free to reopen if required.
@gmittal Currently, the `save_predictions` flag allows for saving query similarity predictions to a JSON file. However, I would like a separate flag to save the computed embeddings, e.g. as a torch tensor, pickle, or JSON file. I believe this feature would allow for greater flexibility. https://github.com/embeddings-benchmark/mteb/blob/main/mteb/abstasks/AbsTaskRetrieval.py#L303-L306
I can work on this issue myself if needed, but I wanted to verify such a feature was within scope for mteb.
(My work focuses on retrieval, so I am somewhat less familiar with the AbsTask setup for other tasks, but from what I can tell they are similar, in that saving embeddings should definitely be possible.)