beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

adding support for decoder-only LLMs e.g. E5-Mistral #164

Closed. Krilecy closed this issue 7 months ago.

Krilecy commented 7 months ago

Hi, I worked on a project using the E5-Mistral 7B model (accompanying paper), which currently tops the MTEB leaderboard. This repo saved me a lot of work! Thanks!

Since the authors didn't publish their code, I implemented a decoder-only model class for my project. Should I open a PR? Is anybody already working on this? Should this be included?

I need some guidance from the maintainers here on whether this is desired.

I would clean it up a bit for a PR.


from typing import Dict, List, Optional

import torch
import torch.nn.functional as F
from torch import Tensor
from tqdm import trange
from transformers import AutoModel, AutoTokenizer


class DecoderOnlyLM:
    def __init__(self, model_path: str, max_length: int = 4096, device: Optional[str] = None):
        # Shared tokenizer and decoder-only model for queries and corpus
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModel.from_pretrained(model_path).to(self.device)
        self.model.eval()
        self.max_length = max_length

    def last_token_pool(self, last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
        # Use the hidden state of the last non-padding (EOS) token as the sequence embedding
        left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
        if left_padding:
            return last_hidden_states[:, -1]
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

    def encode_queries(self, queries: List[str], batch_size: int = 1, **kwargs) -> torch.Tensor:
        # Extra kwargs (e.g. passed by BEIR's dense retrieval wrappers) are accepted and ignored
        query_embeddings = []
        with torch.no_grad():
            for start_idx in trange(0, len(queries), batch_size):
                encoded = self.tokenizer(queries[start_idx:start_idx + batch_size], truncation=True, padding=True,
                                         return_tensors='pt', max_length=self.max_length).to(self.device)
                model_out = self.model(encoded['input_ids'], attention_mask=encoded['attention_mask'])
                embeddings = self.last_token_pool(model_out.last_hidden_state, encoded['attention_mask'])
                query_embeddings.append(F.normalize(embeddings, p=2, dim=1).cpu())

        return torch.cat(query_embeddings, dim=0)

    def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int = 1, **kwargs) -> torch.Tensor:
        corpus_embeddings = []
        with torch.no_grad():
            for start_idx in trange(0, len(corpus), batch_size):
                titles = [row['title'] for row in corpus[start_idx:start_idx + batch_size]]
                texts = [row['text'] for row in corpus[start_idx:start_idx + batch_size]]
                encoded = self.tokenizer(titles, texts, truncation='longest_first', padding=True,
                                         return_tensors='pt', max_length=self.max_length).to(self.device)
                model_out = self.model(encoded['input_ids'], attention_mask=encoded['attention_mask'])
                embeddings = self.last_token_pool(model_out.last_hidden_state, encoded['attention_mask'])
                corpus_embeddings.append(F.normalize(embeddings, p=2, dim=1).cpu())

        return torch.cat(corpus_embeddings, dim=0)
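
For context, this is roughly how I plug the class into BEIR's standard dense evaluation in my project. Just a sketch: the checkpoint name, batch size, and the nfcorpus download URL are assumptions for a typical setup, not part of the class itself.

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load NFCorpus (standard BEIR dataset archive; URL assumed)
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap the encoder in BEIR's exact-search dense retriever and run the usual evaluation
model = DRES(DecoderOnlyLM("intfloat/e5-mistral-7b-instruct"), batch_size=1)  # checkpoint name assumed
retriever = EvaluateRetrieval(model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)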
thakur-nandan commented 7 months ago

Hi @Krilecy,

Thanks for the suggestion, and feel free to open a PR adding decoder-only LMs for retrieval.

Try to evaluate a whole dataset (say, NFCorpus) and compare your evaluation scores against the ones reported in the paper.

Thanks, Nandan

Krilecy commented 7 months ago

Hi, when I opened this issue, I had not seen that MS Research had already published an eval script. I (and others) should use that one to keep results consistent and comparable. Sorry, my bad.