FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

Very slow to encode even with sparse/colbert disabled #456

Open Potrock opened 5 months ago

Potrock commented 5 months ago

Running on an M2 Max MacBook Pro with 32 GB of memory, inference only.

I've found that encoding is very slow compared to BGE-Large-en-v1.5 (a couple hundred ms with BGE-Large versus 12+ seconds with BGE-M3). The encoding speed is the same even if I disable the ColBERT and sparse outputs.

model.encode(max_length=100, sentences=["What is BGE M3?"], batch_size=1, return_sparse=False, return_dense=True, return_colbert_vecs=False)
encoding:   0%|          | 0/1 [00:00<?, ?it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[... the same tokenizers warning is printed three more times ...]
encoding: 100%|██████████| 1/1 [00:12<00:00, 12.19s/it]
{'dense_vecs': None,
 'lexical_weights': [defaultdict(int,
              {'4865': 0.0838,
               '83': 0.08154,
               '335': 0.1299,
               '11679': 0.252,
               '276': 0.1702,
               '363': 0.2698,
               '32': 0.0408})],
 'colbert_vecs': None}
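
(Side note: the tokenizers fork warning above can be suppressed exactly as the message suggests, by setting the environment variable before the tokenizer is first used; a minimal sketch:)

```python
import os
# must run before FlagEmbedding / transformers tokenizers are imported or used
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from FlagEmbedding import BGEM3FlagModel  # import after setting the variable
```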

Whereas with BGE-Large-en-v1.5:

from FlagEmbedding import FlagModel
import time
model = FlagModel('BAAI/bge-large-en-v1.5')
start = time.time()
model.encode("What is BGE M3?")
end = time.time()
print(end - start)
0.06534576416015625

Am I doing something wrong?

Potrock commented 5 months ago

For what it's worth, if I encode 3000 sentences with batch size 32 it takes on average 50ms per sentence.
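
(A rough way to reproduce that per-sentence average, for anyone comparing numbers; the corpus here is a placeholder, and the figure is simply total wall time divided by the number of sentences:)

```python
import time

sentences = ["What is BGE M3?"] * 3000  # placeholder corpus for illustration

start = time.time()
model.encode(sentences, batch_size=32, max_length=100,
             return_dense=True, return_sparse=False, return_colbert_vecs=False)
elapsed = time.time() - start
print(f"{elapsed / len(sentences) * 1000:.1f} ms per sentence on average")
```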

staoxiao commented 5 months ago

Hi, the code you used generates only the lexical weights (dense_vecs and colbert_vecs are both None), and it takes some time to process the weight for each token: https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/bge_m3.py#L156

Also, in the M3 implementation we changed the way batch data is generated, which may affect latency. We will look into it.
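
(One quick way to see how much of the latency comes from the per-token weight processing versus the model forward pass is to time the same call with and without return_sparse; a sketch, reusing the model instance from above:)

```python
import time

def timed_encode(**kwargs):
    start = time.time()
    model.encode(["What is BGE M3?"], batch_size=1, max_length=100, **kwargs)
    return time.time() - start

print("dense only :", timed_encode(return_dense=True, return_sparse=False, return_colbert_vecs=False))
print("with sparse:", timed_encode(return_dense=True, return_sparse=True, return_colbert_vecs=False))
```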

Potrock commented 5 months ago

> Hi, the code you used generates only the lexical weights (dense_vecs and colbert_vecs are both None), and it takes some time to process the weight for each token: https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/bge_m3.py#L156
>
> Also, in the M3 implementation we changed the way batch data is generated, which may affect latency. We will look into it.

Thanks for the quick reply. I pasted the wrong output, but even with only dense vectors being computed it still takes 12 seconds to encode.

tim-inkeep commented 5 months ago

@staoxiao Do you have any recommendations for easy ways to speed up embedding? Are there any changes we can try to implement? We're very willing to do whatever work is necessary to help out. We have tried playing with max_length and batch_size, but would some form of quantization help?

staoxiao commented 5 months ago

Hi @tim-inkeep, you can try the tool from Hugging Face: https://github.com/huggingface/text-embeddings-inference. Converting the model to another format (e.g., ONNX) can also help.
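
(For the ONNX route, one option is Hugging Face Optimum. This is only a sketch: it assumes the `optimum[onnxruntime]` extra is installed, and it exports just the base encoder, i.e. the dense CLS embedding, not the sparse or ColBERT heads:)

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForFeatureExtraction.from_pretrained("BAAI/bge-m3", export=True)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

inputs = tokenizer("What is BGE M3?", return_tensors="pt")
outputs = ort_model(**inputs)
dense_vec = outputs.last_hidden_state[:, 0]  # CLS pooling; normalize before cosine similarity
```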

tim-inkeep commented 5 months ago

Hey @staoxiao, I changed the encoding code to follow the same batching as the compute_score method and it seems to speed things up significantly. Is there any reason not to do this?

def encode(self,
           sentences: Union[List[str], str],
           batch_size: int = 12,
           max_length: int = 8192,
           return_dense: bool = True,
           return_sparse: bool = False,
           return_colbert_vecs: bool = False) -> Dict:
    if self.num_gpus > 1:
        batch_size *= self.num_gpus
    self.model.eval()

    input_was_string = False
    if isinstance(sentences, str):
        sentences = [sentences]
        input_was_string = True        

    all_dense_embeddings, all_lexical_weights, all_colbert_vec = [], [], []
    for start_index in tqdm(range(0, len(sentences), batch_size), desc='encoding', mininterval=10):

        sentences_batch = sentences[start_index:start_index + batch_size]
        batch_data = self.tokenizer(sentences_batch, return_tensors="pt", max_length=max_length,
            padding=True,
            return_token_type_ids=False,
            truncation=True,)
        batch_data = batch_data.to(self.device)

        output = self.model(batch_data,
                            return_dense=return_dense,
                            return_sparse=return_sparse,
                            return_colbert=return_colbert_vecs)
        if return_dense:
            all_dense_embeddings.append(output['dense_vecs'].cpu().numpy())

        if return_sparse:
            token_weights = output['sparse_vecs'].squeeze(-1)
            all_lexical_weights.extend(list(map(self._process_token_weights, token_weights.cpu().numpy(),
                                                batch_data['input_ids'].cpu().numpy().tolist())))

        if return_colbert_vecs:
            all_colbert_vec.extend(list(map(self._process_colbert_vecs, output['colbert_vecs'].cpu().numpy(),
                                            batch_data['attention_mask'].cpu().numpy())))

    if return_dense:
        all_dense_embeddings = np.concatenate(all_dense_embeddings, axis=0)

    if return_dense:
        if input_was_string:
            all_dense_embeddings = all_dense_embeddings[0]
    else:
        all_dense_embeddings = None

    if return_sparse:
        if input_was_string:
            all_lexical_weights = all_lexical_weights[0]
    else:
        all_lexical_weights = None

    if return_colbert_vecs:
        if input_was_string:
            all_colbert_vec = all_colbert_vec[0]
    else:
        all_colbert_vec = None

    return {"dense_vecs": all_dense_embeddings, "lexical_weights": all_lexical_weights,
            "colbert_vecs": all_colbert_vec}`
Potrock commented 5 months ago

> Hey @staoxiao, I changed the encoding code to follow the same batching as the compute_score method and it seems to speed things up significantly. Is there any reason not to do this?

I had to pull the _process_colbert_vecs and _process_token_weights functions out into methods and add self to their arguments, but with your data loading change I'm able to encode about 15 strings per second with all three embedding types enabled.

Potrock commented 5 months ago

Here's the full file if you want to give it a shot and check the speed with the other embedding methods enabled, @tim-inkeep:

from typing import cast, List, Union, Tuple, Optional, Dict
import numpy as np
from collections import defaultdict
import torch
from tqdm import tqdm
import datasets
from transformers import PreTrainedTokenizerFast, BatchEncoding, DataCollatorWithPadding, XLMRobertaForMaskedLM
from torch.utils.data import DataLoader
from functools import partial
from FlagEmbedding.BGE_M3 import BGEM3ForInference

def _transform_func(examples: Dict[str, List],
                    tokenizer: PreTrainedTokenizerFast,
                    max_length: int = 8192,
                    ) -> BatchEncoding:
    inputs = tokenizer(examples['text'],
                       max_length=max_length,
                       padding=True,
                       return_token_type_ids=False,
                       truncation=True,
                       return_tensors='pt')
    return inputs

class BGEM3FlagModel:
    def __init__(
            self,
            model_name_or_path: str = None,
            pooling_method: str = 'cls',
            normalize_embeddings: bool = True,
            use_fp16: bool = True,
            device: str = None
    ) -> None:

        self.model = BGEM3ForInference(
            model_name=model_name_or_path,
            normlized=normalize_embeddings,
            sentence_pooling_method=pooling_method,
        )

        self.tokenizer = self.model.tokenizer
        if device:
            self.device = torch.device(device)
        else:
            if torch.cuda.is_available():
                self.device = torch.device("cuda")
            elif torch.backends.mps.is_available():
                self.device = torch.device("mps")
            else:
                self.device = torch.device("cpu")
                use_fp16 = False
        if use_fp16: self.model.half()
        self.model = self.model.to(self.device)

        if device is None:
            self.num_gpus = torch.cuda.device_count()
            if self.num_gpus > 1:
                print(f"----------using {self.num_gpus}*GPUs----------")
                self.model.model = torch.nn.DataParallel(self.model.model)
        else:
            self.num_gpus = 1

        self.model.eval()

    def convert_id_to_token(self, lexical_weights: List[Dict]):
        if isinstance(lexical_weights, dict):
            lexical_weights = [lexical_weights]
        new_lexical_weights = []
        for item in lexical_weights:
            new_item = {}
            for id, weight in item.items():
                token = self.tokenizer.decode([int(id)])
                new_item[token] = weight
            new_lexical_weights.append(new_item)

        if len(new_lexical_weights) == 1:
            new_lexical_weights = new_lexical_weights[0]
        return new_lexical_weights

    def compute_lexical_matching_score(self, lexical_weights_1: Dict, lexical_weights_2: Dict):
        scores = 0
        for token, weight in lexical_weights_1.items():
            if token in lexical_weights_2:
                scores += weight * lexical_weights_2[token]
        return scores

    def colbert_score(self, q_reps, p_reps):
        q_reps, p_reps = torch.from_numpy(q_reps), torch.from_numpy(p_reps)
        token_scores = torch.einsum('in,jn->ij', q_reps, p_reps)
        scores, _ = token_scores.max(-1)
        scores = torch.sum(scores) / q_reps.size(0)
        return scores

    def _process_token_weights(self, token_weights: np.ndarray, input_ids: list):
        # convert token weights to a dict of {token_id: weight}
        result = defaultdict(int)
        unused_tokens = set([self.tokenizer.cls_token_id, self.tokenizer.eos_token_id, self.tokenizer.pad_token_id,
                             self.tokenizer.unk_token_id])
        # token_weights = np.ceil(token_weights * 100)
        for w, idx in zip(token_weights, input_ids):
            if idx not in unused_tokens and w > 0:
                idx = str(idx)
                # w = int(w)
                if w > result[idx]:
                    result[idx] = w
        return result

    def _process_colbert_vecs(self, colbert_vecs: np.ndarray, attention_mask: list):
        # delete the vectors of padding tokens
        tokens_num = np.sum(attention_mask)
        return colbert_vecs[:tokens_num - 1]  # we don't use the embedding of cls, so select tokens_num-1

    @torch.no_grad()
    def encode(self,
               sentences: Union[List[str], str],
               batch_size: int = 12,
               max_length: int = 8192,
               return_dense: bool = True,
               return_sparse: bool = False,
               return_colbert_vecs: bool = False) -> Dict:

        if self.num_gpus > 1:
            batch_size *= self.num_gpus
        self.model.eval()

        input_was_string = False
        if isinstance(sentences, str):
            sentences = [sentences]
            input_was_string = True

        all_dense_embeddings, all_lexical_weights, all_colbert_vec = [], [], []
        for start_index in tqdm(range(0, len(sentences), batch_size), desc='encoding', mininterval=10):
            sentences_batch = sentences[start_index:start_index + batch_size]
            batch_data = self.tokenizer(sentences_batch, return_tensors="pt", max_length=max_length,
                padding=True,
                return_token_type_ids=False,
                truncation=True,)
            batch_data = batch_data.to(self.device)

            output = self.model(batch_data,
                                return_dense=return_dense,
                                return_sparse=return_sparse,
                                return_colbert=return_colbert_vecs)
            if return_dense:
                all_dense_embeddings.append(output['dense_vecs'].cpu().numpy())

            if return_sparse:
                token_weights = output['sparse_vecs'].squeeze(-1)
                all_lexical_weights.extend(list(map(self._process_token_weights, token_weights.cpu().numpy(),
                                                    batch_data['input_ids'].cpu().numpy().tolist())))

            if return_colbert_vecs:
                all_colbert_vec.extend(list(map(self._process_colbert_vecs, output['colbert_vecs'].cpu().numpy(),
                                                batch_data['attention_mask'].cpu().numpy())))

        if return_dense:
            all_dense_embeddings = np.concatenate(all_dense_embeddings, axis=0)

        if return_dense:
            if input_was_string:
                all_dense_embeddings = all_dense_embeddings[0]
        else:
            all_dense_embeddings = None

        if return_sparse:
            if input_was_string:
                all_lexical_weights = all_lexical_weights[0]
        else:
            all_lexical_weights = None

        if return_colbert_vecs:
            if input_was_string:
                all_colbert_vec = all_colbert_vec[0]
        else:
            all_colbert_vec = None

        return {"dense_vecs": all_dense_embeddings, "lexical_weights": all_lexical_weights,
                    "colbert_vecs": all_colbert_vec}

    @torch.no_grad()
    def compute_score(self,
                      sentence_pairs: Union[List[Tuple[str, str]], Tuple[str, str]],
                      batch_size: int = 256,
                      max_query_length: int = 512,
                      max_passage_length: int = 8192,
                      weights_for_different_modes: List[float] = None) -> Dict[str, List[float]]:

        def _tokenize(texts: list, max_length: int):
            return self.tokenizer(
                texts,
                max_length=max_length,
                padding=True,
                return_token_type_ids=False,
                truncation=True,
                return_tensors='pt'
            )

        if self.num_gpus > 0:
            batch_size *= self.num_gpus
        self.model.eval()
        if isinstance(sentence_pairs, list) and len(sentence_pairs) == 0:
            return []
        if isinstance(sentence_pairs[0], str):
            one_input_pair = True
            sentence_pairs = [sentence_pairs]
        else:
            one_input_pair = False

        all_scores = {
            'colbert': [],
            'sparse': [],
            'dense': [],
            'sparse+dense': [],
            'colbert+sparse+dense': []
        }
        for start_index in tqdm(range(0, len(sentence_pairs), batch_size), desc="Compute Scores",
                                disable=len(sentence_pairs) < 128):
            sentences_batch = sentence_pairs[start_index:start_index + batch_size]

            queries_batch = [pair[0] for pair in sentences_batch]
            corpus_batch = [pair[1] for pair in sentences_batch]

            queries_inputs = _tokenize(queries_batch, max_length=max_query_length).to(self.device)
            corpus_inputs = _tokenize(corpus_batch, max_length=max_passage_length).to(self.device)

            queries_output = self.model(queries_inputs, return_dense=True, return_sparse=True, return_colbert=True,
                                        return_sparse_embedding=True)
            corpus_output = self.model(corpus_inputs, return_dense=True, return_sparse=True, return_colbert=True,
                                       return_sparse_embedding=True)

            q_dense_vecs, q_sparse_vecs, q_colbert_vecs = queries_output['dense_vecs'], queries_output['sparse_vecs'], \
            queries_output['colbert_vecs']
            p_dense_vecs, p_sparse_vecs, p_colbert_vecs = corpus_output['dense_vecs'], corpus_output['sparse_vecs'], \
            corpus_output['colbert_vecs']

            dense_scores = self.model.dense_score(q_dense_vecs, p_dense_vecs)
            sparse_scores = self.model.sparse_score(q_sparse_vecs, p_sparse_vecs)
            colbert_scores = self.model.colbert_score(q_colbert_vecs, p_colbert_vecs,
                                                      q_mask=queries_inputs['attention_mask'])

            if weights_for_different_modes is None:
                weights_for_different_modes = [1, 1., 1.]
                weight_sum = 3
                print("default weights for dense, sparse, colbert are [1.0, 1.0, 1.0] ")
            else:
                assert len(weights_for_different_modes) == 3
                weight_sum = sum(weights_for_different_modes)

            inx = torch.arange(0, len(sentences_batch))
            dense_scores, sparse_scores, colbert_scores = dense_scores[inx, inx].float(), sparse_scores[
                inx, inx].float(), colbert_scores[inx, inx].float()

            all_scores['colbert'].extend(
                colbert_scores.cpu().numpy().tolist()
            )
            all_scores['sparse'].extend(
                sparse_scores.cpu().numpy().tolist()
            )
            all_scores['dense'].extend(
                dense_scores.cpu().numpy().tolist()
            )
            all_scores['sparse+dense'].extend(
                ((sparse_scores * weights_for_different_modes[1] + dense_scores * weights_for_different_modes[0])/(weights_for_different_modes[1]+weights_for_different_modes[0])).cpu().numpy().tolist()
            )
            all_scores['colbert+sparse+dense'].extend(
                ((colbert_scores * weights_for_different_modes[2] + sparse_scores * weights_for_different_modes[1] + dense_scores * weights_for_different_modes[0])/weight_sum).cpu().numpy().tolist()
            )

        if one_input_pair:
            return {k: v[0] for k, v in all_scores.items()}
        return all_scores
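
(For completeness, a quick usage sketch of the class above; the model path, example pairs, and mode weights are illustrative only:)

```python
model = BGEM3FlagModel('BAAI/bge-m3')

pairs = [("What is BGE M3?", "BGE M3 is a multi-functional embedding model."),
         ("What is BM25?", "BM25 is a lexical ranking function.")]

scores = model.compute_score(pairs, weights_for_different_modes=[0.4, 0.2, 0.4])
print(scores['dense'])
print(scores['colbert+sparse+dense'])
```
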
staoxiao commented 5 months ago

> Hey @staoxiao, I changed the encoding code to follow the same batching as the compute_score method and it seems to speed things up significantly. Is there any reason not to do this?

When dealing with a large amount of data, using a dataloader to load the data will be much faster (because the dataloader enables parallel data processing across workers).
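
(A minimal sketch of that dataloader approach, tokenizing inside collate_fn so that num_workers > 0 parallelizes tokenization across worker processes; here `model` is assumed to be an instance of the class above and `sentences` a list of strings:)

```python
import torch
from torch.utils.data import DataLoader

def collate(batch_texts):
    # tokenize one batch at a time; padding is per batch, not to the global maximum length
    return model.tokenizer(batch_texts, max_length=8192, padding=True, truncation=True,
                           return_token_type_ids=False, return_tensors='pt')

# on macOS/Windows, guard this with `if __name__ == "__main__":` when num_workers > 0
loader = DataLoader(sentences, batch_size=32, collate_fn=collate, num_workers=4)

with torch.no_grad():
    for batch in loader:
        batch = batch.to(model.device)
        output = model.model(batch, return_dense=True, return_sparse=False, return_colbert=False)
```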