FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

BGE-M3 Sparse #532

Closed. andrePankraz closed this issue 1 month ago.

andrePankraz commented 5 months ago

Currently I cannot really use the "sparse mode" of BGE-M3. Even with 8 GB VRAM and small batch sizes I get CUDA out of memory. Why does this mode need so much VRAM? Is this to be expected? The other modes (Dense/ColBERT) don't run into this, even with large batches. Can this somehow be mitigated? Split over GPUs?

File "/usr/lib/python3/dist-packages/FlagEmbedding/BGE_M3/modeling.py", line 357, in forward sparse_vecs = self.sparse_embedding(last_hidden_state, text_input['input_ids'], File "/usr/lib/python3/dist-packages/FlagEmbedding/BGE_M3/modeling.py", line 106, in sparse_embedding sparse_embedding = torch.zeros(input_ids.size(0), input_ids.size(1), self.vocab_size, torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.66 GiB.

hanhainebula commented 5 months ago

Hello! Could you paste your code here? I will check it.

andrePankraz commented 5 months ago

Hi, I just call your methods without much fluff:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel(
    "BAAI/bge-m3", use_fp16=True
)

# passages_inputs is the tokenized passage batch (dict with input_ids etc.)
passages_outputs = model.model(
    passages_inputs,
    return_dense=False,
    return_sparse=True,
    return_colbert=False,
    return_sparse_embedding=True
)

I'm just following compute_score here: https://github.com/FlagOpen/FlagEmbedding/blob/11dc092e39ed0ff6e715866b2bdaca0cc775a296/FlagEmbedding/bge_m3.py#L188, which also uses sparse_vecs and not lexical_weights.

Nothing special. But I think I understand the problem. See method sparse_embedding here: https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/modeling.py

    sparse_embedding = torch.zeros(input_ids.size(0), input_ids.size(1), self.vocab_size,
                                   dtype=token_weights.dtype,
                                   device=token_weights.device)

Because the vocab size is quite large at 250,000, this method tries to allocate 250,000 entries × 4 bytes × 512 tokens = 0.5 GB per sequence (assuming the sequence is only 512 tokens, which is what I have). So a batch with 10 such short 512-token sequences already needs 5 GB! That doesn't scale well... and I won't even talk about 8k-token sequences here.
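
The same arithmetic in code (my own check of the numbers above; 4 bytes per element assumes fp32 weights):

vocab_size = 250_000        # vocab size, per the discussion above
seq_len = 512               # tokens per sequence in my batches
bytes_per_elem = 4          # assuming fp32 token weights

per_sequence = vocab_size * seq_len * bytes_per_elem
print(per_sequence / 1024**3)        # ~0.48 GiB per sequence
print(10 * per_sequence / 1024**3)   # ~4.8 GiB for a batch of 10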

Scattering these single token weights over such a huge sparse tensor just for a max-pooling operation doesn't sound efficient to me; the memory I/O pressure on VRAM is intense even if the memory were available. But I don't have the time to dive deep into EmbeddingBag etc.
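
To illustrate, here is a rough, untested sketch of my own (not the library's code) that max-pools the per-token weights directly into a (batch, vocab_size) buffer via scatter_reduce_, skipping the (batch, seq_len, vocab_size) intermediate; the shapes and the special-token handling are assumptions on my part:

import torch

def sparse_embedding_lowmem(token_weights,    # (batch, seq_len, 1), ReLU'd weights
                            input_ids,        # (batch, seq_len)
                            vocab_size,
                            unused_tokens):   # list of special-token ids to zero out
    # Allocate only (batch, vocab_size) instead of (batch, seq_len, vocab_size).
    weights = torch.zeros(input_ids.size(0), vocab_size,
                          dtype=token_weights.dtype, device=token_weights.device)
    # Max-pool each token's weight into its vocabulary slot (needs PyTorch >= 1.12).
    weights.scatter_reduce_(dim=1, index=input_ids,
                            src=token_weights.squeeze(-1), reduce="amax")
    # Zero out special tokens (CLS/EOS/PAD/UNK), assuming that is what the original does.
    weights[:, unused_tokens] = 0.0
    return weights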

staoxiao commented 5 months ago

We recommend using the encode function, which returns a dict of token weights instead of a sparse embedding tensor. You can refer to https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#generate-embedding-for-text

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',  use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04092}, 
#  {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633, 'BM': 0.2515, '25': 0.3335}]

# compute the scores via lexical matching
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.19554901123046875

print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0

andrePankraz commented 5 months ago

Thank you all, I will try.

In that case you should adapt your example at https://huggingface.co/BAAI/bge-m3 ("Compute score for text pairs"), which uses model.compute_score() and therefore the sparse embeddings, right?

Is it really the same quality in the end? I don't fully understand the difference right now: when to use token weights and when to use sparse vecs. Sparse vecs for training and the token dict for inference?

staoxiao commented 5 months ago

There is no difference between the results of sparse-vecs and token-weights. sparse-vecs is suitable for training on GPUs because it can be used as a tensor.
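
Conceptually, the lexical matching score computed from the token-weight dicts is the same dot product you would get from the vocab-sized sparse vectors, just restricted to the tokens that actually occur in both texts. A rough sketch of the idea (not the library's actual implementation):

def lexical_matching_score(weights_1: dict, weights_2: dict) -> float:
    # Sum of weight products over tokens shared by both texts; equivalent to the
    # dot product of the two vocab-sized sparse vectors, since all other entries are zero.
    return sum(w * weights_2[token] for token, w in weights_1.items() if token in weights_2)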