FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

How is the input max_length computed in the BAAI/bge-reranker-v2-m3 model? #740

Open thebarkingdog-yh opened 2 months ago

thebarkingdog-yh commented 2 months ago
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)
scores = reranker.compute_score(['the query', "the document to rerank...."], normalize=True, max_length=512)

What exactly is the unit of this max_length = 512? Is it tokens or character count? And what happens when the input exceeds it: is it simply truncated? Does the reranker-v2-m3 model itself have an upper limit on max_length? Is this 512 adjustable (for example, raised to 1024 or 4096), or is adjusting it not recommended?

staoxiao commented 2 months ago

max_length is the maximum number of tokens. We will truncate the text and only keep the first 8192 tokens.

The upper bound of max_length in bge-reranker-v2-* is 8192. A larger max_length allows the model to process long texts, but it comes with more computational consumption.
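The truncation behaviour described above can be sketched in plain Python. This is an illustrative mock, not the actual FlagEmbedding internals: `whitespace_tokenize` and `truncate_pair` are hypothetical names, and the real model uses a subword tokenizer, so its token counts will differ.

```python
def whitespace_tokenize(text):
    # Stand-in for the model's real subword tokenizer; real token
    # counts are usually higher than whitespace-token counts.
    return text.split()

def truncate_pair(query, passage, max_length=512):
    """Keep only the first max_length tokens of the query+passage pair,
    mirroring the 'keep the first N tokens' behaviour described above."""
    tokens = whitespace_tokenize(query) + whitespace_tokenize(passage)
    return tokens[:max_length]

# A long passage is cut down to max_length tokens before scoring.
tokens = truncate_pair("what is a reranker", "a reranker scores " * 400,
                       max_length=512)
print(len(tokens))  # -> 512
```

Anything past the limit is silently dropped, which is why a score computed at max_length=512 can differ from one computed at a larger limit on the same long document.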

thebarkingdog-yh commented 2 months ago

> max_length is the maximum number of tokens. We will truncate the text and only keep the first 8192 tokens.
>
> The upper bound of max_length in bge-reranker-v2-* is 8192. A larger max_length allows the model to process long texts, but it comes with more computational consumption.

So the maximum input to the rerank model is 8192 tokens, and anything beyond that is truncated.

But I noticed that max_length defaults to 512 in the parameters of compute_score. What is the purpose of this parameter? When I pass inputs longer than 512 tokens, it still works, and the output score still changes.

staoxiao commented 2 months ago

A larger max_length allows the model to process long texts, but it comes with more computational consumption. The small default value (512) is meant to speed up inference. If most of your text is long, we recommend using a larger max_length.
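One way to act on this advice is to pick max_length from a rough token estimate of your own documents, capped at the model's 8192-token limit. This is a hypothetical helper, not part of the FlagEmbedding API; the whitespace-based estimate undercounts real subword tokens, so treat the buckets as a starting point.

```python
def choose_max_length(docs, cap=8192):
    """Pick the smallest common max_length bucket that covers the
    longest document, capped at the model's 8192-token upper bound."""
    # Crude token estimate: whitespace tokens. A real subword tokenizer
    # would usually produce more tokens than this.
    longest = max(len(d.split()) for d in docs)
    for bucket in (512, 1024, 2048, 4096, cap):
        if longest <= bucket:
            return bucket
    return cap

print(choose_max_length(["short doc", "a " * 800]))  # -> 1024
```

The chosen value can then be passed straight through, e.g. `reranker.compute_score(pairs, max_length=choose_max_length(docs))`, trading some inference speed for less truncation.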

thebarkingdog-yh commented 2 months ago

> A larger max_length allows the model to process long texts, but it comes with more computational consumption. The small default value (512) is meant to speed up inference. If most of your text is long, we recommend using a larger max_length.

So I should choose an appropriate max_length based on the length of the documents I am comparing to get better results. If I use the default value, all documents will always be truncated to 512 tokens for comparison, and if my documents are longer, the results might be worse. Am I understanding this correctly?