
llama : support reranking API endpoint and models #8555

Open ciekawy opened 1 month ago

ciekawy commented 1 month ago

Feature Description

Support reranking API and models.

Motivation

Reranking is currently a very common technique used alongside embeddings in RAG systems. There are also models where the same model instance can be used for both embeddings and reranking, which is a great resource optimisation.

Possible Implementation

Reranking is relatively close to embeddings, and there are models that handle both, like bge-m3, which llama.cpp already supports with --embed. One possible challenge/dilemma is that inference and embeddings currently follow the OpenAI API schema, and OpenAI does not offer a rerank API; the Jina rerank API seems to be the one commonly used in other projects. In terms of the actual computation, reranking should not be very complex, as it is quite closely related to embedding calls.
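For illustration, here is a minimal sketch of what a Jina-style rerank call could look like against a local server. This is an assumption for discussion, not an existing llama.cpp endpoint: the URL, the /v1/rerank path, and the field names simply mirror Jina's public rerank API schema.

```python
import requests

# Hypothetical: assumes a llama.cpp server exposing a Jina-style
# /v1/rerank endpoint; the request fields mirror Jina's rerank API.
payload = {
    "model": "bge-reranker-v2-m3",
    "query": "What is the capital of France?",
    "documents": [
        "Paris is the capital and most populous city of France.",
        "The Eiffel Tower is located in Paris.",
        "Berlin is the capital of Germany.",
    ],
    "top_n": 2,  # return only the two highest-scoring documents
}

resp = requests.post("http://localhost:8080/v1/rerank", json=payload)

# Jina-style responses carry a "results" list with each document's index
# and relevance score, sorted from most to least relevant.
for r in resp.json()["results"]:
    print(r["index"], r["relevance_score"])
```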

ciekawy commented 1 month ago

I saw just one open discussion about reranking, https://github.com/ggerganov/llama.cpp/discussions/8216, and possibly two loosely related tickets; linking them for visibility:

https://github.com/ggerganov/llama.cpp/issues/5403 https://github.com/ggerganov/llama.cpp/issues/5954

Slightly related, but rather out of scope for this ticket, is also support for converting more model formats to GGUF.

rujialiu commented 1 month ago

I'm developing a lightweight (in terms of disk usage) local RAG application. Embedding/LLM is handled very well by llama.cpp, but the reranker is a headache. My reranker of choice (bge-reranker-v2-m3) takes 2 GB of disk space, which is bigger than the embedding model and LLM together. Hugging Face's text-embeddings-inference is fast, but it doesn't support any quantization (at least not in an obvious way); infinity_emb supports ONNX int8 quantization but is not lightweight itself. If llama.cpp supported rerankers, I would definitely use it for all of embedding/reranking/LLM.

ggerganov commented 1 month ago

I am not familiar with the concept of "reranking" - do you have some good resource, or can you explain it in simple terms here?

ciekawy commented 1 month ago

TL;DR: Reranking involves taking a set of search results and reordering them by how well each one matches a specific query :)

It is all nicely described here: https://jina.ai/reranker/

rujialiu commented 1 month ago

We can also reduce token usage and hallucination by filtering out low-score documents before feeding them to the LLM, which is especially useful when developing tool-using agents: suppose you have 1000 built-in tools and don't want to pass all of them to the LLM. A good approach is to use embeddings to get, say, the top-30 most similar tools first, and then use a reranker to keep only the highly relevant ones. Embedding + vector search is fast but much less accurate than a reranker, so this embedding + reranker + LLM workflow works very well in practice.
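A rough sketch of that two-stage selection, to make the workflow concrete. The helpers embed(), vector_search(), and rerank_score() are hypothetical placeholders for whatever embedding index and reranker backend is actually used:

```python
def select_tools(query: str, tools: list[str], coarse_k: int = 30,
                 score_threshold: float = 0.5) -> list[str]:
    """Two-stage retrieval: cheap vector search, then accurate reranking."""
    # Stage 1 (fast, approximate): embed the query and pull the coarse_k
    # nearest tool descriptions from a vector index.
    candidates = vector_search(embed(query), tools, top_k=coarse_k)

    # Stage 2 (slow, accurate): score each (query, tool) pair with a
    # cross-encoder reranker and keep only the high-scoring tools.
    scored = [(tool, rerank_score(query, tool)) for tool in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [tool for tool, score in scored if score >= score_threshold]
```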

foldl commented 1 month ago

FYI: chatllm.cpp supports two re-ranker models, and RAG of course.

foldl commented 1 month ago

Re-ranking models output a score for a pair of a question and a text chunk, measuring how well the chunk fits as an answer.
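For reference, this pairwise scoring can be tried today with an existing implementation. A minimal sketch using the sentence-transformers CrossEncoder wrapper (the model name is just one of the rerankers mentioned in this thread):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder reranker reads each (query, passage) pair jointly and
# outputs a single relevance score per pair, not an embedding per text.
model = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "How do I quantize a model?"
passages = [
    "Use the quantize tool to convert a GGUF model to lower precision.",
    "The weather in Paris is usually mild in spring.",
]

# One score per (query, passage) pair; higher means more relevant.
scores = model.predict([(query, p) for p in passages])
print(scores)
```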

ggerganov commented 1 month ago

Got it. I assume there are some special tokens that are used to specify which text is the question and which text is the answer? And it seems that instead of an LM head, the model ends with a classification head. Is the attention non-causal?

foldl commented 1 month ago

In the case of XLMRobertaForSequenceClassification, used by bge-reranker-m3, bce-reranker, etc., the Q&A pair is encoded as:

```
bos question eos
bos answer eos
```

It is non-causal.
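To check this concretely, the tokenizer output can be inspected directly. A minimal sketch using Hugging Face transformers (the exact special-token layout may vary between the reranker variants mentioned above):

```python
from transformers import AutoTokenizer

# Tokenize a (question, answer) pair the way an
# XLMRobertaForSequenceClassification-based reranker consumes it.
tok = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
enc = tok("What is the capital of France?", "Paris is the capital of France.")

# Print the tokens to see where bos/eos delimit question and answer.
print(tok.convert_ids_to_tokens(enc["input_ids"]))
```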

ciekawy commented 1 month ago

It may be worth having a look at the actual reranker models and their config files.