ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Streamline embeddings from "non-embedding" models #8087

Open iamlemec opened 5 days ago

iamlemec commented 5 days ago

The goal here is to get the big embedding models at the top of the MTEB leaderboard working. There are two changes:

With this PR, we can get accurate results (matching HF) from at least the #2 model on the leaderboard, gte-Qwen2-7B-instruct. For instance, with the command:

./llama-embedding -m gte-qwen2-7b-instruct-f16.gguf -p "hello world" -ngl 99 --pooling last --attention non-causal -c 512
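
For reference, here is a minimal sketch of how the HF-side embedding could be computed to compare against the llama-embedding output. The HF repo id, last-token pooling code, and normalization step are assumptions made for illustration; they are not part of this PR.

```python
# Sketch: compute a reference embedding with transformers for comparison.
# Assumes the model is hosted at "Alibaba-NLP/gte-Qwen2-7B-instruct" and that
# the intended pooling is last-token pooling (mirroring --pooling last above).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Alibaba-NLP/gte-Qwen2-7B-instruct"  # assumed HF repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tok("hello world", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, dim)

# Last-token pooling: take the hidden state of the final non-padding token.
last_idx = inputs["attention_mask"].sum(dim=1) - 1
emb = hidden[torch.arange(hidden.size(0)), last_idx]
emb = torch.nn.functional.normalize(emb, dim=-1)  # assumed L2 normalization
print(emb.shape)
```

The resulting vector can then be compared (e.g. by cosine similarity) with the embedding printed by llama-embedding to check that the two implementations agree.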


iamlemec commented 1 day ago

@compilade cool! just rebased to master