This should be rewritten to simply express two facts: you can't use a non-BERT model if you specify pooling_type=mean, and you have to do mean pooling on llama models to get state-of-the-art embedding performance. Doing the mean pooling outside of the device (which could be the CPU, but is definitely not interpreted) is a serious performance issue for interpreted languages.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Feature Description: Write the mean-pooled embedding vector to a user-supplied destination, with optional skip tokens
(tl;dr: see https://github.com/ggerganov/llama.cpp/pull/6753)
A new function, llama_get_mean_pooled(ctx, skip_token_count, dest), should write the n_embd embedding floats to dest, optionally skipping some initial tokens, collecting them, and averaging them over ctx->embd.
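A minimal sketch of what the proposed entry point could look like (the return type and exact parameter types here are assumptions, not part of this request; see the linked PR for the actual code):

```cpp
// Sketch of the proposed function (return type is an assumption).
// Writes the mean of the per-token embeddings for the last decoded batch
// into dest, ignoring the first skip_token_count tokens (e.g. an
// instruction prefix). dest must have room for n_embd floats.
int32_t llama_get_mean_pooled(
        struct llama_context * ctx,
        int32_t                skip_token_count,
        float                * dest);
```

A language binding could then hand in a single n_embd-sized buffer and read the result back directly, with no per-token traffic in the host language.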
Motivation
Make the newest embedding models easier to use and performant with llama.cpp and its language bindings.
Make it easier to use the newest LLM-based embedding models
GritLM, e5-mistral, and echo-mistral are all open-source, transformer-based models in the top 10 of the MTEB leaderboard. Because they are not BERT models, they don't get inp_mean allocated for them regardless of how the pooling type is set. GritLM and e5-mistral were evaluated with mean pooling, so examples/gritlm/gritlm.cpp does this manually, off device. It would be nice to have this done in the main project to make it easier to use these new embedding models.

Make it performant to use the newest LLM-based embedding models in Ruby and other language bindings
Calculating the mean over large per-token embedding buffers in an interpreted language is not desirable. Writing the embedding vector to user-supplied memory will make language integrations easier.
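For context, this is roughly the host-side pooling loop that examples/gritlm/gritlm.cpp performs today and that each language binding would otherwise have to reimplement (a simplified sketch, assuming the existing llama_get_embeddings_ith per-token accessor; details in the real example differ):

```cpp
// Host-side mean pooling in the spirit of examples/gritlm/gritlm.cpp (a sketch,
// not a verbatim copy): sum the per-token embeddings on the CPU, skipping the
// first n_skip tokens (e.g. an instruction prefix), then divide by the count.
#include <vector>
#include "llama.h"

static std::vector<float> mean_pool_host(llama_context * ctx, int32_t n_toks, int32_t n_skip) {
    const int32_t n_embd = llama_n_embd(llama_get_model(ctx));
    std::vector<float> emb(n_embd, 0.0f);
    if (n_toks <= n_skip) {
        return emb; // nothing to pool
    }
    for (int32_t i = n_skip; i < n_toks; ++i) {
        const float * tok = llama_get_embeddings_ith(ctx, i); // per-token embedding
        for (int32_t j = 0; j < n_embd; ++j) {
            emb[j] += tok[j];
        }
    }
    const float denom = float(n_toks - n_skip);
    for (int32_t j = 0; j < n_embd; ++j) {
        emb[j] /= denom;
    }
    return emb;
}
```

Running this loop in Ruby or Python instead of C++ multiplies the cost further, which is the performance concern above.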
Possible Implementation
See https://github.com/ggerganov/llama.cpp/pull/6753 for a potential implementation.