ggerganov / llama.cpp

LLM inference in C/C++
MIT License

write mean pooled embedding to caller's vector to simplify using SoTA embedding models and language bindings #6754

Closed: mgrosso closed this 5 months ago

mgrosso commented 7 months ago


Feature Description: Write Mean Pooled Embedding vector to a user-supplied destination, with optional skip tokens

(tl;dr? see https://github.com/ggerganov/llama.cpp/pull/6753)

a new function, llama_get_mean_pooled(ctx, skip_token_count, dest)

should write the n_embd embedding floats to dest, averaging the per-token embeddings collected in ctx->embd and optionally skipping some initial tokens, as sketched below.
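For concreteness, here is a minimal sketch of those semantics. The helper name and the flat n_tokens x n_embd buffer layout are assumptions made for illustration; see the PR above for the actual implementation, which would read the per-token embeddings and counts from the llama_context itself.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch of the proposed llama_get_mean_pooled() semantics.
// `embd` stands in for the per-token embeddings held by the context
// (ctx->embd), laid out as n_tokens rows of n_embd floats. Assumes
// skip_token_count < n_tokens.
static void mean_pool_sketch(const float * embd, int n_tokens, int n_embd,
                             int skip_token_count, float * dest) {
    const int n_used = n_tokens - skip_token_count;
    std::fill(dest, dest + n_embd, 0.0f);
    // accumulate the embeddings of every non-skipped token
    for (int i = skip_token_count; i < n_tokens; ++i) {
        const float * row = embd + (size_t) i * n_embd;
        for (int j = 0; j < n_embd; ++j) {
            dest[j] += row[j];
        }
    }
    // divide by the number of tokens actually pooled
    for (int j = 0; j < n_embd; ++j) {
        dest[j] /= (float) n_used;
    }
}
```

Writing the result through a caller-supplied dest means a language binding only has to hand over a buffer, with no per-token traffic across the FFI boundary.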

Motivation

Make the newest embedding models easier to use, and performant, with llama.cpp and its language bindings.

Make it easier to use the newest LLM-based embedding models

GritLM, e5-mistral, and echo-mistral are all open-source, transformer-based models in the top 10 of the MTEB leaderboard. Because they are not BERT models, inp_mean is never allocated for them regardless of how pooling_type is set. GritLM and e5-mistral were evaluated with mean pooling, so examples/gritlm/gritlm.cpp does the pooling manually, off-device. It would be nice to have this done in the main project to make these new embedding models easier to use.

Make it performant to use the newest LLM-based embedding models from Ruby and other language bindings

Computing the mean over a large number of per-token embeddings in an interpreted language is a serious bottleneck. Writing the pooled vector to user-supplied memory keeps that work in native code and will make language integrations easier.

Possible Implementation

See https://github.com/ggerganov/llama.cpp/pull/6753 for a potential implementation.

mgrosso commented 7 months ago

This should be rewritten to state the problem more plainly: you can't use a non-BERT model if you specify pooling_type=mean, yet you have to do mean pooling on llama-family models to get state-of-the-art embedding performance. Doing the mean pooling outside the device (the device could itself be the CPU, but at least it is not interpreted) is a serious performance issue for interpreted-language bindings.
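For comparison, this is roughly the host-side pooling that callers and bindings have to do today; a sketch along the lines of examples/gritlm/gritlm.cpp, assuming the context was created with embeddings enabled and output was requested for every token in the batch:

```cpp
#include <vector>

#include "llama.h"

// Sketch of the manual host-side pooling callers must do today.
// Assumes per-token embeddings are retrievable for the n_tokens
// tokens of the last decoded batch.
static std::vector<float> mean_pool_host(llama_context * ctx, int n_tokens) {
    const int n_embd = llama_n_embd(llama_get_model(ctx));
    std::vector<float> mean(n_embd, 0.0f);
    // sum the embedding of each token in the batch
    for (int i = 0; i < n_tokens; ++i) {
        const float * e = llama_get_embeddings_ith(ctx, i);
        for (int j = 0; j < n_embd; ++j) {
            mean[j] += e[j];
        }
    }
    // average
    for (float & v : mean) {
        v /= (float) n_tokens;
    }
    return mean;
}
```

In a compiled caller this loop is cheap; ported line-for-line into Ruby or Python it runs in the interpreter, which is exactly the cost the proposed function avoids.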

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.