The `LLamaEmbedder` does not currently expose all of the embedding capabilities of llama.cpp.

Currently:
- It only ever returns a single vector, as a `float[]`.
- Some models return a single vector representing the entire input sequence (embedding models), while others produce one embedding vector per token (generative models).
- Some models support setting a pooling mode, which selects a method for combining many per-token embeddings into a single embedding. This is probably only compatible with certain models.
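To make the pooling distinction concrete, here is a minimal sketch of how a pooling mode collapses per-token embeddings into one sequence embedding. It is written in Python/NumPy for brevity since the C# return shape is still an open question, and the function names are illustrative, not part of the LLamaSharp or llama.cpp API:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (tokens, dim) matrix into a single (dim,) vector."""
    return token_embeddings.mean(axis=0)

def last_token_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Use the final token's embedding as the sequence embedding."""
    return token_embeddings[-1]

# A generative model might produce one embedding per token:
per_token = np.array([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0]])  # 3 tokens, 2-dim embeddings

print(mean_pool(per_token))        # [3. 4.]
print(last_token_pool(per_token))  # [5. 6.]
```

An embedding model effectively hands back the already-pooled single vector; a generative model hands back the whole `(tokens, dim)` matrix, and it is then up to the caller (or a configured pooling mode) to reduce it.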
Improvements:

- Indicate which type of model an embedder was created with.
- Indicate how many results there are.
- Use `llama_get_embeddings`, `llama_get_embeddings_ith` and `llama_get_embeddings_seq` as appropriate to retrieve the correct embeddings.
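The split between those three accessors can be sketched as follows. These are Python stand-ins operating on in-memory arrays, purely to illustrate the intended semantics; in llama.cpp the real functions read from the context after decoding, and the shapes here are assumptions for illustration:

```python
import numpy as np

def get_embeddings(all_token_embeddings: np.ndarray) -> np.ndarray:
    # Analogue of llama_get_embeddings: all output embeddings for the batch,
    # one row per token that had output enabled.
    return all_token_embeddings

def get_embeddings_ith(all_token_embeddings: np.ndarray, i: int) -> np.ndarray:
    # Analogue of llama_get_embeddings_ith: the embedding for the i-th output.
    return all_token_embeddings[i]

def get_embeddings_seq(pooled_by_seq: dict, seq_id: int) -> np.ndarray:
    # Analogue of llama_get_embeddings_seq: the pooled embedding for one
    # sequence, which is only meaningful when a pooling mode is active.
    return pooled_by_seq[seq_id]

tokens = np.array([[1.0, 0.0],
                   [0.0, 1.0]])          # 2 output tokens, 2-dim embeddings
pooled = {0: tokens.mean(axis=0)}        # sequence 0, mean-pooled

print(get_embeddings(tokens).shape)      # (2, 2)
print(get_embeddings_ith(tokens, 1))     # [0. 1.]
print(get_embeddings_seq(pooled, 0))     # [0.5 0.5]
```

This is why the embedder needs to know the model type: a per-token result wants the first two accessors, while a pooled result wants the per-sequence one.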
Random things to consider, in no particular order:

- What should be returned: a `float[][]` or a `Span<float>`?
- When tokens are added to the batch, should the `logits` flag be set for all tokens, none of the tokens, or just the last token?
- Does the answer change with different model types (generative vs. embedding vs. generative with pooling)?