The `LLamaEmbedder` does not currently expose all of the embedding capabilities of llama.cpp.

Currently:
- It only ever returns a single vector, as a `float[]`.
- Some models return a single vector representing the entire input sequence (embedding models), while others produce one embedding vector per token (generative models).
- Some models support setting a pooling mode, which selects a method for combining many per-token embeddings into a single embedding. This is probably only compatible with certain models.
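To make the pooling distinction concrete, here is a minimal sketch of how a pooling mode collapses per-token embeddings into one sequence embedding. It is written in Python/NumPy for brevity since the C# return shape is still an open question, and the function names are illustrative, not part of the LLamaSharp or llama.cpp API:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (tokens, dim) matrix into a single (dim,) vector."""
    return token_embeddings.mean(axis=0)

def last_token_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Use the final token's embedding as the sequence embedding."""
    return token_embeddings[-1]

# A generative model might produce one embedding per token:
per_token = np.array([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0]])  # 3 tokens, 2-dim embeddings

print(mean_pool(per_token))        # [3. 4.]
print(last_token_pool(per_token))  # [5. 6.]
```

An embedding model effectively hands back the already-pooled single vector; a generative model hands back the whole `(tokens, dim)` matrix, and it is then up to the caller (or a configured pooling mode) to reduce it.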
Improvements:

- Indicate which type of model an embedder was created with.
- Indicate how many results there are.
- Use `llama_get_embeddings`, `llama_get_embeddings_ith` and `llama_get_embeddings_seq` as appropriate to retrieve the correct embeddings.
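The split between those three accessors can be sketched as follows. These are Python stand-ins operating on in-memory arrays, purely to illustrate the intended semantics; in llama.cpp the real functions read from the context after decoding, and the shapes here are assumptions for illustration:

```python
import numpy as np

def get_embeddings(all_token_embeddings: np.ndarray) -> np.ndarray:
    # Analogue of llama_get_embeddings: all output embeddings for the batch,
    # one row per token that had output enabled.
    return all_token_embeddings

def get_embeddings_ith(all_token_embeddings: np.ndarray, i: int) -> np.ndarray:
    # Analogue of llama_get_embeddings_ith: the embedding for the i-th output.
    return all_token_embeddings[i]

def get_embeddings_seq(pooled_by_seq: dict, seq_id: int) -> np.ndarray:
    # Analogue of llama_get_embeddings_seq: the pooled embedding for one
    # sequence, which is only meaningful when a pooling mode is active.
    return pooled_by_seq[seq_id]

tokens = np.array([[1.0, 0.0],
                   [0.0, 1.0]])          # 2 output tokens, 2-dim embeddings
pooled = {0: tokens.mean(axis=0)}        # sequence 0, mean-pooled

print(get_embeddings(tokens).shape)      # (2, 2)
print(get_embeddings_ith(tokens, 1))     # [0. 1.]
print(get_embeddings_seq(pooled, 0))     # [0.5 0.5]
```

This is why the embedder needs to know the model type: a per-token result wants the first two accessors, while a pooled result wants the per-sequence one.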
Random things to consider, in no particular order:

- What should be returned: a `float[][]` or a `Span<float>`?
- When tokens are added to the batch, should the `logits` flag be set for all tokens, none of the tokens, or just the last token?
- Does the answer change with different model types (generative vs. embedding vs. generative with pooling)?