SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp

Embeddings: batch size vs context length #963

Open · dluc opened this issue 3 weeks ago

dluc commented 3 weeks ago

Description

I’m using two models, openchat_3.5.Q5_K_M.gguf to generate text and nomic-embed-text-v1.5.Q8_0.gguf to calculate text embeddings.

When I input text that exceeds 512 tokens - in my case, it’s 979 tokens - embedding generation throws this exception:

```
System.ArgumentException: Input contains more tokens than configured batch size (Parameter 'batch')
   at LLama.LLamaContext.Decode(LLamaBatch batch)
```

However, the model documentation specifies a context length of 8192 tokens.

Questions:

1. Why does embedding generation fail on a 979-token input when the model's documented context length is 8192 tokens?
2. Does LlamaSharp support loading two different models at the same time?

martindevans commented 3 weeks ago

Batch size is the maximum number of tokens that can be processed at once; it's separate from the context size.

For text generation you can feed the model multiple batches before generating a response.

For embedding, I think right now the embedder requires that all of your text is sent in one batch. So you'll need a larger batch size for embeddings (see the sketch at the end of this comment).

> does LlamaSharp support loading two different models?

You're already loading openchat_3.5.Q5_K_M.gguf and nomic-embed-text-v1.5.Q8_0.gguf, which is two different models. So I'm not quite sure what you're asking, sorry.
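For reference, here's a rough sketch of loading both models side by side, with a larger batch size for the embedder. Paths and sizes are illustrative, and the `Embeddings` flag is an assumption whose exact name may vary between releases:

```csharp
using LLama;
using LLama.Common;

// Text-generation model: the prompt can be fed to the model in several batches,
// so the default BatchSize (512) is usually fine here.
var genParams = new ModelParams("openchat_3.5.Q5_K_M.gguf")
{
    ContextSize = 8192
};
using var genWeights = LLamaWeights.LoadFromFile(genParams);
var executor = new StatelessExecutor(genWeights, genParams);

// Embedding model: everything is decoded in one batch,
// so BatchSize must cover the longest text you will ever embed.
var embParams = new ModelParams("nomic-embed-text-v1.5.Q8_0.gguf")
{
    ContextSize = 8192,
    BatchSize = 2048,      // default is 512; a 979-token input needs more
    Embeddings = true      // assumption: flag name may differ between releases
};
using var embWeights = LLamaWeights.LoadFromFile(embParams);
using var embedder = new LLamaEmbedder(embWeights, embParams);
```

Each model gets its own `ModelParams`, so the generation side can keep the default batch size while the embedder gets a larger one.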

dluc commented 3 weeks ago

Here's our code, where the exception is thrown:

```csharp
public async Task<Embedding> GenerateEmbeddingAsync(string text)
{
    if (this._log.IsEnabled(LogLevel.Trace))
    {
        this._log.LogTrace("Generating embedding, input token size: {0}", this._textTokenizer.CountTokens(text));
    }

    // Throws `System.ArgumentException`
    var embeddings = await this._embedder.GetEmbeddings(text);

    return new Embedding(embeddings[0]);
}
```

> Batch size is the maximum number of tokens that can be processed at once; it's separate from the context size.

The string is 979 tokens, and I would expect GetEmbeddings to generate one embedding (one array with a single element to be precise).

Is there something to change in the method above?

martindevans commented 3 weeks ago

> The string is 979 tokens, and I would expect GetEmbeddings to generate one embedding (one array with a single element to be precise).

Sounds right.

Since everything must be processed in one batch for embeddings, your batch size must be set to 979 or greater.
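The method itself doesn't need to change; the fix is in the embedder's configuration. If you want a clearer failure mode, here's a rough sketch of an optional guard (the `_maxBatchSize` field is hypothetical and would mirror the `BatchSize` passed to `ModelParams`; `GetEmbeddings` is assumed to return a list of vectors, as in your snippet):

```csharp
public async Task<Embedding> GenerateEmbeddingAsync(string text)
{
    var tokenCount = this._textTokenizer.CountTokens(text);

    // Hypothetical guard: fail early with a clear message instead of letting Decode throw.
    if (tokenCount > this._maxBatchSize)
    {
        throw new ArgumentException(
            $"Input is {tokenCount} tokens but the embedder batch size is {this._maxBatchSize}.");
    }

    var embeddings = await this._embedder.GetEmbeddings(text);
    return new Embedding(embeddings[0]);
}
```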

dluc commented 3 weeks ago

Looking at the examples, there's no code about the batch size - how is the batch size set?

e.g. https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Examples/Examples/GetEmbeddings.cs

dluc commented 3 weeks ago

Trying to run https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Examples/Examples/GetEmbeddings.cs throws the same exception:

```
Unhandled exception. System.ArgumentException: Input contains more tokens than configured batch size (Parameter 'batch')
   at LLama.LLamaContext.Decode(LLamaBatch batch) in LLama/LLamaContext.cs:line 403
   at LLama.LLamaContext.<>c__DisplayClass42_0.b__0() in LLama/LLamaContext.cs:line 414
   at System.Threading.Tasks.Task`1.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__281_0(Object obj)
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at LLama.LLamaEmbedder.GetEmbeddings(String input, CancellationToken cancellationToken) in LLama/LLamaEmbedder.cs:line 88
   at LLama.Examples.Examples.GetEmbeddings.Run() in LLama.Examples/Examples/GetEmbeddings.cs:line 42
   at ExampleRunner.Run() in LLama.Examples/ExampleRunner.cs:line 57
   at Program.<Main>$(String[] args) in LLama.Examples/Program.cs:line 40
   at Program.<Main>(String[] args)
```

martindevans commented 3 weeks ago

Batch size is set in the ModelParams, see here. If you don't set it, the default is 512, which is large enough for the examples but not your use case.
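Concretely, the setting lives on the `ModelParams` passed to the embedder. A minimal sketch (values illustrative; the `Embeddings` flag is an assumption whose name may vary between releases):

```csharp
var @params = new ModelParams("nomic-embed-text-v1.5.Q8_0.gguf")
{
    ContextSize = 8192,   // what the model can attend to
    BatchSize = 2048,     // what one Decode call can accept; defaults to 512
    Embeddings = true     // assumption, as above
};
using var weights = LLamaWeights.LoadFromFile(@params);
using var embedder = new LLamaEmbedder(weights, @params);

// A 979-token input now fits in a single batch.
var longText = "placeholder for the 979-token document";
var embeddings = await embedder.GetEmbeddings(longText);
```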

dluc commented 2 weeks ago

Wouldn't it be easier if the batch size were automatically set to match the maximum token count? Is there any benefit to having a lower default?

For instance, if a model supports up to 8192 tokens per embedding, automatically setting batch size to 8192 would replicate the behavior seen in HF, OpenAI, etc.
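A rough sketch of what that suggestion would look like from the caller's side today (assuming `ContextSize` is the nullable property on `ModelParams`):

```csharp
var @params = new ModelParams("nomic-embed-text-v1.5.Q8_0.gguf")
{
    ContextSize = 8192
};

// Sketch of the suggestion: let one Decode call cover the whole context.
@params.BatchSize = @params.ContextSize ?? 8192u;
```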

martindevans commented 2 weeks ago

A large batch size is costly (it takes extra memory), and it's generally not worth making it very large, since (for text generation) after the initial prompt you'll be submitting just a single token at a time.

For embedding it's different: you must make the batch size as large as the largest amount of text you'll ever need an embedding for, since it can't be split across multiple batches (currently).
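When raising the batch size that far isn't acceptable memory-wise, a common application-side workaround is to split long text into chunks that fit the configured batch size and embed each chunk separately; note this yields one vector per chunk rather than a single whole-document embedding. A rough sketch, using a naive whitespace splitter as a stand-in for a tokenizer-aware one, and assuming `GetEmbeddings` returns a list of vectors as in the snippet earlier in this thread:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using LLama;

public static class ChunkedEmbeddings
{
    // Naive whitespace chunker, a stand-in for a tokenizer-aware splitter.
    private static IEnumerable<string> SplitIntoChunks(string text, int maxWordsPerChunk)
    {
        var words = text.Split(new[] { ' ', '\n', '\r', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        for (var i = 0; i < words.Length; i += maxWordsPerChunk)
            yield return string.Join(" ", words.Skip(i).Take(maxWordsPerChunk));
    }

    // Embed each chunk separately so no single Decode call exceeds the configured batch size.
    public static async Task<List<float[]>> EmbedLongTextAsync(
        LLamaEmbedder embedder, string text, int maxWordsPerChunk = 300)
    {
        var results = new List<float[]>();
        foreach (var chunk in SplitIntoChunks(text, maxWordsPerChunk))
        {
            var embeddings = await embedder.GetEmbeddings(chunk);
            results.Add(embeddings[0]);
        }
        return results;
    }
}
```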