dluc opened 3 weeks ago
Batch size is the maximum number of tokens that can be processed at once; it's separate from the context size.
For text generation you can feed the model multiple batches before generating a response.
For embedding, I think right now the embedder requires that all of your text is sent in one batch. So you'll need a larger batch for embeddings.
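To illustrate the difference, here is a conceptual sketch (plain C#, not LLamaSharp code; the 979-token input and 512 batch size are just the numbers from this thread):

```csharp
using System;
using System.Linq;

// Conceptual sketch only: a generation prompt longer than the batch size can be
// decoded in several chunks before sampling starts, while an embedding call
// currently has to fit the whole input into a single batch.
const int batchSize = 512;
int[] promptTokens = Enumerable.Range(0, 979).ToArray(); // stand-in for a tokenized 979-token input

// Text generation: the prompt is split into batch-sized chunks.
for (int i = 0; i < promptTokens.Length; i += batchSize)
{
    int chunkLength = Math.Min(batchSize, promptTokens.Length - i);
    Console.WriteLine($"decode chunk of {chunkLength} tokens");
}

// Embedding: the same 979 tokens must go through in one batch, so a
// 512-token batch size is not enough and the call fails.
Console.WriteLine(promptTokens.Length <= batchSize
    ? "embedding input fits in one batch"
    : "embedding input exceeds the configured batch size");
```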
> Does LLamaSharp support loading two different models?
You're already loading openchat_3.5.Q5_K_M.gguf and nomic-embed-text-v1.5.Q8_0.gguf, which are two different models. So I'm not quite sure what you're asking, sorry.
Here's our code, where the exception is thrown:
public async Task<Embedding> GenerateEmbeddingAsync(string text)
{
    if (this._log.IsEnabled(LogLevel.Trace))
    {
        this._log.LogTrace("Generating embedding, input token size: {0}", this._textTokenizer.CountTokens(text));
    }

    // Throws `System.ArgumentException`
    var embeddings = await this._embedder.GetEmbeddings(text);
    return new Embedding(embeddings[0]);
}
> Batch size is the maximum number of tokens that can be processed at once; it's separate from the context size.
The string is 979 tokens, and I would expect GetEmbeddings to generate one embedding (one array with a single element, to be precise).
Is there something to change in the method above?
> The string is 979 tokens, and I would expect GetEmbeddings to generate one embedding (one array with a single element, to be precise).
Sounds right.
Since you must process everything for embeddings in one batch, your batch size must be set to 979 or greater.
Looking at the examples, e.g. https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Examples/Examples/GetEmbeddings.cs, there's no code setting the batch size. How is the batch size set?
Trying to run https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Examples/Examples/GetEmbeddings.cs throws the same exception:
Unhandled exception. System.ArgumentException: Input contains more tokens than configured batch size (Parameter 'batch')
   at LLama.LLamaContext.Decode(LLamaBatch batch) in LLama/LLamaContext.cs:line 403
   at LLama.LLamaContext.<>c__DisplayClass42_0.b__0() in LLama/LLamaContext.cs:line 414
   at System.Threading.Tasks.Task`1.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__281_0(Object obj)
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at LLama.LLamaEmbedder.GetEmbeddings(String input, CancellationToken cancellationToken) in LLama/LLamaEmbedder.cs:line 88
   at LLama.Examples.Examples.GetEmbeddings.Run() in LLama.Examples/Examples/GetEmbeddings.cs:line 42
   at ExampleRunner.Run() in LLama.Examples/ExampleRunner.cs:line 57
   at Program.<Main>$(String[] args) in LLama.Examples/Program.cs:line 40
   at Program.<Main>(String[] args)
Batch size is set in the ModelParams, see here. If you don't set it, the default is 512, which is large enough for the examples but not for your use case.
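For example, a minimal sketch of raising the batch size for the embedding model; this assumes a recent LLamaSharp release, and the exact property names (e.g. Embeddings vs. EmbeddingMode) may differ between versions:

```csharp
using LLama;
using LLama.Common;

// Size the batch for the largest text you will ever embed in a single call.
var embedParams = new ModelParams("nomic-embed-text-v1.5.Q8_0.gguf")
{
    ContextSize = 8192, // the context length the model supports
    BatchSize = 8192,   // must be >= the token count of the largest embedding input
    Embeddings = true   // some versions call this EmbeddingMode
};

using var weights = LLamaWeights.LoadFromFile(embedParams);
using var embedder = new LLamaEmbedder(weights, embedParams);

// A 979-token input now fits inside a single batch.
var embeddings = await embedder.GetEmbeddings("…the 979-token text…");
```

With BatchSize raised like this, the GenerateEmbeddingAsync method shown earlier should work unchanged.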
Wouldn't it be easier if batch size was automatically set to match max tokens? Is there any benefit from having a lower default?
For instance, if a model supports up to 8192 tokens per embedding, automatically setting batch size to 8192 would replicate the behavior seen in HF, OpenAI, etc.
A large batch size is costly (it takes extra memory). It's generally not worth making it very large, since (for text generation) after the initial prompt you'll be submitting just a single token at a time. For embedding it's different: you must make the batch size as large as the largest amount of data you'll ever need an embedding for, since it can't be split across multiple batches (currently).
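As a contrast, with the same caveat about property names as in the sketch above, the generation model from this issue can keep a small batch, because after the prompt it only submits one token per batch:

```csharp
using LLama.Common;

var chatParams = new ModelParams("openchat_3.5.Q5_K_M.gguf")
{
    ContextSize = 8192, // a large context is fine
    BatchSize = 512     // the 979-token prompt is just decoded in two chunks,
                        // then generation continues one token per batch
};
```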
Description
I’m using two models: openchat_3.5.Q5_K_M.gguf to generate text and nomic-embed-text-v1.5.Q8_0.gguf to calculate text embeddings. When I input text that exceeds 512 tokens (in my case, 979 tokens), embedding generation throws this exception:

System.ArgumentException: Input contains more tokens than configured batch size (Parameter 'batch')
However, the model documentation specifies a context length of 8192 tokens.
Questions: