giladgd opened this issue 1 month ago
Thank you @0cc4m!
@0cc4m I've tested the latest release, and decoding now works well with more than one context 🚀 However, I've encountered another issue: decoding multiple contexts in parallel from different threads crashes the process (this happens only with the Vulkan backend). Is it possible to make decoding thread-safe in Vulkan?
This issue can be replicated with this code:
```cpp
#include <cstdio>
#include <ctime>
#include <thread>
#include <vector>

#include "llama.h"
#include "common.h"

void embed_text(const char * text, llama_model * model, llama_context * context) {
    std::vector<llama_token> tokens = llama_tokenize(model, text, false, false);
    auto n_tokens = tokens.size();
    auto batch = llama_batch_init(n_tokens, 0, 1);
    for (size_t i = 0; i < n_tokens; i++) {
        llama_batch_add(batch, tokens[i], i, { 0 }, false);
    }
    batch.logits[batch.n_tokens - 1] = true;

    llama_decode(context, batch);
    llama_synchronize(context);

    const int n_embd = llama_n_embd(model);
    const float * embeddings = llama_get_embeddings_seq(context, 0);
    if (embeddings == NULL) {
        embeddings = llama_get_embeddings_ith(context, tokens.size() - 1);
        if (embeddings == NULL) {
            printf("Failed to get embedding");
        }
    }

    if (embeddings != NULL) {
        printf("Embeddings: ");
        for (int i = 0; i < n_embd; ++i) {
            printf("%f ", embeddings[i]);
        }
    }

    llama_batch_free(batch);
}

int main() {
    llama_backend_init();

    auto model_params = llama_model_default_params();
    model_params.n_gpu_layers = 33;

    auto model_path = "/home/user/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf";
    auto model = llama_load_model_from_file(model_path, model_params);

    auto text1 = "Hi there";
    auto text2 = "Hello there";

    auto context_params = llama_context_default_params();
    context_params.embeddings = true;
    context_params.seed = time(NULL);
    context_params.n_ctx = 4096;
    context_params.n_threads = 6;
    context_params.n_threads_batch = context_params.n_threads;
    context_params.n_batch = 512;
    context_params.n_ubatch = 512;

    auto context1 = llama_new_context_with_model(model, context_params);
    auto context2 = llama_new_context_with_model(model, context_params);

    // one of these threads causes the process to crash
    std::thread thread1(embed_text, text1, model, context1);
    std::thread thread2(embed_text, text2, model, context2);

    thread1.join();
    thread2.join();

    llama_free(context1);
    llama_free(context2);
    llama_free_model(model);
    llama_backend_free();

    return 0;
}
```
### What happened?
There seems to be some kind of memory overlap between contexts created with the same model with the Vulkan backend when the contexts are loaded at the same time. Freeing the first context before creating the second one works as expected, though. Other backends support having multiple contexts at the same time, so I think Vulkan should support it, too.
The code above crashes with `signal SIGSEGV, Segmentation fault`. Using `gdb` shows this stack trace:

I've used this model in this code.
### Name and Version

I tested the above code with release `b3012`.

### What operating system are you seeing the problem on?
Linux
### Relevant log output