SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License

A few observations from working with LLamaSharp & KernelMemory #923

Open aropb opened 2 months ago

aropb commented 2 months ago

Description

  1. Wherever possible, it is better not to create a context (it increases memory usage). For example, you can use `weights.Tokenize()` instead of `context.Tokenize()`.

  2. A multithreading problem. It occurs when embeddings are being created at the same time as a question is asked of the model (via the executor). This is a big problem, and I don't think it is specific to KernelMemory. I think some native API calls need to be protected from simultaneous invocation.
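To illustrate the first point, here is a minimal sketch of tokenizing against the loaded weights rather than a context, assuming a LLamaSharp build where `LLamaWeights` exposes `Tokenize` directly (the model path and prompt are placeholders):

```csharp
using System;
using System.Text;
using LLama;
using LLama.Common;

var parameters = new ModelParams("model.gguf"); // placeholder path

// Load the weights once; no LLamaContext (and therefore no KV-cache
// allocation) is needed just to tokenize text.
using var weights = LLamaWeights.LoadFromFile(parameters);

var tokens = weights.Tokenize("Hello, world!", true, false, Encoding.UTF8);
Console.WriteLine($"Token count: {tokens.Length}");
```

Creating a `LLamaContext` solely for tokenization would allocate the full context buffers, which is exactly the overhead the point above is about.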

Error:

```
CUDA error: the function failed to launch on the GPU
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:1889
  cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), CUDA_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), CUDA_R_16F, nb11/nb10, beta, (void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
CUDA error: operation not permitted when stream is capturing
  current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:535
  cudaDeviceSynchronize()
SafeLLamaContextHandle.llama_new_context_with_model
```

Or here is the error that occurs when the calls happen at the same time:

```csharp
...
LLamaStatelessExecutor executor = new(Weights, ModelParams);
...
await foreach (string text in executor.InferAsync(prompt, DefaultInferenceParams, cancellationToken))
{
    sb.Append(text);
}
...
```

```
CUDA error: operation failed due to a previous error during capture
  current device: 0, in function ggml_backend_cuda_graph_compute at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:2632
  cudaStreamEndCapture(cuda_ctx->stream(), &cuda_ctx->cuda_graph->graph)
SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode => <>c__DisplayClass17_0.b__0
```
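One common workaround for the concurrency problem (a sketch, not an official LLamaSharp API) is to serialize all GPU-bound native calls through a single shared `SemaphoreSlim`, so that embedding generation and executor inference never enter `llama_decode` at the same time. `LlamaGate` and `InferSerializedAsync` are hypothetical names introduced for this example:

```csharp
using System.Text;
using LLama;
using LLama.Common;

// One gate shared by every component that calls into the native llama.cpp layer.
static class LlamaGate
{
    public static readonly SemaphoreSlim Lock = new(1, 1);
}

static async Task<string> InferSerializedAsync(
    LLamaStatelessExecutor executor,
    string prompt,
    InferenceParams inferenceParams,
    CancellationToken cancellationToken)
{
    var sb = new StringBuilder();
    await LlamaGate.Lock.WaitAsync(cancellationToken);
    try
    {
        // While we hold the gate, no concurrent embedding or inference
        // call can reach the CUDA backend.
        await foreach (string text in executor.InferAsync(prompt, inferenceParams, cancellationToken))
        {
            sb.Append(text);
        }
    }
    finally
    {
        LlamaGate.Lock.Release();
    }
    return sb.ToString();
}
```

The same gate would have to wrap the embedder's calls as well; guarding only one of the two paths still allows the simultaneous access that triggers the stream-capture errors above.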

How can all these problems be solved?