Wherever possible, it is better not to create a Context, since each context increases memory usage. For example, you can use `weights.Tokenize()` instead of `context.Tokenize()`.
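For instance, tokenization can go through the weights directly, so no LLamaContext (and the memory for its state) has to be allocated just for that. A minimal sketch, assuming the `LLamaWeights.Tokenize` overload shown below (the model path is a placeholder; check the exact signature in your LLamaSharp version):

```csharp
using System.Text;
using LLama;
using LLama.Common;

var parameters = new ModelParams("model.gguf"); // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);

// Tokenize through the weights themselves; no LLamaContext is created.
var tokens = weights.Tokenize("Hello, world!", true, false, Encoding.UTF8);

// By contrast, context.Tokenize(...) would first require
// weights.CreateContext(parameters), which allocates the full context state.
```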
The multithreading problem: it occurs when embeddings are being created at the same time as a question is being asked of the model (Executor). This is a big problem, and I think it does not apply to KM. In my view, some native API calls need to be protected from simultaneous invocation (a possible workaround is sketched after the error logs below).
Error:
```
Error: CUDA error: the function failed to launch on the GPU
current device: 0, in function ggml_cuda_mul_mat_batched_cublas at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:1889
cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), CUDA_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), CUDA_R_16F, nb11/nb10, beta, (void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
CUDA error: operation not permitted when stream is capturing
current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:535
cudaDeviceSynchronize()
SafeLLamaContextHandle.llama_new_context_with_model
```
Or here is the kind of error that appears when the calls happen concurrently:
```csharp
...
LLamaStatelessExecutor executor = new(Weights, ModelParams);
...
await foreach (string text in executor.InferAsync(prompt, DefaultInferenceParams, cancellationToken))
{
    sb.Append(text);
}
...
```
```
CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode => <>c__DisplayClass17_0.b__0
current device: 0, in function ggml_backend_cuda_graph_compute at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:2632
SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode => <>c__DisplayClass17_0.b__0
cudaStreamEndCapture(cuda_ctx->stream(), &cuda_ctx->cuda_graph->graph)
SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode => <>c__DisplayClass17_0.b__0
```
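Until the native calls are protected inside the library, a possible application-side workaround is to funnel every call into the native backend through a single gate, so inference and embedding generation never run concurrently. A minimal sketch, using the same `LLamaStatelessExecutor` as above; `SerializedExecutor` and `Gate` are hypothetical names, not LLamaSharp API:

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using LLama;
using LLama.Common;

// Hypothetical wrapper (not LLamaSharp API): serializes all calls into the
// native backend so embedding and inference never run at the same time.
public sealed class SerializedExecutor
{
    // One process-wide gate; only one native call sequence runs at a time.
    private static readonly SemaphoreSlim Gate = new(1, 1);

    private readonly LLamaStatelessExecutor _executor;

    public SerializedExecutor(LLamaWeights weights, ModelParams modelParams)
    {
        _executor = new LLamaStatelessExecutor(weights, modelParams);
    }

    public async IAsyncEnumerable<string> InferAsync(
        string prompt,
        InferenceParams inferenceParams,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        await Gate.WaitAsync(cancellationToken);
        try
        {
            // Keep the whole streaming loop inside the gate: llama_decode is
            // invoked on every iteration, not only for the first token.
            await foreach (var text in _executor.InferAsync(prompt, inferenceParams, cancellationToken))
                yield return text;
        }
        finally
        {
            Gate.Release();
        }
    }
}
```

Any code that creates embeddings (e.g. through `LLamaEmbedder`) would have to await the same gate; otherwise the two paths can still collide on the CUDA stream.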
How can all these problems be solved?