SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp

Using CUDA when both CPU and Cuda12 back-ends are present. #456

vvdb-architecture opened this issue 9 months ago

vvdb-architecture commented 9 months ago

I'm using Kernel Memory with LLamaSharp. Despite having an RTX 3080 and the latest CUDA drivers installed, CUDA is not used.

Not sure if this is a bug or I'm missing something, so here's a question instead:

The LlamaSharp.csproj contains

 <PackageReference Include="LLamaSharp.Backend.Cpu"/>
 <PackageReference Include="LLamaSharp.Backend.Cuda12"/>

I found out that if both the Cpu and Cuda12 back-ends are referenced, only the CPU is used, even though the CUDA DLL is loaded. Interestingly, the logs do say that the CUDA back-end is loaded, but CUDA is never actually used:

[LLamaSharp Native] [Info] Detected OS Platform: WINDOWS
[LLamaSharp Native] [Info] Detected cuda major version 12.
[LLamaSharp Native] [Info] ./runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from D:\Source\km\Models\llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2)
...

If I remove the reference to LLamaSharp.Backend.Cpu, the CUDA back-end is actually used. The logs show:

[LLamaSharp Native] [Info] Detected OS Platform: WINDOWS
[LLamaSharp Native] [Info] Detected cuda major version 12.
[LLamaSharp Native] [Info] ./runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from D:\Source\km\Models\llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2)
...

I've reported this to the Kernel Memory project, but was advised to report it here.

vvdb-architecture commented 9 months ago

It seems that the CPU back-end and the CUDA back-end can't be installed at the same time.

If this is by design, the issue can be closed, but since I don't know whether it is, I'll leave it open for others to comment.

martindevans commented 9 months ago

Originally they weren't meant to be installed together, since it was then unclear which binaries should be used. However, we now have runtime detection, which should probe your system and load the best binaries possible.

In your case that looks like it's working, since the logs say:

[LLamaSharp Native] [Info] ./runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.

But then for some reason it isn't actually using your GPU! I think this probably is a real bug.

adammikulis commented 8 months ago

Hi @vvdb-architecture, if it's not using your GPU even with the CUDA backend, do you have the GpuLayerCount in your ModelParams set to -1, or to a value in 1-33? If it's not set, or is set to 0, it will default to CPU-only, even with just the CUDA backend installed. Sorry if I've misunderstood your problem, but this may help other users who hit the same issue:

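For reference, a minimal sketch of that setting (the model path is taken from the logs earlier in this thread, 33 layers matches a 7B model, and exact API shapes may differ between LLamaSharp versions):

using LLama;
using LLama.Common;

// Offload all 33 layers of the 7B model to the GPU. A value of 0 (or
// leaving GpuLayerCount unset) keeps everything on the CPU, even with
// the CUDA backend installed.
var parameters = new ModelParams(@"D:\Source\km\Models\llama-2-7b-chat.Q5_K_M.gguf")
{
    GpuLayerCount = 33
};
using var weights = LLamaWeights.LoadFromFile(parameters);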
vvdb-architecture commented 8 months ago

> ...do you have the GpuLayerCount in your ModelParams set to -1, or to a value in 1-33?

It's set to 33.

vvdb-architecture commented 8 months ago

I think this issue can be closed, since the docs explicitly state you can only install one of the back-ends.

AsakusaRinne commented 7 months ago

@vvdb-architecture Sorry for seeing this issue so late. It's my responsibility to resolve this problem, since I wrote the main part of the dynamic loading of the native library. #588 is a duplicate of this issue.

> since the docs explicitly state you can only install one of the back-ends

Yes, but the documentation has been outdated for a long time. It still describes v0.5.0, while we are already proceeding to v0.11.0. I stated that in the docs because dynamic loading was not supported in v0.5.0.

LLamaSharp is expected to work with multiple backend packages in the current version, so I'll re-open this issue and dig into it. Thank you for the reminder in #589!

clarinevong commented 3 days ago

Hello, is there any news on this issue?

I'm encountering a similar issue. I have installed both LLamaSharp.Backend.Cpu and LLamaSharp.Backend.Cuda12.Windows (version 0.18.0). Following the README, I added the following line to show which native library file is loaded:

NativeLibraryConfig.Instance.WithLogCallback(delegate (LLamaLogLevel level, string message) { Console.Write($"{level}: {message}"); });
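For context, a minimal sketch of how that callback fits together with model loading (the model path is hypothetical; the configuration has to run before the first model load, because the native library is selected once per process):

using LLama;
using LLama.Common;
using LLama.Native;

// Register the log callback before anything touches the native library;
// once a backend has been loaded, the selection cannot change.
NativeLibraryConfig.Instance.WithLogCallback(
    delegate (LLamaLogLevel level, string message) { Console.Write($"{level}: {message}"); });

// Hypothetical model path; GpuLayerCount = 0 keeps all layers on the CPU.
var parameters = new ModelParams("model.gguf") { GpuLayerCount = 0 };
using var weights = LLamaWeights.LoadFromFile(parameters);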

When I load a model on the CPU with GpuLayerCount equal to 0, the CUDA backend is loaded:

Info: NativeLibraryConfig Description:
- LibraryName: LLama
- Path: ''
- PreferCuda: True
- PreferVulkan: True
- PreferredAvxLevel: AVX2
- AllowFallback: True
- SkipCheck: False
- SearchDirectories and Priorities: { ./ }
Debug: Got relative library path 'runtimes/win-x64/native/cuda12/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: True, UseVulkan: False, AvxLevel: None), trying to load it...
Debug: Found full path file './runtimes/win-x64/native/cuda12/llama.dll' for relative path 'runtimes/win-x64/native/cuda12/llama.dll'
Info: Successfully loaded './runtimes/win-x64/native/cuda12/llama.dll'

When I only install LLamaSharp.Backend.Cpu, the correct native library file is loaded:

Debug: Got relative library path 'runtimes/win-x64/native/cuda12/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: True, UseVulkan: False, AvxLevel: None), trying to load it...
Debug: Found full path file 'runtimes/win-x64/native/cuda12/llama.dll' for relative path 'runtimes/win-x64/native/cuda12/llama.dll'
Info: Failed Loading 'runtimes/win-x64/native/cuda12/llama.dll'
Debug: Got relative library path 'runtimes/win-x64/native/vulkan/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: False, UseVulkan: True, AvxLevel: None), trying to load it...
Debug: Found full path file 'runtimes/win-x64/native/vulkan/llama.dll' for relative path 'runtimes/win-x64/native/vulkan/llama.dll'
Info: Failed Loading 'runtimes/win-x64/native/vulkan/llama.dll'
Debug: Got relative library path 'runtimes/win-x64/native/avx2/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: False, UseVulkan: False, AvxLevel: Avx2), trying to load it...
Debug: Found full path file './runtimes/win-x64/native/avx2/llama.dll' for relative path 'runtimes/win-x64/native/avx2/llama.dll'
Info: Successfully loaded './runtimes/win-x64/native/avx2/llama.dll'
martindevans commented 3 days ago

> When I load a model on the CPU with GpuLayerCount equal to 0, the CUDA backend is loaded

That's how it's meant to work: if the CUDA binaries are available and compatible with your system, they will be used unless you explicitly disable CUDA at load time with NativeLibraryConfig.All.WithCuda(false).
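A sketch of that opt-out (the call is the one named above; it has to run before the first model load, since the backend is chosen once per process):

using LLama.Native;

// Skip the CUDA binaries during backend selection so the CPU (or
// another available) backend is loaded instead, regardless of which
// backend packages are installed.
NativeLibraryConfig.All.WithCuda(false);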

Changing GpuLayerCount changes how many layers are sent to the GPU, but does not change which backend is used. Setting it to zero should be equivalent to not using CUDA at all (although possibly slightly slower than the pure CPU binaries).