Apanoff opened this issue 3 days ago
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 4202 (9f912511)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
Operating systems
Linux, Ubuntu
Which llama.cpp modules do you know to be affected?
No response
Problem description & steps to reproduce
For some reason the KV cache loads only into CPU RAM, not into GPU VRAM. llama.cpp was compiled with CUDA support.
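A minimal sketch of a build and run that would produce this behavior, assuming the standard llama.cpp CMake options and llama-cli flags (the model path and layer count are placeholders; GGML_CUDA_FORCE_CUBLAS matches the log above):

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=ON   # CUDA build, force cuBLAS as in the log
cmake --build build --config Release
./build/bin/llama-cli -m model.gguf -ngl 1                  # offload only 1 of the model's layers to the GPU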
First Bad Commit
No response
Relevant log output
llm_load_tensors: offloaded 1/89 layers to GPU

You only offloaded a single layer, so only the KV cache for that layer was offloaded to the GPU, which is what happened, as expected:

llama_kv_cache_init:   CPU KV buffer size = 739.50 MiB
llama_kv_cache_init: CUDA0 KV buffer size =   8.50 MiB
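To place the entire KV cache in VRAM, offload all layers. A hedged example, assuming the model and its cache fit in the 4090's 24 GB (an -ngl value above the layer count simply offloads everything):

./build/bin/llama-cli -m model.gguf -ngl 99   # 99 > 89, so all 89 layers and their KV cache buffers go to the GPU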