containers / ai-lab-recipes

Examples for building and running LLM services and applications locally with Podman

llama_cpp_python server > 0.2.79 breaks the vulkan image #742

Open lstocchi opened 1 month ago

lstocchi commented 1 month ago

When you build the vulkan image using llama_cpp_python 0.2.79, you can see that it is actually able to detect and use the GPU, because in the logs you find:

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M2 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =  0.30 MiB
warning: failed to mlock 73732096-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:    CPU buffer size =  70.31 MiB
llm_load_tensors:  Vulkan0 buffer size = 4095.05 MiB
.................................................................................................
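For anyone trying to reproduce this outside the image, here is a minimal sketch that surfaces the same llama.cpp load logs shown above. The model path is hypothetical; `n_gpu_layers=-1` asks llama_cpp_python to offload every layer, and `verbose=True` prints the load logs where the `offloaded 33/33 layers to GPU` vs. CPU-only buffer lines show up:

```python
# Reproduction sketch: load a model with llama_cpp_python and watch the
# llama.cpp logs for Vulkan device detection and GPU layer offload.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # hypothetical path; any local GGUF works
    n_gpu_layers=-1,  # request offload of all layers; broken builds silently fall back to CPU
    verbose=True,     # print ggml/llama.cpp load logs, including "Found N Vulkan devices"
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```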

However, starting from 0.2.80 something is broken and GPU detection/usage is skipped entirely. In the logs you only find:

...
llm_load_tensors:    CPU buffer size = 4165.37 MiB
...

I also tested with the latest version, 0.2.87, and it is still broken. We are currently on 0.2.85 -> https://github.com/containers/ai-lab-recipes/blob/main/model_servers/llamacpp_python/src/requirements.txt#L1
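Until this is root-caused, a hedged guard like the following could flag the regression range at server startup. It assumes the `packaging` library is available; the version bounds come purely from the observations above (0.2.80 through at least 0.2.87):

```python
# Startup sanity check: warn when the installed llama_cpp_python falls in the
# range reported to skip Vulkan GPU offload.
import llama_cpp
from packaging.version import Version  # assumption: packaging is installed

installed = Version(llama_cpp.__version__)
if Version("0.2.80") <= installed <= Version("0.2.87"):
    print(
        f"warning: llama_cpp_python {installed} is in the range reported to "
        "skip Vulkan GPU offload (see issue #742); expect CPU-only buffers"
    )
```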