Open · lstocchi opened this issue 1 month ago
When you build the Vulkan image using llama_cpp_python 0.2.79, it is actually able to detect and use the GPU, because in the logs you can find:
```
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M2 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size = 0.30 MiB
warning: failed to mlock 73732096-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: Vulkan0 buffer size = 4095.05 MiB
.................................................................................................
```
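For anyone trying to reproduce the comparison, here is a minimal sketch of the two builds. The flag names are the ones documented by llama-cpp-python for these version ranges (llama.cpp renamed its CMake options from `LLAMA_*` to `GGML_*` around the 0.2.80 timeframe); the exact arguments the image's Containerfile passes may differ:

```bash
# Sketch: build llama-cpp-python with the Vulkan backend enabled.
# Pick the line matching the version range you are building.
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install --no-cache-dir llama-cpp-python==0.2.79  # 0.2.79 and earlier
CMAKE_ARGS="-DGGML_VULKAN=on" pip install --no-cache-dir llama-cpp-python==0.2.87   # 0.2.80 and later
```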
However, starting from 0.2.80+ something is broken and GPU detection/usage is skipped entirely. In the logs you only find:
```
...
llm_load_tensors: CPU buffer size = 4165.37 MiB
...
```
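To check the regression outside the container logs, a minimal sketch (the model path is a placeholder) that requests full GPU offload with verbose logging, so the `ggml_vulkan: Found ... Vulkan devices` / `Vulkan0 buffer size` lines either appear or don't:

```python
from llama_cpp import Llama

# Sketch: with a working Vulkan build, the verbose llama.cpp logs should show
# "offloaded 33/33 layers to GPU" and a Vulkan0 buffer; on the broken 0.2.80+
# builds the model falls back to a CPU-only buffer as in the logs above.
llm = Llama(
    model_path="/models/model.gguf",  # placeholder, adjust to your image
    n_gpu_layers=-1,                  # -1 = offload all layers
    verbose=True,                     # print the backend/buffer log lines
)
print(llm("Hello", max_tokens=8))
```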
I also tested with the latest version, 0.2.87, and it is still broken. We are currently using 0.2.85: https://github.com/containers/ai-lab-recipes/blob/main/model_servers/llamacpp_python/src/requirements.txt#L1
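As a stopgap until this is fixed, pinning the linked requirements file back to the last known-good release would look like this (a sketch; the `[server]` extra is an assumption based on how the model server is launched):

```
llama-cpp-python[server]==0.2.79
```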