containers / podman-desktop-extension-ai-lab-playground-images

Apache License 2.0

fix: revert llama cpp python server to 0.2.79 to enable gpu #44

Closed lstocchi closed 1 month ago

lstocchi commented 1 month ago

What does this PR do?

It just reverts the llama cpp python server to 0.2.79, because that is the last version that actually works fine with Vulkan.
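For context, the revert amounts to a one-line dependency pin. The exact build file in this repo may differ; a hypothetical sketch of the change, assuming the image installs the server via pip:

```shell
# Hypothetical sketch -- the actual build files in this repo may differ.
# The revert pins the server package back to the last release with
# working Vulkan GPU offload:
pip install "llama-cpp-python[server]==0.2.79"
```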

Screenshot / video of UI

N/A

What issues does this PR fix or reference?

It resolves #40

How to test this PR?

  1. Run the latest version of the Vulkan image ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan:62b6f628ed77cf3f1518c32746e2e89d27072f0e and verify that it actually uses the CPU; GPU detection is completely skipped. You can use this command (update the model path):
    podman run --device /dev/dri --mount type=bind,src=/Users/luca/.local/share/containers/podman-desktop/extensions-storage/redhat.ai-lab/models/hf.TheBloke.mistral-7b-instruct-v0.2.Q4_K_M/,target=/models/ -e MODEL_PATH=/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -e GPU_LAYERS=-1 ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan:62b6f628ed77cf3f1518c32746e2e89d27072f0e

    In the logs you should only see

    ...
    llm_load_tensors:    CPU buffer size = 4165.37 MiB
    ...
  2. Build a new image using llama cpp 0.2.79 and run it. Now you should see logs showing that the GPU is being used:
    ggml_vulkan: Found 1 Vulkan devices:
    Vulkan0: Virtio-GPU Venus (Apple M2 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
    llm_load_tensors: ggml ctx size =  0.30 MiB
    warning: failed to mlock 73732096-byte buffer (after previously locking 0 bytes): Cannot allocate memory
    Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
    llm_load_tensors: offloading 32 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 33/33 layers to GPU
    llm_load_tensors:    CPU buffer size =  70.31 MiB
    llm_load_tensors:  Vulkan0 buffer size = 4095.05 MiB
    .................................................................................................

2-b. If you do not want to build your own images, you can use the ones below for testing the different versions of llama_cpp:

  * quay.io/lstocchi/vulkan:v4_279 -> llama_cpp 0.2.79
  * quay.io/lstocchi/vulkan:v4_280 -> llama_cpp 0.2.80
  * quay.io/lstocchi/vulkan:v4_284 -> llama_cpp 0.2.84
  * ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan:62b6f628ed77cf3f1518c32746e2e89d27072f0e -> llama_cpp 0.2.85
  * quay.io/lstocchi/vulkan:v4_287 -> llama_cpp 0.2.87
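As a quick way to tell the two outcomes in steps 1 and 2 apart, the expected log lines can be checked mechanically. A small sketch; the helper name and the log-file argument are mine, not part of the PR:

```shell
# Hypothetical helper: exits 0 if a saved llama.cpp log shows Vulkan
# GPU offload, non-zero if inference fell back to the CPU.
gpu_offload_used() {
  log_file="$1"
  grep -q "ggml_vulkan: Found" "$log_file" \
    && grep -q "offloaded .* layers to GPU" "$log_file"
}
```

Usage would be something like `podman logs <container> > run.log && gpu_offload_used run.log && echo "GPU in use"`.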

axel7083 commented 1 month ago

Is there an upstream issue to link?

lstocchi commented 1 month ago

I was opening it -> https://github.com/containers/ai-lab-recipes/issues/742