Open Googulator opened 2 hours ago
From what you are describing, my conclusion would be that the ROCm version of cuBLAS is not deterministic.
In that case, it would be cublasGemmEx specifically, since forcing the cublasSgemm version results in deterministic output.
Looking at that CC check, it seems to be checking for tensor cores, which the RDNA family of GPUs indeed doesn't have, so using the cublasSgemm version makes more sense on RDNAx GPUs.
What's not clear is why without Flash Attention, even force-MMQ doesn't help.
The CC check is intended to determine if the GPU has fast enough F16 matrix multiplication that it may be worth converting the operands to F16, but that was written for NVIDIA GPUs, and I don't think that there was testing done on AMD hardware. Without flash attention, the matrix multiplications in the attention will be done with cuBLAS, so that should explain the difference.
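As a side illustration of how a matrix multiplication path can produce run-to-run differences at all: floating-point addition is not associative, so a kernel that accumulates partial sums in a varying order can round differently. The sketch below is plain Python, not llama.cpp, cuBLAS or rocBLAS code, and it is only an assumption that accumulation order is the mechanism at play in this issue; the values and the helper name sum_in_order are made up for the demonstration.

```python
def sum_in_order(values, order):
    """Accumulate values one by one, following the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Same four numbers, two accumulation orders.
values = [1e16, 1.0, -1e16, 1.0]

# Here the first 1.0 is absorbed by rounding when added to 1e16.
forward = sum_in_order(values, [0, 1, 2, 3])

# Here the large terms cancel first, so both 1.0 terms survive.
regrouped = sum_in_order(values, [0, 2, 1, 3])

print(forward, regrouped)  # the two sums differ
```

With greedy sampling at temperature 0, a difference of this kind in a single logit could be enough to change the selected token, after which the outputs diverge.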
What happened?
We are running llama-server on a Radeon RX 7900 XT, with the command line
./llama-server -t 4 -ngl 50 -c 13000 --host 0.0.0.0 --port 18080 --mlock -m mistral-nemo-instruct-2407-q8_0.gguf --chat-template llama2
Upon calling the server repeatedly ("completion" endpoint) with the following JSON request:
...we get inconsistent output between calls, despite temperature being 0, and using a fixed seed.
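A minimal sketch of how such a check could be scripted, assuming the server from the command line above is reachable on port 18080. The payload passed in is up to the caller; the original JSON request is not reproduced here, and the helper names fetch_completion, distinct_outputs and check_server are hypothetical.

```python
import json
import urllib.request


def fetch_completion(url, payload):
    """POST payload as JSON and return the generated text from the reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Assumes the reply is a JSON object with a "content" field.
        return json.load(resp)["content"]


def distinct_outputs(fetch, runs):
    """Call fetch() `runs` times and return the set of distinct results."""
    return {fetch() for _ in range(runs)}


def check_server(url, payload, runs=10):
    """Return True if every run produced the same output."""
    outputs = distinct_outputs(lambda: fetch_completion(url, payload), runs)
    return len(outputs) == 1
```

For example, check_server("http://localhost:18080/completion", {"prompt": "Hello", "n_predict": 64, "temperature": 0, "seed": 42}) would return False on a setup showing the behaviour described here; that payload is a placeholder, not the request used in this report.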
We have found the following workarounds, which all result in deterministic output:
GGML_CUDA_FORCE_MMQ=1 and enabling Flash Attention (also a significant slowdown, though less than the previous one; neither change alone results in deterministic output, only their combination)
ROCm version is 6.2.2 (running in a Docker container); the amdgpu kernel driver is the one supplied with Ubuntu kernel 6.8.0-47-generic (x86-64).
Name and Version
$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XT, compute capability 11.0, VMM: no
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
$ git show
commit b8deef0ec0af5febac1d2cfd9119ff330ed0b762 (HEAD -> master, tag: b4034, origin/master, origin/HEAD)
Author: Gabe Goodhart ghart@us.ibm.com
Date:   Tue Nov 5 05:23:04 2024 -0700
What operating system are you seeing the problem on?
Linux
Relevant log output
No response