Does it depend on the quant type? Does it happen with F16 models?
I've tried Q4_K_M, Q8_0, and fp16. They all show the same issue.
Not reproduced with RTX 2060:
LLAMA_CUDA=1 make -j && ./parallel -ngl 400 -np 3 -ns 100 -fa -m models/llama-8b-v3-instruct/ggml-model-q4_k.gguf
I believe I have found the issue. Please confirm whether this fix works: https://github.com/ggerganov/llama.cpp/pull/7904
Probably limited to cards without tensor cores, then. Looking at ggml-cuda/fattn.cu: https://github.com/ggerganov/llama.cpp/blob/a9cae48003dfc4fe95b8f5c81682fc6e63425235/ggml-cuda/fattn.cu#L316-L323

I see corrupt output specifically when ggml_cuda_flash_attn_ext_vec_f32 is chosen; the tile variant seems fine. On the ROCm machine I'm testing on, ggml_cuda_flash_attn_ext_vec_f16 is used instead, and that produces normal output.
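For reference, the selection logic at those lines looks roughly like this (a condensed paraphrase of the upstream dispatch code, not verbatim; helper names are taken from ggml-cuda, and the small-batch threshold is approximate):

```cpp
// Condensed paraphrase of the kernel selection in ggml_cuda_flash_attn_ext()
// (ggml-cuda/fattn.cu). The f32 paths serve GPUs without fast FP16 math;
// vec kernels are picked for small batches, tile kernels otherwise.
const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;

if (!fast_fp16_available(cc)) {           // no fast FP16 math -> f32 kernels
    if (Q->ne[1] <= 8) {                  // small batch (threshold approximate)
        ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);  // corrupt output seen here
    } else {
        ggml_cuda_flash_attn_ext_tile_f32(ctx, dst); // tile variant seems fine
    }
    return;
}

if (!fp16_mma_available(cc)) {            // fast FP16, but no tensor cores
    if (Q->ne[1] <= 8) {
        ggml_cuda_flash_attn_ext_vec_f16(ctx, dst);  // ROCm path here; output normal
    } else {
        ggml_cuda_flash_attn_ext_tile_f16(ctx, dst);
    }
    return;
}
```

This also matches the RTX 2060 not reproducing: with tensor cores available, neither of these branches is taken.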
> I believe I have found the issue. Please confirm whether this fix works: #7904
Does seem to fix it, yes.
What happened?
When flash attention is enabled, generating parallel sequences in a batch produces corrupt output on some GPUs. It only seems to happen at certain batch sizes. I've tested with llama3 8b, mistral 7b, and qwen2 0.5b, all of which produce similarly corrupted output.
To reproduce:
./parallel -ngl 400 -np 3 -ns 10 -fa -m some-model.gguf
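To map out which batch sizes are affected, the same command can be swept over -np (a hypothetical one-liner, using the same flags as above): for np in 1 2 3 4 8 16; do ./parallel -ngl 400 -np $np -ns 10 -fa -m some-model.gguf; done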
The broken batch sizes and backends I've tried:
Name and Version
$ ./parallel --version
version: 3138 (704a35b1)
built with cc (Debian 13.2.0-25) 13.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output