Open a-downing opened 2 months ago
The error message is unhelpful but this is just a consequence of there being no CUDA implementation for BF16. For 0 GPU layers the GPU is used for prompt processing only if the prompt is at least 32 tokens long so that is most likely why it only crashes sometimes.
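The dispatch behavior described above can be sketched as follows. This is purely illustrative: the constant name and function are hypothetical, mirroring the comment's claim that with 0 GPU layers the GPU is only used for prompt processing once the prompt reaches 32 tokens (the real check lives inside llama.cpp's CUDA backend).

```python
# Hypothetical sketch of the dispatch described above (not llama.cpp's
# actual code): with 0 GPU layers, prompt processing is routed to CUDA
# only once the batch reaches the 32-token threshold.
GPU_BATCH_THRESHOLD = 32  # assumed constant, taken from the comment above

def prompt_uses_gpu(n_prompt_tokens: int, n_gpu_layers: int = 0) -> bool:
    # With layers offloaded, the GPU is always involved; otherwise it is
    # used for prompt processing only at or above the token threshold.
    return n_gpu_layers > 0 or n_prompt_tokens >= GPU_BATCH_THRESHOLD

print(prompt_uses_gpu(16))  # False -> short prompts stay on the CPU, no crash
print(prompt_uses_gpu(64))  # True  -> CUDA path is hit, BF16 unsupported -> crash
```

This also explains why the crash looks random: it depends entirely on whether a given prompt crosses the threshold.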
The error message is unhelpful but this is just a consequence of there being no CUDA implementation for BF16. For 0 GPU layers the GPU is used for prompt processing only if the prompt is at least 32 tokens long so that is most likely why it only crashes sometimes.
Ahh gotcha. Looking at the command line options I didn't notice anything to disable prompt processing on the GPU. Is this the case?
There is no CLI option; you would have to compile without flags like LLAMA_CUDA.
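For reference, a CPU-only build comes from simply omitting the CUDA flag at build time. This is a sketch of the build configuration, assuming the upstream Makefile/CMake setup; check the repository's build instructions for the exact invocation:

```shell
# Makefile build: leave out LLAMA_CUDA=1 to get a CPU-only binary
make clean && make

# CMake equivalent: do not pass the CUDA option when configuring
cmake -B build && cmake --build build --config Release
```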
I am facing the same issue with the same model - see #7223 (now closed). Is bf16 unsupported by CUDA in general, or only by llama.cpp?
EDIT: FYI, it works for me if I stay under batch size 32, such as with the flag `-b 16`. But this is a bit slow.
It's a llama.cpp issue.
+1 for bf16 CUDA llama.cpp support
please add
I'm currently creating an imatrix from f32, which will end up taking about two days total. I assume bf16 would have cut that in half.
For Rocm support too
Can anyone share a quick fix for this?
`-b 16` with llama.cpp does not work for me...
There is no fix. CUDA support for BF16 is simply not implemented. Your only options are to either implement it yourself or to use a llama.cpp version without CUDA.
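A third workaround is to convert the BF16 weights to a format CUDA does support (F16 or F32) before loading them. The conversion itself is trivial because BF16 is just the upper 16 bits of an IEEE float32; the sketch below shows that bit-level relationship with NumPy (the function name is mine, and a real GGUF would be converted with llama.cpp's own conversion/quantization tools rather than by hand):

```python
import numpy as np

def bf16_to_f32(bits_u16: np.ndarray) -> np.ndarray:
    """Widen raw BF16 bit patterns (uint16) to IEEE float32.

    BF16 is defined as the top 16 bits of a float32, so the lossless
    conversion is just a 16-bit left shift of the bit pattern.
    """
    return (bits_u16.astype(np.uint32) << 16).view(np.float32)

# 0x3F80 is 1.0 in BF16; 0xC000 is -2.0
vals = bf16_to_f32(np.array([0x3F80, 0xC000], dtype=np.uint16))
print(vals)  # [ 1. -2.]
```

Going the other way (F32 to BF16) truncates the low 16 mantissa bits, which is why BF16 checkpoints are half the size of F32 at the same dynamic range.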
I was trying this model: https://huggingface.co/ddh0/Meta-Llama-3-8B-Instruct-bf16-GGUF
By varying the prompt, it randomly works sometimes. When offloading layers to the GPU, it seems to crash no matter the prompt.