Closed by qnixsynapse 3 weeks ago.
Should be the same root cause as https://github.com/ggerganov/llama.cpp/issues/8941 (but for a different backend).
> Enabling this would give higher precision with the same number of bits as fp16:

When I tested it, I found no statistically significant difference between FP16 and BF16 in terms of their average ability to correctly predict the next token.
@qnixsynapse Arc GPUs support the BF16 data type, but the SYCL backend doesn't support BF16 yet. It follows the CUDA backend, as in #8941. :) There is no strong requirement for BF16 for now.
In our experience, Arc GPUs actually perform slightly better with BF16 than with FP16.
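For reference, a quick way to confirm what the active SYCL device advertises is to query it at runtime. This is a minimal standalone sketch, not llama.cpp code, and it assumes a recent oneAPI/DPC++ toolchain that implements the sycl_ext_oneapi_bfloat16_math_functions extension (the aspect may not exist on older toolchains):

```cpp
// Sketch only: report the selected SYCL device and whether it advertises
// bf16 math support via the DPC++ extension aspect (availability of this
// aspect depends on the compiler/runtime version).
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q{sycl::default_selector_v};
    sycl::device dev = q.get_device();

    std::cout << "device: " << dev.get_info<sycl::info::device::name>() << "\n";
    std::cout << "bf16 math functions: "
              << (dev.has(sycl::aspect::ext_oneapi_bfloat16_math_functions) ? "yes" : "no")
              << "\n";
    return 0;
}
```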
Most model weights available on HF are in bf16, so using the converter with the default options will keep them as bf16. Higher precision doesn't always mean "better next token prediction performance". It means fewer overflow errors/NaNs than fp16 and, in this case, faster performance than fp32.
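To make the overflow point concrete, here is a small standalone C++ sketch (not from llama.cpp): a value above fp16's maximum finite value (65504) still survives a round trip through bf16, because bf16 keeps fp32's 8-bit exponent and only truncates the mantissa.

```cpp
// Standalone illustration: bf16 is just the top 16 bits of fp32, so it keeps
// fp32's exponent range. 70000.0f overflows fp16 (max finite value 65504) but
// round-trips through bf16 with only mantissa truncation (result: 69632).
#include <cstdint>
#include <cstdio>
#include <cstring>

static float f32_via_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint16_t bf16 = static_cast<uint16_t>(bits >> 16); // truncate, no rounding
    uint32_t back = static_cast<uint32_t>(bf16) << 16;
    float out;
    std::memcpy(&out, &back, sizeof(out));
    return out;
}

int main() {
    float x = 70000.0f; // not representable as a finite fp16
    std::printf("%g -> bf16 -> %g\n", x, f32_via_bf16(x));
    return 0;
}
```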
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
Inference with a bf16 GGUF is very slow (about 2 tokens/sec) compared to fp16 (36 tokens/sec) for the Gemma 2B model.
This is probably because bf16 isn't supported in the SYCL backend. Enabling this would give higher precision with the same number of bits as fp16:
https://github.com/ggerganov/llama.cpp/blob/master/ggml%2Fsrc%2Fggml-sycl.cpp#L5059-L5061
Arc GPUs support bf16.
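For illustration only, here is a minimal SYCL sketch of the kind of bf16-to-fp32 conversion kernel a backend would need. This is not the actual ggml-sycl code; the buffer names and sample values are made up.

```cpp
// Illustrative sketch (not ggml-sycl code): convert a buffer of bf16 weights
// (stored as uint16_t) to fp32 on the SYCL device by shifting the 16 bf16
// bits into the high half of a 32-bit float.
#include <sycl/sycl.hpp>
#include <cstdint>
#include <vector>

int main() {
    std::vector<uint16_t> src = {0x3F80, 0x4000, 0xC040}; // bf16 for 1.0f, 2.0f, -3.0f
    std::vector<float> dst(src.size());

    sycl::queue q{sycl::default_selector_v};
    {
        sycl::buffer<uint16_t, 1> b_src(src.data(), sycl::range<1>(src.size()));
        sycl::buffer<float, 1> b_dst(dst.data(), sycl::range<1>(dst.size()));

        q.submit([&](sycl::handler &cgh) {
            sycl::accessor a_src(b_src, cgh, sycl::read_only);
            sycl::accessor a_dst(b_dst, cgh, sycl::write_only, sycl::no_init);
            cgh.parallel_for(sycl::range<1>(src.size()), [=](sycl::id<1> i) {
                uint32_t bits = static_cast<uint32_t>(a_src[i]) << 16;
                a_dst[i] = sycl::bit_cast<float>(bits);
            });
        });
    } // buffer destruction waits for the kernel and copies results back into dst

    // dst now holds {1.0f, 2.0f, -3.0f}
    return 0;
}
```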
Name and Version
Latest Master
What operating system are you seeing the problem on?
Linux
Relevant log output