ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: [SYCL] bf16 model's performance is very slow in comparison to fp16 model's #9298

Closed · qnixsynapse closed this 3 weeks ago

qnixsynapse commented 2 months ago

What happened?

The performance of a bf16 GGUF is very slow (about 2 tokens/sec) compared to fp16 (36 tokens/sec) for the Gemma 2B model.

This is probably because bf16 isn't supported in the SYCL backend. Enabling it would give higher precision with the same number of bits as fp16:

https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-sycl.cpp#L5059-L5061

Arc GPUs support bf16.
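
For reference, bf16 is just the top 16 bits of an IEEE-754 fp32 value, so conversion between the two is a bit shift. Below is a minimal C sketch of what a bf16 load/store path boils down to; it is not the actual ggml/SYCL code (production conversions typically round to nearest-even rather than truncate):

```c
// Sketch only: bf16 occupies the high 16 bits of an fp32 bit pattern,
// so widening is a shift and narrowing is a truncation.
#include <stdint.h>
#include <string.h>
#include <stdio.h>

// Widen bf16 (stored as uint16_t) to fp32 by placing it in the high bits.
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Narrow fp32 to bf16 by dropping the low 16 mantissa bits
// (real implementations usually round to nearest-even instead).
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

int main(void) {
    float x = 3.14159f;
    uint16_t h = fp32_to_bf16(x);
    printf("%f -> 0x%04x -> %f\n", x, h, bf16_to_fp32(h));
    return 0;
}
```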

Name and Version

Latest Master

What operating system are you seeing the problem on?

Linux

Relevant log output

INFO [           print_timings] generation eval time =    6641.88 ms /    15 runs   (  442.79 ms per token,     2.26 tokens per second) | tid="123530548281728" timestamp=1725374034 id_slot=0 id_task=0 t_token_generation=6641.883 n_decoded=15 t_token=442.7922 n_tokens_second=2.2583956989305594
JohannesGaessler commented 2 months ago

Should be the same root cause as https://github.com/ggerganov/llama.cpp/issues/8941 (but for a different backend).

> Enabling it would give higher precision with the same number of bits as fp16

When I tested it, I found no statistically significant difference between FP16 and BF16 in terms of their average ability to correctly predict the next token.

NeoZhangJianyu commented 2 months ago

@qnixsynapse Arc GPUs support the BF16 data type, but the SYCL backend doesn't support BF16 yet. It follows the CUDA backend, as in #8941. :) There is no strong requirement for BF16 at the moment.

In our experience, Arc GPUs actually show slightly better performance with BF16 than with FP16.

qnixsynapse commented 2 months ago

Most model weights available on HF are in bf16, so using the converter with the default options will keep them as bf16. Higher precision doesn't necessarily mean "better next-token prediction"; it means fewer overflow errors/NaNs than fp16 and, in this case, faster performance than fp32.
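
To illustrate the overflow point (a sketch, not llama.cpp code): fp16 has a 5-bit exponent and its largest finite value is 65504, while bf16 reuses fp32's 8-bit exponent, so a value that would overflow to inf in fp16 survives a bf16 round-trip with only mantissa precision lost:

```c
// Illustration only: 1e5 exceeds fp16's max finite value (65504) and
// would become +inf there, but round-trips through bf16 as a finite
// number (~99840 here, since truncation discards low mantissa bits).
#include <stdint.h>
#include <string.h>
#include <stdio.h>

static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);   // truncating for brevity
}

static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    float big = 1.0e5f;  // larger than fp16's max finite value of 65504
    printf("bf16 round-trip of %g -> %g (finite, not inf)\n",
           big, bf16_to_fp32(fp32_to_bf16(big)));
    return 0;
}
```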

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 14 days since being marked as stale.