Prerequisites
Please answer the following questions for yourself before submitting an issue.
[x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
It would be nice if llama.cpp supported bf16 in convert.py, main, and quantize.
Motivation
There are a few motivations:
Many models such as llama3 are being trained and released in bf16. If you want to run them unquantized in llama.cpp, as far as I can tell, you have to convert to fp32, even though hardware increasingly supports bf16 natively. Unfortunately, fp32 is not space-efficient enough to fit a model like llama3-8b into the 24 GB of VRAM on common consumer GPUs, and even if it did fit, it would run slower.
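As rough arithmetic: ~8B parameters × 4 bytes/param ≈ 32 GB in fp32, versus ≈ 16 GB at 2 bytes/param in bf16, so bf16 would leave headroom on a 24 GB card for the KV cache and activations.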
Quantization accuracy: Some people mistakenly treat fp16 and bf16 as interchangeable, so when quantizing they first export to fp16, thinking it is equivalent, and then quantize to the desired level. Because fp16 has a much narrower exponent range than bf16, this is not quite the same as quantizing from fp32 (or from the original bf16 weights). Anyone who does this by mistake and then uploads the result to Hugging Face is doing the community a disservice without realizing it. Supporting bf16 directly should reduce instances of this mistake.
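To make the difference concrete, here is a small illustration (PyTorch is used only because it has native bf16 support; the values are contrived): bf16 keeps fp32's 8 exponent bits while fp16 has only 5, so a bf16 -> fp16 detour clamps or flushes anything outside fp16's range, whereas bf16 -> fp32 does not:

```python
import torch

x = torch.tensor([3.0e38, 1.0e-30, 0.1], dtype=torch.bfloat16)

print(x.to(torch.float32))                    # all three values survive
print(x.to(torch.float16).to(torch.float32))  # 3e38 overflows to inf, 1e-30 flushes to 0
```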
Possible Implementation
I assume this is straightforward in theory, although I am not familiar enough with the codebase to implement it myself.
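For reference, the raw conversion itself is cheap: bf16 is just the top 16 bits of an fp32 bit pattern, so a conversion helper comes down to bit shifts. A minimal sketch (hypothetical helper names, numpy assumed; not llama.cpp's actual code):

```python
import numpy as np

def bf16_to_fp32(bits: np.ndarray) -> np.ndarray:
    """bits: uint16 array holding raw bf16 values."""
    out = bits.astype(np.uint32)
    out <<= 16                      # bf16 is the high half of an fp32 bit pattern
    return out.view(np.float32)

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    """Truncating fp32 -> bf16 (real code would round to nearest even)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

raw = fp32_to_bf16(np.array([1.0, -2.5, 3.0e38], dtype=np.float32))
print(bf16_to_fp32(raw))            # approximately [1.0, -2.5, 3.0e38]
```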
Addendum
I did actually search rather than merely checking the boxes. The most relevant existing issue I found was #6125, which asks about bf16 support in convert.py.
As for discussions, I did not find anything particularly relevant, although one discussion on convert.py looked superficially relevant until I read it in detail.