Prerequisites
Please answer the following questions for yourself before submitting an issue.
[x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
It would be nice if llama.cpp supported bf16 in convert.py, main, and quantize.
Motivation
There are a few motivations:
Many models such as llama3 are being trained and released in bf16. If you want to run them unquantized in llama.cpp, as far as I can tell, you have to convert to fp32, even though hardware increasingly supports bf16 natively. Unfortunately, fp32 is not space-efficient enough to fit a model like llama3-8b into the 24 GB of VRAM on common consumer GPUs, and even if it did fit, it would run slower.
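As rough arithmetic: ~8B parameters × 4 bytes/param ≈ 32 GB in fp32, versus ≈ 16 GB at 2 bytes/param in bf16, so bf16 would leave headroom on a 24 GB card for the KV cache and activations.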
Quantization accuracy: Some people mistakenly treat fp16 and bf16 as interchangeable, so when quantizing they first export to fp16, thinking it is equivalent, and then quantize to the desired level. Because fp16 has a much narrower exponent range than bf16, this is not quite the same as quantizing from fp32 (or from the original bf16 weights). Anyone who does this by mistake and then uploads the result to Hugging Face is doing the community a disservice without realizing it. Supporting bf16 directly should reduce instances of this mistake.
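To make the difference concrete, here is a small illustration (PyTorch is used only because it has native bf16 support; the values are contrived): bf16 keeps fp32's 8 exponent bits while fp16 has only 5, so a bf16 -> fp16 detour clamps or flushes anything outside fp16's range, whereas bf16 -> fp32 does not:

```python
import torch

x = torch.tensor([3.0e38, 1.0e-30, 0.1], dtype=torch.bfloat16)

print(x.to(torch.float32))                    # all three values survive
print(x.to(torch.float16).to(torch.float32))  # 3e38 overflows to inf, 1e-30 flushes to 0
```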
Possible Implementation
I assume this is straightforward in theory, although I am not familiar enough with the codebase to implement it myself.
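For reference, the raw conversion itself is cheap: bf16 is just the top 16 bits of an fp32 bit pattern, so a conversion helper comes down to bit shifts. A minimal sketch (hypothetical helper names, numpy assumed; not llama.cpp's actual code):

```python
import numpy as np

def bf16_to_fp32(bits: np.ndarray) -> np.ndarray:
    """bits: uint16 array holding raw bf16 values."""
    out = bits.astype(np.uint32)
    out <<= 16                      # bf16 is the high half of an fp32 bit pattern
    return out.view(np.float32)

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    """Truncating fp32 -> bf16 (real code would round to nearest even)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

raw = fp32_to_bf16(np.array([1.0, -2.5, 3.0e38], dtype=np.float32))
print(bf16_to_fp32(raw))            # approximately [1.0, -2.5, 3.0e38]
```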
Addendum
I did actually search rather than merely checking the boxes. The most relevant existing issue I found was #6125, which asks about bf16 support in convert.py.
As for discussions, I did not find anything particularly relevant, although one discussion on convert.py looked superficially relevant until I read it in detail.