ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Support StableLM From StabilityAI #1063

Closed MarkSchmidty closed 1 year ago

MarkSchmidty commented 1 year ago

Blog Post Announcement (It may be using the same architecture as GPT-NeoX)

Launch In Colab

GitHub Repo

In case these links 404 due to being posted early by accident: https://archive.is/ZQszO https://archive.ph/U0Pr8

(Checkpoint links are Hugging Face repos with model weights.)

| Size | StableLM-Base-Alpha | StableLM-Tuned-Alpha | Training Tokens [in progress] | Context Window | Web Demo |
|------|---------------------|----------------------|-------------------------------|----------------|----------|
| 3B   | checkpoint          | checkpoint           | 800B [1.5T]*                  | 4096           |          |
| 7B   | checkpoint          | checkpoint           | 800B [1.5T]*                  | 4096           | HuggingFace |
| 15B  | (in progress)       | (pending)            | 1.5T*                         |                |          |
| 30B  | (in progress)       | (pending)            | 1.5T*                         |                |          |
| 65B  | (in progress)       | (pending)            | 1.5T*                         |                |          |
| 175B | (planned)           |                      |                               |                |          |

*3T planned

Green-Sky commented 1 year ago

are they just new GPT-NeoX models? or did they forget to update the model cards on HF? :smile:

Green-Sky commented 1 year ago

related https://github.com/ggerganov/ggml/issues/10

jessejohnson commented 1 year ago

This was quick! 😅

They've included a note in the README indicating that compatibility with llama.cpp is actively desired. :)

EDIT: related HN thread https://news.ycombinator.com/item?id=35629127

NoNamedCat commented 1 year ago

Will these models be compatible with llama.cpp?

rabidcopy commented 1 year ago

Definitely interested in this. Interesting that they specifically highlight wanting llama.cpp/ggml support.

rabidcopy commented 1 year ago

If it really is GPT-NeoX, this repo has conversion, quantization, and basic inference support for GPT-NeoX and other model formats:

https://github.com/NolanoOrg/cformers/blob/master/cformers/cpp/converters/convert_gptneox_to_ggml.py
https://github.com/NolanoOrg/cformers/blob/master/cformers/cpp/quantize_gptneox.cpp
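
For a rough idea of what such a converter does, here is a simplified sketch (not the cformers script; the header fields and per-tensor layout are assumptions for illustration, as real ggml loaders expect a specific format):

```python
import struct
from transformers import AutoModelForCausalLM

# load the HF GPT-NeoX-style checkpoint (repo name as published on HF)
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-3b")
hp = model.config

with open("ggml-model-f32.bin", "wb") as fout:
    # file header: magic plus the hyperparameters the loader needs
    fout.write(struct.pack("i", 0x67676D6C))  # "ggml" magic (illustrative)
    fout.write(struct.pack("i", hp.vocab_size))
    fout.write(struct.pack("i", hp.hidden_size))
    fout.write(struct.pack("i", hp.num_attention_heads))
    fout.write(struct.pack("i", hp.num_hidden_layers))

    # serialize every tensor: dims, name, dtype tag, raw data
    for name, tensor in model.state_dict().items():
        data = tensor.float().numpy()
        encoded = name.encode("utf-8")
        fout.write(struct.pack("iii", data.ndim, len(encoded), 0))  # 0 = f32
        for dim in reversed(data.shape):
            fout.write(struct.pack("i", dim))
        fout.write(encoded)
        data.tofile(fout)
```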

ggerganov commented 1 year ago

Here is a very quick and dirty implementation using ggml:

https://github.com/ggerganov/ggml/pull/96

Also, found a bug in multi-threaded ggml_cpy():

https://github.com/ggerganov/ggml/pull/96/files#diff-b4a500ab2765c31526c5541f3e51e21e46990b87d9774cac6f3089db315bdc5bR5655-R5660

acheong08 commented 1 year ago

> are they just new GPT-NeoX models? or did they forget to update the model cards on HF? :smile:

Is it?

MarkSchmidty commented 1 year ago

Yes, it's using the GPT-NeoX architecture. The model details can be seen here: https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-base-alpha-7b.yaml

  # model settings
  "num-layers": 16,
  "hidden-size": 6144,
  "num-attention-heads": 48,
  "seq-length": 4096,
  "max-position-embeddings": 4096,

  # architecture design
  "norm": "layernorm",
  "pos-emb": "rotary",
  "rotary_pct": 0.25,
  "activation": "gelu",
  "no-weight-tying": true,
  "gpt_j_residual": true,
  "output_layer_parallelism": "column",
ggerganov commented 1 year ago

Merged in ggml: https://github.com/ggerganov/ggml/tree/master/examples/stablelm

mhkhung commented 1 year ago

Are the q4_x files output by ggml not compatible with llama.cpp?

fgdfgfthgr-fox commented 1 year ago

> Are the q4_x files output by ggml not compatible with llama.cpp?

It seems so, currently.

magicrobotmonkey commented 1 year ago

I've converted/quantized stablelm-tuned-alpha-7b to Q4_3 and it works great with ggml, but llama.cpp throws `error loading model: missing tok_embeddings.weight`, so it seems like some support is missing.
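
For anyone curious why this comes up: llama.cpp's loader looks for LLaMA-style tensor names, while a StableLM checkpoint carries GPT-NeoX-style names. A quick sketch to check a checkpoint for the mismatch (the names below are a representative subset, assuming the HF GPT-NeoX layout):

```python
import torch

# LLaMA-style names llama.cpp's loader expects (subset)
llama_names = {"tok_embeddings.weight", "norm.weight", "output.weight"}

# GPT-NeoX-style names a StableLM checkpoint actually carries (subset)
neox_names = {"gpt_neox.embed_in.weight",
              "gpt_neox.final_layer_norm.weight",
              "embed_out.weight"}

state = torch.load("pytorch_model.bin", map_location="cpu")
present = set(state.keys())
print("llama-style hits:", llama_names & present)  # empty for StableLM
print("neox-style hits:", neox_names & present)    # non-empty for StableLM
```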

AndreiSva commented 1 year ago

I am getting the same error

mikeggh commented 1 year ago

Are you using the dedicated stablelm binary? From the looks of it, it's a separate example: https://github.com/ggerganov/ggml/tree/master/examples/stablelm

wkkautas commented 1 year ago

Are there plans to integrate ggml/examples/stablelm into llama.cpp? It would also be great if a single llama.cpp binary could run GPT-2 and GPT-J models as well.

ggerganov commented 1 year ago

There seems to be a bug in the existing StableLM implementation in ggml. See the updated README for more details:

https://github.com/ggerganov/ggml/tree/master/examples/stablelm#warning

Best way to fix this is to compare outputs with the reference implementation. Any help will be appreciated.

ggerganov commented 1 year ago

So, I ran the HF transformers implementation and I observe the same "increasing magnitude" behaviour as in the ggml implementation.

To do this, I changed the following line:

https://github.com/huggingface/transformers/blob/c2c99dc7ef5edab8f7674a1eb00cf6ac6996fd0f/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L234

to:

        # dump the raw pre-softmax attention scores for inspection
        print(attn_scores)
        attn_weights = nn.functional.softmax(attn_scores, dim=-1)

Here is the output log from a sample run:

softmax-stablelm.txt

For comparison, here is running GPT-2 using HF transformers with the same change:

softmax-gpt-2.txt

Notice how the GPT-2 values stay well below 1e1 at every layer, while the StableLM numbers jump all the way up to 1e3. The GPT-2 behaviour is also what I observe for GPT-J and LLaMA models (the models I currently play with the most). To me, that behaviour makes intuitive sense and seems correct, while the StableLM numbers look weird.


So is my understanding incorrect or is there something wrong with the StableLM model? In any case, I no longer think there is a bug in the ggml implementation.
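
For intuition on what those magnitudes do downstream, here is a tiny numpy check (illustrative numbers only): scores around 1e1 leave softmax with meaningful mass on several positions, while scores around 1e3 saturate it into a one-hot distribution.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# GPT-2-like magnitudes (~1e1): mass spread over several positions
print(softmax(np.array([8.0, 6.5, 5.0, 2.0])))
# -> approx [0.78, 0.18, 0.04, 0.00]

# StableLM-like magnitudes (~1e3): effectively one-hot; every other
# position underflows to ~0 (i.e. -inf in log space)
print(softmax(np.array([1000.0, 900.0, 800.0, 100.0])))
# -> [1.0, 0.0, 0.0, 0.0] to machine precision
```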

byroneverson commented 1 year ago

I believe this behavior is correct and is a result of how the models were trained. The text output seems coherent, and the values only rarely converge to -inf. I may be out of line, but is it possible this is normal? I will continue looking into this, but I doubt softmax would work at all if this were a major issue. If you have any further insight, I would love to dive deeper.

ggerganov commented 1 year ago

> is it possible this is normal?

Absolutely. It's just my intuitive understanding that the scaling before the softmax has the purpose of preventing exactly this kind of magnitude increase. But I could be wrong, and this may be fine.
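
For reference, the scaling in question is the standard 1/sqrt(d_head) factor from scaled dot-product attention; a generic numpy sketch (not the ggml code):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d_head)
    d_head = q.shape[-1]
    # dividing by sqrt(d_head) keeps score variance near 1 at init,
    # but trained weights can still push the scores well beyond that
    scores = (q @ k.T) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```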