ggerganov / llama.cpp

LLM inference in C/C++
MIT License

llama : add Falcon LLM support #1602

Closed someone13574 closed 1 year ago

someone13574 commented 1 year ago

Falcon LLM 40B and 7B were just open sourced under a license which allows commercial use (with royalties once revenue exceeds $1 million per year) and are topping the Hugging Face Open LLM leaderboard. It seems to be based on a modified GPT-3 architecture. I'm wondering if support in llama.cpp would be considered.

https://huggingface.co/tiiuae/falcon-40b

cmp-nct commented 1 year ago

First we need to implement ggml

Mind elaborating on that? It does not seem to make sense in context.

From what I've read (I haven't tested it myself), the model seems significantly better than LLaMA. While it has a kind of shitty license for commercial growth (free until $1M/year revenue, then 10%), it's better than illegal.

It's using flash attention and multi-query attention. gg already has branches with flash attention. I don't see that "implementation barrier"?

cmp-nct commented 1 year ago

I've just invested almost an hour of prompting into Instruct Falcon 40B and it's significantly smarter than OpenAssistant 30B, despite being less well tuned. It is smarter than Turbo on some tests I ran, not as good as Turbo overall, but I need to develop new tests now, as Falcon-40B can beat all of those I currently had in the "Legacy/GPT-4 only" section.

dseddah commented 1 year ago

There's a guy who provided a 4-bit quantized version of Falcon-7B; would it be of some use for llama.cpp?

https://github.com/Birch-san/falcon-play

cmp-nct commented 1 year ago

There's a guy who provided a 4-bit quantized version of Falcon-7B; would it be of some use for llama.cpp?

https://github.com/Birch-san/falcon-play

Falcon has the full precision binaries available here: https://huggingface.co/tiiuae/falcon-40b/tree/main https://huggingface.co/tiiuae/falcon-40b-instruct https://huggingface.co/tiiuae/falcon-7b https://huggingface.co/tiiuae/falcon-7b-instruct https://huggingface.co/tiiuae/falcon-rw-1b

Work should start from those; the pre-quantized versions are not useful imho.

I'm not 100% sure yet, but from my tests I believe we have a superior successor to LLaMA on our hands that covers all our use cases (from small to large). I also tried some bias tests (given its origin); the Falcon 40B instruct model is surprisingly unbiased, and it felt like a bit of Turbo or GPT-4 "tuning" went into it ('As an AI model'). It remains to be tested and compared in detail, of course.

It solved riddles that Turbo, Alpaca, and OpenAssistant 30B cannot solve.

Carefully put: it looks like the 40B Falcon might outperform the largest 65B LLaMA (it does so in the benchmarks).

danmaxis commented 1 year ago

I don't know why I'm not able to convert it to .ggml, like other models.

Loading model file /mnt/m/llama_model/falcon-40b/pytorch_model-00009-of-00009.bin
Traceback (most recent call last):
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1168, in <module>
    main()
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1148, in main
    model_plus = load_some_model(args.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1076, in load_some_model
    model_plus = merge_multifile_models(models_plus)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 583, in merge_multifile_models
    model = merge_sharded([mp.model for mp in models_plus])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 562, in merge_sharded
    return {name: convert(name) for name in names}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 562, in <dictcomp>
    return {name: convert(name) for name in names}
                  ^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 537, in convert
    lazy_tensors: List[LazyTensor] = [model[name] for model in models]
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 537, in <listcomp>
    lazy_tensors: List[LazyTensor] = [model[name] for model in models]
                                      ~~~~~^^^^^^
KeyError: 'transformer.word_embeddings.weight'

KerfuffleV2 commented 1 year ago

@danmaxis

I don't know why I'm not able to convert it to .ggml, like other models.

Because it is a different type of model. LLaMA-based models have a certain structure. Falcon is not based on LLaMA; there's a different set of tensors, the tensors have different names, etc.

The conversion app can't handle Falcon models yet.

jessejohnson commented 1 year ago

@danmaxis

I don't know why I'm not able to convert it to .ggml, like other models.

Because it is a different type of model. LLaMA-based models have a certain structure. Falcon is not based on LLaMA; there's a different set of tensors, the tensors have different names, etc.

The conversion app can't handle Falcon models yet.

@KerfuffleV2 can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say, GPT-3? Will be super grateful!

klosax commented 1 year ago

How much of all the work done in this repo could easily be transferred to future models and architectures?

It looks like the happy days of the original LLaMA models may soon be over, as they start to get beaten by models with different architectures and more attractive licensing. Open LLM Leaderboard

As the flora of LLM architectures continues to grow and new ones replace the old, I think this repo and the LLM examples in the ggml repo should be merged into something like ggml_llm.

The ggml_llm would contain all the common LLM code (main inference / perplexity / file handling / quantization / sampling ..) and the code for each architecture could be added like plugins at compile time. The gpt4all-backend may be a good starting point for how such a structure could be built.

https://github.com/ggerganov/ggml/issues/185 https://github.com/ggerganov/ggml/pull/145#issuecomment-1544733902
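
A very rough sketch of the plugin idea above (hypothetical names, Python only for illustration; a real ggml_llm would be C/C++ with the architectures resolved at compile time):

    # Hypothetical sketch only -- none of these names exist in ggml or llama.cpp.
    from typing import Callable, Dict

    class Architecture:
        def __init__(self, name: str, convert: Callable, build_graph: Callable):
            self.name = name                # e.g. "llama", "falcon", "mpt"
            self.convert = convert          # HF checkpoint -> ggml file
            self.build_graph = build_graph  # builds the inference graph for ggml

    REGISTRY: Dict[str, Architecture] = {}

    def register(arch: Architecture) -> None:
        REGISTRY[arch.name] = arch

    def load_model(path: str, model_type: str):
        # Common code (file handling, quantization, sampling, perplexity) lives in
        # the shared frontend; only architecture-specific pieces are dispatched here.
        arch = REGISTRY.get(model_type)
        if arch is None:
            raise ValueError(f"unsupported architecture: {model_type}")
        return arch.build_graph(path)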

KerfuffleV2 commented 1 year ago

@jessejohnson

can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say, GPT-3?

I don't want to get too off-topic here, so if you want detailed information you'd probably be better off creating a discussion. I also don't really know the specific architecture of GPT-3, etc., so I can't tell you the exact ways two specific types of model differ, just provide some general information.

This is a bit simplified, but a model consists of a bunch of tensors (just big arrays of numbers in various dimensions). The tensors generally have names, like transformer.word_embeddings.weight. Models are also usually set up with some main-level tensors and then a set of tensors that are repeated across a number of layers. So you might have main_tensor and then layer.0.tensor1, layer.0.tensor2, layer.1.tensor1, etc. How the tensors are named depends on both the model architecture and the file format. GGML might call the same tensor a different thing from the HuggingFace format.

Anyway, to actually run a model one performs a bunch of math operations on those tensors. Some of the operations are simple, like addition and multiplication; some are more complex and can have complicated logic internally, like RoPE, ALiBi, matrix multiplication, etc.

Which tensors exist in a model and what sequence of those math operations are used to evaluate the model depends on the model architecture. While a LLaMA based model might have main_tensor + layer.0.tensor2 * layer.0.tensor1 * 1.321 a FALCON model might have layer.0.first.weight / (main_bias * 0.5) + layer.0.second.bias or whatever. I just made up completely random names there, they don't actually relate to anything.

The code in a project like this one which evaluates a supported model type (say, LLaMA) is set up to look for tensors with specific names, grab that data, and perform the various operations in the correct order; it also expects the result of those operations to be in a specific format.

Hopefully this makes it clearer why specific support needs to be added to ML tools for models that actually have a different architecture.
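
A toy illustration of the point, reusing the made-up tensor names from above (nothing here corresponds to real model code):

    # Toy example with the invented tensor names from above -- not real model code.
    import numpy as np

    llama_like = {
        "main_tensor": np.ones((4, 4)),
        "layer.0.tensor1": np.ones((4, 4)),
        "layer.0.tensor2": np.ones((4, 4)),
    }

    falcon_like = {
        "main_bias": np.ones((4, 4)),
        "layer.0.first.weight": np.ones((4, 4)),
        "layer.0.second.bias": np.ones((4, 4)),
    }

    def eval_llama_like(w):
        # one architecture's (invented) sequence of operations
        return w["main_tensor"] + w["layer.0.tensor2"] @ w["layer.0.tensor1"] * 1.321

    def eval_falcon_like(w):
        # a different (equally invented) sequence of operations
        return w["layer.0.first.weight"] / (w["main_bias"] * 0.5) + w["layer.0.second.bias"]

    # Feeding one architecture's tensors to the other's code fails immediately,
    # which is essentially the KeyError seen from convert.py earlier in this thread:
    try:
        eval_llama_like(falcon_like)
    except KeyError as e:
        print("missing tensor:", e)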

jessejohnson commented 1 year ago

Thanks @KerfuffleV2, this is exactly what I was looking for!

cmp-nct commented 1 year ago

I took a look: Falcon is Bloom-based, uses GPT-NeoX rotary embeddings and GELU activation. https://huggingface.co/tiiuae/falcon-40b/commit/e7950c40d6bc9caca678af160de9c79f33f93699 It looks like most of it is covered in https://github.com/NouamaneTazi/bloomz.cpp already.

Though it looks like a bit of a nightmare to adapt everything :(

iHaagcom commented 1 year ago

I took a look: Falcon is Bloom-based, uses GPT-NeoX rotary embeddings and GELU activation. https://huggingface.co/tiiuae/falcon-40b/commit/e7950c40d6bc9caca678af160de9c79f33f93699 It looks like most of it is covered in https://github.com/NouamaneTazi/bloomz.cpp already.

Though it looks like a bit of a nightmare to adapt everything :(

Can bloomz.cpp run this model?

cmp-nct commented 1 year ago

Not without adaptation. I've not looked into the differences (aside from the parameter and layer counts), but there certainly are some. Also, bloomz.cpp is bare-bones: no GPU support, etc. It would be a nice first step to get it running there, but llama.cpp is the platform with all the features.

real-andrew commented 1 year ago

While it has a kind of shitty license for commercial growth (free until $1M/year revenue, then 10%), it's better than illegal.

As of 3 hours ago, they tweeted that they will forgo any royalties for commercial and research use. I don't know what this means in practice, but Falcon might become the first capable, genuinely open-source model we get.

logicchains commented 1 year ago

They've just updated their Huggingface to confirm that the models are now available under Apache 2.0: https://huggingface.co/tiiuae .

jessejohnson commented 1 year ago

According to the announcement on their official site, it's Falcon 40B that is now under Apache 2.0. Not sure if they intend to do the same for the smaller models, or if they plan an even larger, license-restricted one.

https://www.tii.ae/news/uaes-falcon-40b-worlds-top-ranked-ai-model-technology-innovation-institute-now-royalty-free

cmp-nct commented 1 year ago

They updated the main page, but not the model pages yet. They are just a bit slow to follow up, but it looks like we're getting a fully open-source model. Best thing ever exported from Abu Dhabi?

Googulator commented 1 year ago

All models and datasets from them are now confirmed to be Apache 2.0. The model repositories still contain the old license.txt, but the models themselves are tagged Apache.

JohnAlcatraz commented 1 year ago

With Falcon-40B being significantly better than LLaMA-65B, and actually being fully open source under Apache 2.0, it's definitely the new king of open source LLMs. It would be great to see support for it in llama.cpp!

nikisalli commented 1 year ago

I was actually able to convert, quantize, and load the model, but there is some tensor math to debug and modify, and I have no 40GB GPU to check the tensor values at each layer, so it produces garbage for now.

I can give you the quantized model if you want to continue my work.

https://github.com/nikisalli/falcon.cpp

klosax commented 1 year ago

I was actually able to convert, quantize, and load the model, but there is some tensor math to debug and modify, and I have no 40GB GPU to check the tensor values at each layer, so it produces garbage for now.

Great work! Why don't you start with the 7B model instead? It should require less memory.

nikisalli commented 1 year ago

@klosax It is still too big! To debug the weights, the model needs to be loaded in fp16 on the GPU. This means a 24GB GPU is needed even for the 7B model, and I do not possess one.

ghost commented 1 year ago

Truthfully though, the initial Falcon work should be done on 7B to ease development; I think the architecture is the same regardless of model size. If it gets traction, I'm sure someone with a big GPU will hop in and help with the 40B :hugs:

Like it or not, LLaMA is limited by its legality, and truly open models like Falcon are the way forward for llama.cpp.

klosax commented 1 year ago

@nikisalli : On the model card it says "head_dim 64 (Reduced to optimise for FlashAttention)", but in config.json the number is 128. Maybe try reducing it to 64?
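
For what it's worth, if head_dim is simply derived from hidden_size and n_head (as in many HF model implementations; an assumption here, not verified against modelling_RW.py), the 40B config values quoted later in this thread already give 64:

    # Assuming head_dim = hidden_size // n_head (common in HF model code).
    hidden_size = 8192   # from the 40B config.json quoted further down
    n_head = 128
    head_dim = hidden_size // n_head
    print(head_dim)      # 64 -- matching the "head_dim 64" entry on the model card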

Green-Sky commented 1 year ago

@nikisalli What do you need the GPU for? Why not the CPU? ggml/llama.cpp is known for its ability to run on CPU, after all...

nikisalli commented 1 year ago

I find it useful to run the PyTorch model with many print statements here and there to check that ggml is giving me the same numbers, so that I know which operations to touch.

Green-Sky commented 1 year ago

Oh, you are running the Python one. My bad. But still, you should be able to force CPU mode.

nikisalli commented 1 year ago

Nope :( Some layers are not implemented for CPU with half precision!

FNsi commented 1 year ago

It's bf16 and I can't run it on my device either.

cmp-nct commented 1 year ago

I also struggled and didn't get it to run yet. There are significant differences in the attention/KQV handling between 7B and 40B:

Without multi_query (40B):

        self.query_key_value = Linear(
            self.hidden_size,
            (config.n_head_kv * 2 + config.n_head) * self.head_dim,
            bias=config.bias,
        )
        self.dense = Linear(self.hidden_size, self.hidden_size, bias=config.bias)
        self.attention_dropout = nn.Dropout(config.attention_dropout)
        self.num_kv = config.n_head_kv

With multi_query (7B):

        self.query_key_value = Linear(
            self.hidden_size,
            3 * self.hidden_size if not config.multi_query else (self.hidden_size + 2 * self.head_dim),
            bias=config.bias,
        )
        self.multi_query = config.multi_query
        self.dense = Linear(self.hidden_size, self.hidden_size, bias=config.bias)
        self.attention_dropout = nn.Dropout(config.attention_dropout)
        self.num_kv = 1

The relevant config for both is below. Config without multi-query (40B):

  "hidden_size": 8192,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "RefinedWeb",
  "n_head": 128,
  "n_head_kv": 8,
  "n_layer": 60,
  "parallel_attn": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.27.4",
  "use_cache": true,
  "vocab_size": 65024

Config with multi-query (7B):

 "hidden_size": 4544,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "RefinedWebModel",
  "multi_query": true,
  "n_head": 71,
  "n_layer": 32,
  "parallel_attn": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.27.4",
  "use_cache": true,
  "vocab_size": 65024

In the conversion Python module for 7B we'll also need the conv_map changed: 'input_layernorm' : 'attention_norm', # 7B. The handling of the k, q, v reshape is also different between the two.
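
A rough sketch of the shape arithmetic only (not the actual modelling_RW.py code), showing how differently the fused query_key_value output splits in the two variants:

    # Shape arithmetic only -- not the actual modelling_RW.py code.
    # Falcon fuses Q, K and V into one query_key_value projection; the split differs.

    # 40B: grouped KV -- 128 query heads share 8 K/V heads
    n_head, n_head_kv, hidden_size = 128, 8, 8192
    head_dim = hidden_size // n_head                      # 64
    fused_40b = (n_head_kv * 2 + n_head) * head_dim
    print(fused_40b)  # 9216 = 128 Q heads + 8 K heads + 8 V heads, 64 dims each

    # 7B: multi_query -- 71 query heads share a single K and a single V head
    n_head, hidden_size = 71, 4544
    head_dim = hidden_size // n_head                      # 64
    fused_7b = hidden_size + 2 * head_dim
    print(fused_7b)   # 4672 = 71 Q heads * 64 + one K head + one V head

These are the 9216 and 4672 magic numbers that come up further down in the thread.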

dseddah commented 1 year ago

I was actually able to convert, quantize, and load the model, but there is some tensor math to debug and modify, and I have no 40GB GPU to check the tensor values at each layer, so it produces garbage for now.

I can give you the quantized model if you want to continue my work.

https://github.com/nikisalli/falcon.cpp

Hi, I'll be super happy to have access to your quantized version of the 40B model if you can share it.

FNsi commented 1 year ago

Be aware: the below is not useful anymore, please use apaga43's ggml example instead.

I did edit a script to convert 7B Falcon to ggml, and the quantization part taken from bloomz.cpp works fine, but someone needs to modify main.cpp and load it to check: convert-7b-falcon

The reason for using the "None" branch:

if alibi is None:
            query_layer_ = query_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)
            key_layer_ = key_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)
            value_layer_ = value_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)

            attn_output = F.scaled_dot_product_attention(
                query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
            )
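
For reference, a rough equivalent of what that F.scaled_dot_product_attention call computes (causal mask, no dropout); illustration only, not the code used in the model:

    # Rough equivalent of the call above (causal, no dropout), for illustration only.
    import math
    import torch

    def naive_sdpa(q, k, v):
        # q: (batch, n_heads, seq, head_dim); with multi_query, k and v have a single
        # head dimension of size 1 and broadcast against q during matmul.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        seq = q.size(-2)
        causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool, device=q.device))
        scores = scores.masked_fill(~causal, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v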

~~Update: the problem I am facing is that the element count of WV (or whatever you call it) is equal to tensor.ne[1] * tensor.ne[2] * tensor.ne[3],~~

which I think is not right, because for the others it is

tensor.ne[0] * tensor.ne[1] * tensor.ne[2] * tensor.ne[3]

If I force it to load, there's an iostream error which eventually causes a segmentation fault.

~~Update: Did the same trick as nikisalli did; now it can be loaded, but only loaded. I think it's because of the attention, OR the ne[] sequence should be reversed, or just because I haven't edited all the functions from modelrw.py yet 😂 Feel free to... guess? (I changed the Python script too, but I won't change it here for now; it may still be right. If you want to use that main.cpp, just delete wq, wk, wv.) falcon-7b-main.cpp falcon-fall~~

~~Update: fixed ugly hacks like 9216 or 4672 to "2*n_embd/n_heads + n_embd", but it still repeats.~~

dseddah commented 1 year ago

Thanks.

nikisalli commented 1 year ago

Hi, I'll be super happy to have access to your quantized version of the 40B model if you can share it.

how can I share it with you?

d0rc commented 1 year ago

upload it to huggingface?

zcourts commented 1 year ago

@nikisalli I don't have the expertise to continue your work myself right now, but if you're open to it I can get you access to a large GPU server. We have access to V100 and V100s - I think they have 90GB if I recall right.

maddes8cht commented 1 year ago

Does anyone have any information about the context size of the falcon models? I couldn't find anything, besides a tweet claiming a context size of 8192 https://twitter.com/max_paperclips/status/1662208170247467009 but with no other source than a link to the huggingface repo.

klosax commented 1 year ago

Does anyone have any information about the context size of the falcon models? I couldn't find anything, besides a tweet claiming a context size of 8192 https://twitter.com/max_paperclips/status/1662208170247467009 but with no other source than a link to the huggingface repo.

https://huggingface.co/tiiuae/falcon-40b#model-architecture-and-objective

Sequence length: 2048

Sequence length should be the max useful context size value.

jploski commented 1 year ago

Note that just like for MPT previously (https://github.com/ggerganov/ggml/pull/145), the implementation approach should be to first create a working inference for this model in the ggml repository (https://github.com/ggerganov/ggml/tree/master/examples). This will require porting of modelling_RW.py released with the Falcon-7B model, which implements the model architecture/algorithms (there's already an open issue for that: https://github.com/ggerganov/ggml/issues/217)

FWIW, just like I had previously done with MPT, I published a nerfed miniature low-memory version of this model trained on the tinyshakespeare dataset: https://huggingface.co/jploski/falcon-mini-shakespeare (you can see from config.json that I removed most layers and attention heads and changed the hidden dim to 128, creating a model with <10M parameters).

ichsan2895 commented 1 year ago

I did edit a script to convert 7B Falcon to ggml, and the quantization part taken from bloomz.cpp works fine, but someone needs to modify main.cpp and load it to check: convert-7b-falcon

The reason for using the "None" branch:

if alibi is None:
            query_layer_ = query_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)
            key_layer_ = key_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)
            value_layer_ = value_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)

            attn_output = F.scaled_dot_product_attention(
                query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
            )

Update: probably shouldn't just keep q, k, v ???

Thank you, I have recently managed to create a ggml file for Falcon-7B with your code convert-7b-falcon

I ran it with this command until it finished

python3 convert-falcon-7b-to-ggml.py tiiuae/falcon-7b ./models

But, unfortunately, it won't run:

$ ./main -m ./models/ggml-model-falcon-7b-f16.bin -t 8 -n 128

main: seed = 1685732638
bloom_model_load: loading model from './models/ggml-model-falcon-7b-f16.bin' - please wait ...
bloom_model_load: n_vocab = 65024
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4544
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 71
bloom_model_load: n_layer = 32
bloom_model_load: f16     = 1
bloom_model_load: n_ff    = 18176
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 15595.67 MB
bloom_model_load: memory_size =   568.00 MB, n_mem = 16384
bloom_model_load: loading model part 1/1 from './models/ggml-model-falcon-7b-f16.bin'
bloom_model_load: tok_embeddings.weight[0] = 4544 x 65024
layers.0.attention_norm.weight[0] = 4544
layers.0.attention_norm.bias[0] = 4544
bloom_model_load: unknown tensor 'layers.0.attention.wv.weight' in model file
main: failed to load model from './models/ggml-model-falcon-7b-f16.bin'

lantiga commented 1 year ago

Hey we just ported Falcon 7B and 40B to lit-parrot (single-file implementation), feel free to take a look

FNsi commented 1 year ago

But, unfortunately, it won't run

You need to modify main.cpp to load it. Also weird, because the bias tuple was not the same when I was converting.

nikisalli commented 1 year ago

Hey we just ported Falcon 7B and 40B to lit-parrot (single-file implementation), feel free to take a look

From what I see it's just a wrapper around torch and GPTQ. I don't understand what useful information there is in this port?

nikisalli commented 1 year ago

By the way, I found a setup I'm comfortable with for testing and comparing my ggml implementation of Falcon 40B with the torch one (I basically load just one block to run it on my silly little GTX 1050 4GB). I found that Falcon uses some tricky tensor math in its attention head splitting implementation; it will take some time to get it right, since it is a mess of many 4D views.

lantiga commented 1 year ago

Hey we just ported Falcon 7B and 40B to lit-parrot (single-file implementation), feel free to take a look

From what I see it's just a wrapper around torch and GPTQ. I don't understand what useful information there is in this port?

I just thought the model implementation could be useful to look at as an extra source. Sorry for the noise if it’s not of help.

someone13574 commented 1 year ago

Would support directly in llama.cpp be considered, or would it be best to keep llama.cpp supporting only LLaMA models and keep other architectures in separate projects or forks?

cmp-nct commented 1 year ago

Would support directly in llama.cpp be considered, or would it be best to keep llama.cpp supporting only LLaMA models and keep other architectures in separate projects or forks?

That's a call @ggerganov has to make; I'd guess first a fork, later a smarter ggllm version. However, first we need the two models running, and that would best be done by someone who's quite experienced in Torch and ggml. I gave it a failed try; I don't know how to get multi-query in properly, and debugging issues is just a nightmare.

The only positive side is that multi-query looks worthwhile to support in general, so once we have it we can also train and tune for it.

ichsan2895 commented 1 year ago

But, unfortunately, it won't run

You need to modify main.cpp to load it, ~also weird, because the bias tuple was not the same when I was converting~

How do I do that? Which code must I modify? Sorry, I am a noob at this.

FNsi commented 1 year ago

How do I do that? Which code must I modify? Sorry, I am a noob at this.

Emm, I don't know either; still working on ggml.c 😂

iHaagcom commented 1 year ago

Did you try using bloomz.cpp?