Closed someone13574 closed 1 year ago
First we need to implement ggml
Mind elaborating on that? It does not seem to make sense in context.
From what I read (I've not tested it), the model seems significantly better than LLaMA. While it has a kind of shitty license for commercial growth (free until 1MM/y revenue, then 10%), it's better than illegal.
It's using flash attention and multiquery. gg already has branches with flashattention. I don't see that "implementation barrier" ?
I've just invested almost an hour of prompting into Falcon 40B Instruct and it's significantly smarter than OpenAssistant 30B, despite being less well tuned. It is smarter than Turbo in some of the tests I ran, though not as good as Turbo overall. I need to develop new tests now, as Falcon-40B can beat all of those I currently had in the "Legacy/GPT-4 only" section.
there's a guy who provided a q4b version of Falcon7B, would it be of some use for llama.cpp ?
Falcon has the full precision binaries available here: https://huggingface.co/tiiuae/falcon-40b/tree/main https://huggingface.co/tiiuae/falcon-40b-instruct https://huggingface.co/tiiuae/falcon-7b https://huggingface.co/tiiuae/falcon-7b-instruct https://huggingface.co/tiiuae/falcon-rw-1b
From there it should start, the pre-quantized versions are not useful imho.
I'm not 100% sure yet, but from my tests I believe we have a superior successor to LLaMA on our hands that covers all our use cases (from small to large). I also tried some bias tests (given its origin): Falcon 40B Instruct is surprisingly unbiased. It felt like a bit of Turbo or GPT-4 "tuning" went into it ('As an AI model'). It remains to be tested and compared in detail, of course.
It solved riddles that Turbo, Alpaca and OpenAssistant 30B cannot solve.
Cautiously put: it looks like the 40B Falcon might outperform the largest 65B LLaMA (it does so in the benchmarks).
I don't know why I'm not able to convert it to .ggml, like other models.
Loading model file /mnt/m/llama_model/falcon-40b/pytorch_model-00009-of-00009.bin
Traceback (most recent call last):
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1168, in <module>
main()
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1148, in main
model_plus = load_some_model(args.model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1076, in load_some_model
model_plus = merge_multifile_models(models_plus)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 583, in merge_multifile_models
model = merge_sharded([mp.model for mp in models_plus])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 562, in merge_sharded
return {name: convert(name) for name in names}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 562, in <dictcomp>
return {name: convert(name) for name in names}
^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 537, in convert
lazy_tensors: List[LazyTensor] = [model[name] for model in models]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 537, in <listcomp>
lazy_tensors: List[LazyTensor] = [model[name] for model in models]
~~~~~^^^^^^
KeyError: 'transformer.word_embeddings.weight'
@danmaxis
I don't know why I'm not able to convert it to .ggml, like other models.
Because it is a different type of model. LLaMA based models have a certain structure. Falcon is not based on LLaMA, there's a different set of tensors, the tensors have different names, etc.
The conversion app can't handle Falcon models yet.
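The traceback above fits this explanation. Here is a minimal sketch of the failure mode, assuming the merge step expects every shard to contain every tensor name (which is what the `[model[name] for model in models]` line in the traceback suggests). The shard contents are made up for illustration, and the second tensor name is hypothetical:

```python
# Toy stand-ins for two checkpoint shards. In a per-tensor split, each tensor
# lives in exactly one file, so no single shard has every name.
shard_a = {"transformer.word_embeddings.weight": [0.1, 0.2]}
shard_b = {"transformer.h.0.example.weight": [0.3, 0.4]}  # hypothetical name
shards = [shard_a, shard_b]


def merge_assuming_every_shard_has_every_tensor(models):
    """Mimics the lookup pattern in the traceback: model[name] for each shard."""
    names = {name: None for model in models for name in model}
    return {name: [model[name] for model in models] for name in names}


def merge_by_union(models):
    """Alternative: treat shards as disjoint pieces and take their union."""
    merged = {}
    for model in models:
        merged.update(model)
    return merged


try:
    merge_assuming_every_shard_has_every_tensor(shards)
    failed_name = None
except KeyError as err:
    failed_name = err.args[0]

print("lookup failed for:", failed_name)
print("union merge has:", sorted(merge_by_union(shards)))
```

Even if the shards were merged by union, the tensor names would still have to be mapped to the names the LLaMA evaluation code expects, so this only covers the first obstacle.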
@KerfuffleV2 can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say GPT-3? Will be super grateful!
How much of all the work done in this repo could easily be transferred to future models and architectures?
It looks like the happy days of the original LLaMA models may soon be over, as it starts to get beaten by models with different architectures and more attractive licensing. Open LLM Leaderboard
As the flora of LLM architectures will continue to grow and new ones will replace the old, I think this repo and the LLM examples in the ggml repo should be merged into something like ggml_llm.
The ggml_llm would contain all the common LLM code (main inference / perplexity / filehandling / quantization / sampling ..) and the code for each architecture could be like plugins added at compile time. The gpt4all-backend may be a good starting point for how such structure could be built.
https://github.com/ggerganov/ggml/issues/185 https://github.com/ggerganov/ggml/pull/145#issuecomment-1544733902
@jessejohnson
can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say GPT-3?
I don't want to get too offtopic here, so if you want detailed information you'd probably be better off creating a discussion. I also don't really know the specific architecture of GPT-3, etc., so I can't tell you the exact way two specific types of model differ, just provide some general information.
This is a bit simplified, but a model consists of a bunch of tensors (just big arrays of numbers in various dimensions). The tensors generally have names, like transformer.word_embeddings.weight. Models also usually are set up with some main level tensors and then a set of tensors that are repeated in a number of layers. So you might have main_tensor and then layer.0.tensor1, layer.0.tensor2, layer.1.tensor1, etc. How the tensors are named depends on both the model architecture and the file format. GGML might call the same tensor a different thing from the HuggingFace format.
Anyway, to actually run a model one performs a bunch of math operations on those tensors. Some of the operations are simple like addition, multiplication, some are more complex and can have complicated logic internally like rope, alibi, matrix multiplication, etc.
Which tensors exist in a model and what sequence of those math operations are used to evaluate the model depends on the model architecture. While a LLaMA based model might have main_tensor + layer.0.tensor2 * layer.0.tensor1 * 1.321, a Falcon model might have layer.0.first.weight / (main_bias * 0.5) + layer.0.second.bias or whatever. I just made up completely random names there, they don't actually relate to anything.
The code in something like this project which evaluates a type of model it supports (say LLaMA for example) is set up to look for tensors with specific names, grab that data, perform the various operations in the correct order and then it also expects the result from those operations to be in a specific format as well.
Hopefully this makes it more clear why specific support needs to be added to ML tools to support models that actually have a different architecture.
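To make the naming point concrete, here is a small sketch that groups tensor names into top-level and per-layer tensors, then reports which names a loader written for another architecture would fail to find. All names are the made-up ones from the explanation above (plus the equally made-up `main_bias` and `first.weight`), and the `layer.N.` prefix is an assumed naming scheme, not the one any real format uses:

```python
import re
from collections import defaultdict

# Hypothetical names in the style of the made-up examples above.
tensor_names = [
    "main_tensor",
    "layer.0.tensor1",
    "layer.0.tensor2",
    "layer.1.tensor1",
    "layer.1.tensor2",
]

LAYER_PATTERN = re.compile(r"^layer\.(\d+)\.(.+)$")


def split_by_layer(names):
    """Separate top-level tensors from tensors repeated in each layer."""
    top_level = []
    per_layer = defaultdict(list)
    for name in names:
        match = LAYER_PATTERN.match(name)
        if match:
            per_layer[int(match.group(1))].append(match.group(2))
        else:
            top_level.append(name)
    return top_level, dict(per_layer)


def missing_tensors(names, expected_top, expected_per_layer, n_layer):
    """List every tensor a loader for this architecture would fail to find."""
    available = set(names)
    wanted = list(expected_top)
    for i in range(n_layer):
        wanted += [f"layer.{i}.{suffix}" for suffix in expected_per_layer]
    return [name for name in wanted if name not in available]


top, layers = split_by_layer(tensor_names)
print(top, layers)

# A loader written for a different architecture expects different names.
print(missing_tensors(tensor_names, ["main_bias"], ["first.weight"], n_layer=2))
```

A loader that expects `main_tensor`, `tensor1` and `tensor2` finds everything here. One that expects the other architecture's names finds nothing it needs, which is roughly the situation of the LLaMA converter facing a Falcon checkpoint.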
Thanks @KerfuffleV2, this is exactly what I was looking for!
I took a look and Falcon is Bloom based, uses GPT-NeoX rot embeddings, gelu activation https://huggingface.co/tiiuae/falcon-40b/commit/e7950c40d6bc9caca678af160de9c79f33f93699 It looks like most of it is covered in https://github.com/NouamaneTazi/bloomz.cpp already.
Though looks like a bit of a nightmare to adapt everything :(
Can bloomz.cpp run this model?
Not without adaptation. I've not looked into the differences (aside from the parameter and layer counts) but there certainly are some. Also bloomz is barebones: no GPU support, etc. It would be a nice first step to get it running there, but llama.cpp is the platform with all the features.
while it has a kind of shitty license for commercial growth (free until 1MM/y revenue, then 10%) it's better than illegal.
As of 3 hours ago, they tweeted that they will forgo any royalties for commercial and research uses. I don't know what this means in practice, but Falcon might become the first capable, genuinely open-source model we get.
They've just updated their Huggingface to confirm that the models are now available under Apache 2.0: https://huggingface.co/tiiuae .
According to their announcement on the official site, it's the Falcon 40B that is now under Apache 2.0. Not sure if they intend to do the same for the smaller models, or if they plan an even larger, license-restricted one.
They updated the main page, not the model pages yet. They are just a bit slow to follow up but it looks like we get a full open source model. Best thing ever exported from Abu Dhabi ?
All models and datasets from them are now confirmed to be Apache 2.0. The model repositories still contain the old license.txt, but the models themselves are tagged Apache.
With Falcon-40B being significantly better than LLaMA-65B, and actually being fully open source under Apache 2.0, it's definitely the new king of open source LLMs. It would be great to see support for it in llama.cpp!
I was actually able to convert, quantize and load the model, but there is some tensor math to debug and modify, and I have no 40GB GPU to debug the tensor values at each layer! So it produces garbage for now.
I can give you the quantized model if you want to continue my work.
Great work! Why don't you start with the 7B model instead? It should require less memory.
@klosax it is still too big! To debug the weights, the model needs to be loaded in fp16 on the GPU. This means that a 24GB GPU is needed in the case of the 7B model, and I do not possess one.
Truthfully though the initial Falcon work should be done on 7B to ease development; I think the architecture is the same regardless of model size. If it gets traction I'm sure someone with a big GPU will hop in and help with the 40B :hugs:
Like it or not, Llama is limited by its legality and truly open models like Falcon are the way forwards for llama.cpp.
@nikisalli : On the model card it says "head_dim 64 Reduced to optimise for FlashAttention" but in the config.json the number is 128. Maybe try reducing it to 64?
@nikisalli what do you need the GPU for? Why not CPU? ggml/llama.cpp is known for its ability to run on CPU, after all...
I find it useful to run the pytorch model with many print statements here and there to check that ggml is giving me the same numbers so that I know what operations to touch
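A rough sketch of that comparison workflow, assuming both implementations can dump the same intermediate values as flat lists of floats in the same order. The checkpoint names, the values and the tolerance are all made up for illustration:

```python
import math


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat float lists."""
    if len(reference) != len(candidate):
        raise ValueError("shape mismatch: %d vs %d" % (len(reference), len(candidate)))
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def first_divergence(reference_dumps, candidate_dumps, tolerance=1e-3):
    """Return the name of the first checkpoint whose values differ, else None.

    Both arguments are lists of (name, flat_values) in evaluation order.
    """
    for (ref_name, ref_vals), (cand_name, cand_vals) in zip(reference_dumps, candidate_dumps):
        if ref_name != cand_name:
            raise ValueError(f"checkpoint order differs: {ref_name} vs {cand_name}")
        if any(math.isnan(v) for v in cand_vals):
            return cand_name
        if max_abs_diff(ref_vals, cand_vals) > tolerance:
            return cand_name
    return None


# Made-up dumps: the second checkpoint is where the two runs stop agreeing.
reference = [("layernorm", [0.5, -1.0]), ("qkv", [0.25, 0.75]), ("attn_out", [1.0, 2.0])]
candidate = [("layernorm", [0.5, -1.0]), ("qkv", [0.25, 0.5]), ("attn_out", [9.0, 9.0])]
print(first_divergence(reference, candidate))
```

Reporting only the first divergent checkpoint is deliberate: everything downstream of a wrong operation is wrong too, so later mismatches carry little extra information. The tolerance would need to be looser when comparing half-precision against 32-bit results.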
Oh, you are running the Python one. My bad. But still, you should be able to force CPU mode.
nope :( some layers are not implemented for cpu and half precision!
It's bf16 and I can't run it on my device either.
I also struggled, didn't get it to run yet. There are significant differences in the attention/kqv handling between 7B and 40B:
Without multi_query (40B):
self.query_key_value = Linear(
    self.hidden_size,
    (config.n_head_kv * 2 + config.n_head) * self.head_dim,
    bias=config.bias,
)
self.dense = Linear(self.hidden_size, self.hidden_size, bias=config.bias)
self.attention_dropout = nn.Dropout(config.attention_dropout)
self.num_kv = config.n_head_kv
With multi_query (7B):
self.query_key_value = Linear(
    self.hidden_size,
    3 * self.hidden_size if not config.multi_query else (self.hidden_size + 2 * self.head_dim),
    bias=config.bias,
)
self.multi_query = config.multi_query
self.dense = Linear(self.hidden_size, self.hidden_size, bias=config.bias)
self.attention_dropout = nn.Dropout(config.attention_dropout)
self.num_kv = 1
The relevant config for both:
Config without multiquery (40B):
"hidden_size": 8192,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "RefinedWeb",
"n_head": 128,
"n_head_kv": 8,
"n_layer": 60,
"parallel_attn": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.27.4",
"use_cache": true,
"vocab_size": 65024
Config with multiquery (7B):
"hidden_size": 4544,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "RefinedWebModel",
"multi_query": true,
"n_head": 71,
"n_layer": 32,
"parallel_attn": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.27.4",
"use_cache": true,
"vocab_size": 65024
In the conversion python module for 7B we'll also need the conv_map changed:
'input_layernorm' : 'attention_norm', # 7B
The handling of the k, q, v reshape is also different for both.
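As a sanity check on the two snippets, the width of the fused query_key_value projection can be worked out from the config values quoted above. This is only a sketch: it assumes `head_dim` is `hidden_size // n_head`, which the quoted snippets do not show, and the function names are made up. Under that assumption both models come out at a head_dim of 64, and the widths match the 9216 and 4672 values mentioned later in the thread:

```python
def head_dim(hidden_size, n_head):
    """Per-head width; assumes hidden_size divides evenly by n_head."""
    if hidden_size % n_head != 0:
        raise ValueError("hidden_size must be divisible by n_head")
    return hidden_size // n_head


def fused_qkv_width_40b(hidden_size, n_head, n_head_kv):
    """(n_head_kv * 2 + n_head) * head_dim, as in the 40B snippet."""
    return (n_head_kv * 2 + n_head) * head_dim(hidden_size, n_head)


def fused_qkv_width_7b(hidden_size, n_head, multi_query):
    """3 * hidden_size, or hidden_size + 2 * head_dim with multi_query."""
    if not multi_query:
        return 3 * hidden_size
    return hidden_size + 2 * head_dim(hidden_size, n_head)


# Values taken from the two config excerpts above.
print(head_dim(8192, 128))                 # 64
print(fused_qkv_width_40b(8192, 128, 8))   # 9216
print(head_dim(4544, 71))                  # 64
print(fused_qkv_width_7b(4544, 71, True))  # 4672
```

If that assumption holds, the 128 in the 40B config.json is the number of heads (`n_head`) rather than the head dimension, which would be consistent with the model card's head_dim of 64.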
I was actually able to convert, quantize and load the model, but there is some tensor math to debug and modify, and I have no 40GB GPU to debug the tensor values at each layer! So it produces garbage for now.
I can give you the quantized model if you want to continue my work.
Hi, I'll be super happy to have access to your quantized version of the 40B model if you can share it.
Be aware that the below is not useful anymore; please use apaga43's ggml example instead.
I edited a script to convert 7B Falcon to ggml, and the quantization part, taken from bloomz.cpp, works fine. But someone needs to modify main.cpp and load the model to check it. convert-7b-falcon
Reason use the "none"
if alibi is None:
    query_layer_ = query_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)
    key_layer_ = key_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)
    value_layer_ = value_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)
    attn_output = F.scaled_dot_product_attention(
        query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
    )
~~Update: the problem I am facing is that the element count of WV (or whatever you call it) is equal to tensor.ne[1] * tensor.ne[2] * tensor.ne[3],~~ which I think is not right, because the others are tensor.ne[0] * tensor.ne[1] * tensor.ne[2] * tensor.ne[3].
If I force it to load, there's an iostream error which eventually causes a segmentation fault.
~~Update: I did the same trick as nikisalli did; now it can be loaded, but only loaded. I think it's because of attention, or the ne[] order should be reversed, or just because I haven't ported all the functions from modelling_RW.py yet 😂 Feel free to... guess? (I modified the Python script too, but I won't change it here for now; it may still be right. If you want to use that main.cpp, just delete wq, wk, wv.) falcon-7b-main.cpp~~
~~Update: replaced ugly hard-coded values like 9216 or 4672 with "2*n_embd/n_heads + n_embd", but it still repeats.~~
Thanks.
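For the wq/wk/wv question, here is a sketch of how one token's fused projection could be split in the multi_query case. The layout is an assumption on my part (all query heads first, then a single key chunk, then a single value chunk) and should be checked against modelling_RW.py before relying on it. The function name and the toy sizes are made up:

```python
def split_fused_qkv_multi_query(fused_row, n_head, head_dim):
    """Split one token's fused projection into query heads, one key, one value.

    Assumed layout: n_head query chunks, then one key chunk, then one value
    chunk, each head_dim wide.
    """
    expected = (n_head + 2) * head_dim
    if len(fused_row) != expected:
        raise ValueError(f"expected width {expected}, got {len(fused_row)}")
    chunks = [fused_row[i * head_dim:(i + 1) * head_dim] for i in range(n_head + 2)]
    queries = chunks[:n_head]   # one chunk per query head
    key = chunks[n_head]        # shared by every query head
    value = chunks[n_head + 1]  # shared by every query head
    return queries, key, value


# Tiny made-up sizes so the result is easy to read: 3 heads of width 2.
row = list(range((3 + 2) * 2))
q, k, v = split_fused_qkv_multi_query(row, n_head=3, head_dim=2)
print(q)  # [[0, 1], [2, 3], [4, 5]]
print(k)  # [6, 7]
print(v)  # [8, 9]

# With the 7B config values, the fused width is n_embd + 2 * n_embd / n_head.
n_embd, n_head = 4544, 71
print(n_embd + 2 * n_embd // n_head)  # 4672
```

Under this layout there are no separate wq, wk and wv tensors of size n_embd x n_embd, which may be why a loader that expects them reports a size mismatch.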
Hi, I'll be super happy to have access to your quantized version of the 40B model if you can share it.
how can I share it with you?
upload it to huggingface?
@nikisalli I don't have the expertise to continue your work myself right now but if you're open to it I can get you access to a large GPU server to continue your work? We have access to V100 and V100s - I think they have 90GB if I recall right
Does anyone have any information about the context size of the falcon models? I couldn't find anything, besides a tweet claiming a context size of 8192 https://twitter.com/max_paperclips/status/1662208170247467009 but with no other source than a link to the huggingface repo.
https://huggingface.co/tiiuae/falcon-40b#model-architecture-and-objective
Sequence length: 2048
Sequence length should be the max useful context size value.
Note that just like for MPT previously (https://github.com/ggerganov/ggml/pull/145), the implementation approach should be to first create a working inference for this model in the ggml repository (https://github.com/ggerganov/ggml/tree/master/examples). This will require porting of modelling_RW.py released with the Falcon-7B model, which implements the model architecture/algorithms (there's already an open issue for that: https://github.com/ggerganov/ggml/issues/217)
FWIW, just like I had previously done with MPT, I published a nerfed miniature low-memory version of this model trained on the tinyshakespeare dataset: https://huggingface.co/jploski/falcon-mini-shakespeare (you can see from config.json that I removed most layers and attention heads, and changed the hidden dim to 128, creating a model with <10M parameters).
I did edit a script convert 7b falcon to ggml, and the quantised part original from bloomz.cpp works fine. but need someone to modify main.cpp load it to check.convert-7b-falcon
Reason use the "none"
if alibi is None:
    query_layer_ = query_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)
    key_layer_ = key_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)
    value_layer_ = value_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)
    attn_output = F.scaled_dot_product_attention(
        query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
    )
Update: probably shouldn't just keep q, k, v ???
Thank you, I have successfully created a ggml file for Falcon-7B with your code convert-7b-falcon
I ran it with this command until it finished
python3 convert-falcon-7b-to-ggml.py tiiuae/falcon-7b ./models
But, unfortunately, it won't run
$ ./main -m ./models/ggml-model-falcon-7b-f16.bin -t 8 -n 128
main: seed = 1685732638
bloom_model_load: loading model from './models/ggml-model-falcon-7b-f16.bin' - please wait ...
bloom_model_load: n_vocab = 65024
bloom_model_load: n_ctx = 512
bloom_model_load: n_embd = 4544
bloom_model_load: n_mult = 1
bloom_model_load: n_head = 71
bloom_model_load: n_layer = 32
bloom_model_load: f16 = 1
bloom_model_load: n_ff = 18176
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 15595.67 MB
bloom_model_load: memory_size = 568.00 MB, n_mem = 16384
bloom_model_load: loading model part 1/1 from './models/ggml-model-falcon-7b-f16.bin'
bloom_model_load: tok_embeddings.weight[0] = 4544 x 65024
layers.0.attention_norm.weight[0] = 4544
layers.0.attention_norm.bias[0] = 4544
bloom_model_load: unknown tensor 'layers.0.attention.wv.weight' in model file
main: failed to load model from './models/ggml-model-falcon-7b-f16.bin'
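The memory numbers in this log can be reproduced from the hyperparameters it prints. This sketch assumes the loader keeps K and V as 32-bit floats with n_embd values per position per layer; that is my reading of the numbers, not something the log states. The function name is made up:

```python
def kv_cache_bytes(n_layer, n_ctx, width, bytes_per_element=4):
    """Bytes for K and V caches: 2 tensors * n_layer * n_ctx * width elements."""
    return 2 * n_layer * n_ctx * width * bytes_per_element


MB = 1024 * 1024
n_layer, n_ctx, n_embd, n_head = 32, 512, 4544, 71  # values from the log

print(n_layer * n_ctx)  # 16384, matches n_mem in the log

full = kv_cache_bytes(n_layer, n_ctx, n_embd)
print(full / MB)  # 568.0, matches memory_size in the log

# With multi-query attention, a single shared K/V head of width
# n_embd // n_head per layer would be enough.
shared = kv_cache_bytes(n_layer, n_ctx, n_embd // n_head)
print(shared / MB)  # 8.0
```

So the loader is sizing the cache as if every head had its own keys and values, as Bloom does. A Falcon-7B port with multi_query could in principle get by with a much smaller cache, and that difference is part of what main.cpp would need to account for.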
Hey we just ported Falcon 7B and 40B to lit-parrot (single-file implementation), feel free to take a look
But, unfortunately, it won't run
You need to modify the main.cpp to load it. Also weird, because the bias part tuple is not the same when I was converting.
Hey we just ported Falcon 7B and 40B to lit-parrot (single-file implementation), feel free to take a look
from what I see it's just a wrapper of torch and gptq. I don't understand what's the useful information for this port in it?
By the way, I found a setup I'm comfortable with for testing and comparing my ggml implementation of Falcon 40B with the torch one (I basically load just one block so it runs on my silly little GTX 1050 4GB). I found that Falcon uses some tricky tensor math in its attention head splitting implementation. It will take some time to get right, since it is a mess of many 4D views.
Hey we just ported Falcon 7B and 40B to lit-parrot (single-file implementation), feel free to take a look
from what I see it's just a wrapper of torch and gptq. I don't understand what's the useful information for this port in it?
I just thought the model implementation could be useful to look at as an extra source. Sorry for the noise if it’s not of help.
Would support directly in llama.cpp be considered or would it be best to keep llama.cpp only in support of llama models and keep other architectures in separate projects or forks?
That's a call @ggerganov has to make, I'd guess first a fork later a smarter ggllm version. However, first we need the two models running .. and that would be best to be done by someone who's quite experienced in Torch and ggml. I gave it a failed try, I don't know how to get the multi query properly in and debugging issues is just a nightmare.
The only positive side is that multi query looks worthwhile in general to support, so once we have it we can also train and tune for it.
But, unfortunately, it won't run
you need to modify the main.cpp to load it, ~also weird because the bias part tuple is not the same when I was converting~
How do I do that? Which code must I modify? Sorry, I am a noob at this...
Emm, I don't know either, still working on ggml.c 😂
model
Did you try using bloomz.cpp?
Falcon LLM 40b and 7b were just open sourced under a license which allows commercial use (with royalties for over $1 million revenue per year) and are topping the Huggingface Open LLM leaderboard. It seems to be based on a modified gpt3 architecture. I'm wondering if support in llama.cpp would be considered. https://huggingface.co/tiiuae/falcon-40b