cmp-nct / ggllm.cpp

Falcon LLM ggml framework with CPU and GPU support

Unable to run TheBloke Falcon40b-instruct #1

Open iHaagcom opened 1 year ago

cmp-nct commented 1 year ago

Convert it with falcon_convert_demo.py. You are likely using an old version that is not made for the new KQV tensor.

iHaagcom commented 1 year ago

Using the model from thebloke

cmp-nct commented 1 year ago

Using the model from thebloke

I believe the first models of TheBloke used a very early ggml conversion; it did not yet split and reshape the KQV tensor, so it's incompatible. You will find a demo HF-to-GGML Python script in this repo which can convert the original HF binaries into a GGML V0 format that will work.
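
A hedged sketch of that conversion step, assuming falcon_convert_demo.py takes the same arguments as the falcon_convert.py invocations quoted later in this thread (HF model directory, output directory, and a numeric flag); all paths are placeholders:

python3 falcon_convert_demo.py /path/to/falcon-40b-hf /path/to/ggml-out 1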

iHaagcom commented 1 year ago

Using the model from thebloke

I believe the first models of TheBloke used a very early ggml conversion; it did not yet split and reshape the KQV tensor, so it's incompatible. You will find a demo HF-to-GGML Python script in this repo which can convert the original HF binaries into a GGML V0 format that will work.

Okay, no worries. I'm unable to compile ggml-falcon, which seems to be the only branch that works with that model. Unsure why; on Windows, MinGW reports this error: ggml.c(1442): fatal error C1003: error count exceeds 100, so that could be one reason behind its failure.

https://github.com/jploski/ggml/tree/falcon40b

cmp-nct commented 1 year ago

Try this:

  1. Get Visual Studio Code and install the cmake addon and C++ tools
  2. get a fresh clone of this repository
  3. use cmake in VScode to compile

It's a bit easier with the IDE
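
For reference, a rough command-line equivalent of the steps above (a sketch assuming a standard out-of-source CMake build; generator and options may differ on your system):

git clone https://github.com/cmp-nct/ggllm.cpp
cd ggllm.cpp
cmake -B build
cmake --build build --config Release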

iHaagcom commented 1 year ago

Try this:

  1. Get Visual Studio Code and install the cmake addon and C++ tools
  2. get a fresh clone of this repository
  3. use cmake in VScode to compile

It's a bit easier with the IDE

Tried it with the full IDE; I'll just use it on Linux for now, I think. (screenshot: IMG_1873)

Or use your build and create a new ggml of WizardLM-Falcon and Falcon40b

I appreciate the reply, thank you.

cmp-nct commented 1 year ago

Creating the GGML is a very quick process; it takes just a few minutes if you have the HF binaries available, and quantizing is even faster.

TheBloke commented 1 year ago

Using the model from thebloke

I believe the first models of TheBloke used a very early ggml conversion; it did not yet split and reshape the KQV tensor, so it's incompatible. You will find a demo HF-to-GGML Python script in this repo which can convert the original HF binaries into a GGML V0 format that will work.

To be exact I used the code from @jploski 's GGML branch as of his latest commit two days ago: https://github.com/jploski/ggml/commit/8b22ea87669bc3a011507bb29e75dc7274b3a182

If the code has changed I will re-do them all

So is the development for Falcon GGML now under llama.cpp rather than GGML? I knew you guys were working on getting it in llama.cpp I just wasn't sure if that was a parallel effort or if this is now the only codebase I should consider.

Let me know and I'll re-do the models this evening

cmp-nct commented 1 year ago

Using the model from thebloke

I believe the first models of TheBloke used a very early ggml conversion; it did not yet split and reshape the KQV tensor, so it's incompatible. You will find a demo HF-to-GGML Python script in this repo which can convert the original HF binaries into a GGML V0 format that will work.

To be exact I used the code from @jploski 's GGML branch as of his latest commit two days ago: jploski/ggml@8b22ea8

If the code has changed I will re-do them all

So is the development for Falcon GGML now under llama.cpp rather than GGML? I knew you guys were working on getting it in llama.cpp I just wasn't sure if that was a parallel effort or if this is now the only codebase I should consider.

Let me know and I'll re-do the models this evening

Hi, I just downloaded your model and it will work in this release, but it's wrongly labelled as "ggmlv3"; the format is actually "ggmlv0". So re-doing them makes sense in any case: by using this project you'll get V3 binaries, which support mmap'd tensors (loading time is almost instant on subsequent runs).

Now, from the development point of view, I can't speak for everyone, just for myself: I started this because I view Falcon as the most important model out there at the moment, and in the ggml examples it's a dead end. The ggml examples are minimalistic by design; what we need is the opposite, which is currently only available in llama.cpp. (The examples are what made this release possible.) So we'll soon see GPU acceleration in this fork; the new K-type quantizers are also available, and the quantization here performs better, as ggml hasn't seen many updates. (Careful: the K-type quantizers do not work for 7B; for 40B they seem to work fine.)

So currently this is the most advanced option for Falcon models.

In any case, binaries will likely change in the future. There is a lot of development still to do, and some of it will affect the binaries.

It's important to note:

  1. The GGML example you used will not work with the V3 GGML binaries produced in this repo.
  2. The 16/32-bit binary, when produced by the Python script, will be compatible (it's currently the same Python code).
  3. If you also want a V3 16/32-bit variant, you can run falcon_quantize.exe with type 1/0 on the V0 GGML, which essentially just stores it in the new format (see the sketch below).
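
A sketch of point 3, assuming falcon_quantize takes the input file, output file and type number as arguments (file names are placeholders):

# type 1 = 16-bit output, type 0 = 32-bit output; the result is a V3 (mmap-capable) binary
falcon_quantize.exe ggml-falcon-40b-v0-f32.bin ggml-falcon-40b-v3-f16.bin 1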

TheBloke commented 1 year ago

OK thanks for the explanations. Yes I will re-do them using this fork then.

Is it your plan/expectation that this will be merged into llama.cpp at some point?

cmp-nct commented 1 year ago

OK thanks for the explanations. Yes I will re-do them using this fork then.

Is it your plan/expectation that this will be merged into llama.cpp at some point?

Currently the interest on the llama.cpp side appears low in that regard, but that might change when people start switching to the Falcon models. The focus of llama.cpp in the past weeks has mostly been on backend/ggml improvements.

It would be great to see this officially supported, given Falcon's legal and truly open-source status. Though maybe we'll just keep the two projects in parallel.

maddes8cht commented 1 year ago

@TheBloke will you do a 7B Falcon too? For testing, the 7B will still be appropriate. (And I'm still having trouble converting it myself on the slow machine I need it on.)

TheBloke commented 1 year ago

Yeah sorry, I should do that. I'll do it today.

@cmp-nct remind me - I can't do any k-quants with 7B, but the original q4_0, q4_1, q5_0, q5_1, q8_0 will work?

cmp-nct commented 1 year ago

Yeah sorry, I should do that. I'll do it today.

@cmp-nct remind me - I can't do any k-quants with 7B, but the original q4_0, q4_1, q5_0, q5_1, q8_0 will work?

Yes, all non-K quants are working fine. With the latest version you cannot make wrong quantizations anymore; it will abort.

Speed is quite nice with 7B; I'm getting 33 tokens/second on Q8_0 (same on Q5 and Q4). Of course it will slow down, as that issue is not fixed (15/second at 512 tokens of generation).

TheBloke commented 1 year ago

Also just to check - you say the Python code is the same, so falcon_convert_demo.py is the same as the convert-hf-to-ggml.py from the GGML repo? So I could use either, no?

@TheBloke will you do a 7B Falcon too? For testing, the 7B will still be appropriate. (And I'm still having trouble converting it myself on the slow machine I need it on.)

Sorry for the delay, they are now uploaded.

cmp-nct commented 1 year ago

I've based it on the latest ggml Falcon 40B example that was available. AFAIK it was not accepted into official ggml because it uses a new repeat function related to self-attention, which gg was not ready to incorporate and which currently has no workaround. There are other convert-hf-to-ggml.py scripts for Falcon which won't work.

I recommend always using the latest Python script and quantizer from this repo. I also have improvements in mind for quantizing Falcon, though raw performance needs more attention first.

jploski commented 1 year ago

Do we still have the alignment/padding problem which prevents mmap from working correctly? If so, that is something to be addressed in the conversion script, isn't it?

cmp-nct commented 1 year ago

Do we still have the alignment/padding problem which prevents mmap from working correctly? If so, that is something to be addressed in the conversion script, isn't it?

I think it's more than just that; the Python converter produces an old model version.

However, as soon as you process the Python result with the ggllm.cpp quantizer you will have a V3 binary (it also supports 16- and 32-bit output). So the only update to the Python script I'd see as relevant is the token scoring, if there is any benefit to it. I'd not invest more time into python code if our CPP quantizer can do it better.

jploski commented 1 year ago

I'd not invest more time into python code if our CPP quantizer can do it better.

I agree, maybe we can add an info message to the script that the output should be post-processed, though. Because the usual expectation is probably that the 16/32-bit GGML file produced by the Python conversion script is "good to go".

cmp-nct commented 1 year ago

I'd not invest more time into python code if our CPP quantizer can do it better.

I agree, maybe we can add an info message to the script that the output should be post-processed, though. Because the usual expectation is probably that the 16/32-bit GGML file produced by the Python conversion script is "good to go".

Good idea. Do you know about the token scoring? I did not dig into that part yet, because if it is not relevant I'd remove the warning from falcon_quantize.

The information on how to create models is also in the README (a combined example follows the excerpt):

Conversion:

  1. use falcon_convert_demo.py to produce a GGMLv0 binary from HF - not recommended to be used directly
  2. use examples/falcon_quantize to convert these into GGMLv3 binaries of your choice, including mmap support from there on

Important: the Falcon 7B model features tensor sizes that do not support K-type quantizers - use the traditional quantization for those.
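
Put together, a hedged example of those two steps (paths, output file names, and the final quantization-type number are placeholders; argument order follows the invocations quoted later in this thread):

python3 falcon_convert_demo.py /path/to/falcon-7b-hf /path/to/out 1
build/bin/falcon_quantize /path/to/out/ggml-model-f32.bin /path/to/out/falcon-7b-quantized.bin <type>
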
jploski commented 1 year ago

I'd not invest more time into python code if our CPP quantizer can do it better.

I agree, maybe we can add an info message to the script that the output should be post-processed, though. Because the usual expectation is probably that the 16/32-bit GGML file produced by the Python conversion script is "good to go".

Good idea. Do you know about the token scoring? I did not dig into that part yet, because if it is not relevant I'd remove the warning from falcon_quantize.

No, sorry, I can't say anything about token scoring.

berkayaltug commented 1 year ago

I want to run it on CPU but I get an error like this. What could be the issue? (screenshot: Screenshot_1)

cmp-nct commented 1 year ago

Just like the message says: with the old weights you need to add https://huggingface.co/ehartford/WizardLM-Uncensored-Falcon-7b/blob/main/tokenizer.json (or the correct one for the model you use) into the directory with the model. TheBloke will make new weights which won't need that. Weights called "GGML" need it; "GGCC" weights won't need the file.
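
For example (a hedged sketch; this is the standard Hugging Face download URL for the file linked above, and the model directory is a placeholder):

wget https://huggingface.co/ehartford/WizardLM-Uncensored-Falcon-7b/resolve/main/tokenizer.json -P /path/to/model-dir/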

cmp-nct commented 1 year ago

In addition: please update your version to the latest commit. For CPU inference we've had a bug in the code in the past 10 hours; git pull and cmake/make again.

berkayaltug commented 1 year ago

Worked smoothly with the new model from https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML.

Thanks @cmp-nct & @TheBloke

maddes8cht commented 1 year ago

I didn't find this thread at first, so I started a new discussion #80 about converting. I am still not able to convert an original model myself. @berkayaltug also seems to have just taken the finished model from @TheBloke in the end.

The question remains how @TheBloke has actually managed the conversion. The error message seems to say that tokenizer.json is not searched for in the SAME directory as the file to be converted, but in a subdirectory with the same name as that file (ggml-model-falcon-40b-instruct-f32.bin/tokenizer.json).

But I cannot create such a directory (with the same name as an existing file). There seems to be no way with the existing falcon_quantize (at least under Windows) to convert a model yourself. Can someone confirm this, or describe a concrete procedure for quantizing an original Falcon model under Windows? I have the impression that it is not really possible at the moment.

maddes8cht commented 1 year ago

I would really like to be able to convert Falcon models myself, as there are some fine-tuned Falcon models around that no one has converted to GGML for use with this project's falcon_main, such as samantha-falcon-7b (https://huggingface.co/ehartford/samantha-falcon-7b/tree/main) or gorilla-falcon-7b (https://huggingface.co/gorilla-llm/gorilla-falcon-7b-hf-v0/tree/main). @TheBloke seems to have lost interest in Falcon and only seems to provide Llama models. So I need to be able to do it myself. If I succeed, I may provide them on Hugging Face.

TheBloke commented 1 year ago

Here's how my scripts do the FP16 conversion:

ggml_fp16_file = 'ggml-model-source-f32.bin'
ggml_convert_script_name = 'falcon_convert.py'
ggml_quantize_command = 'build/bin/falcon_quantize'

self.fp16_command = f"cp {self.model_input_dir}/tokenizer.json {self.output_dir} && python3 {self.ggml_base}/{self.ggml_convert_script_name} {self.model_input_dir} {self.output_dir} 1 && mv {self.output_dir}/{self.ggml_fp16_file} {self.fp16}"

So yes, I copy the tokenizer.json from the source model into the intended output directory prior to running the convert. It's a bit weird, but the above works. John said he didn't have much interest in improving convert.py, but I'm sure he'd accept a PR that made it work more sensibly. Another issue is that it always generates in float32 - the parameter for setting it to float16 is broken. Or at least it was last time I checked; I've not reviewed it recently.
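
Unrolled, the command string above amounts to roughly this shell sequence (directory and file names are placeholders standing in for the script's attributes):

cp /path/to/hf-model/tokenizer.json /path/to/output/
python3 /path/to/ggllm.cpp/falcon_convert.py /path/to/hf-model /path/to/output 1
mv /path/to/output/ggml-model-source-f32.bin /path/to/output/falcon-40b.ggmlv3.fp16.bin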

It's not that I've lost interest in Falcon, but yes Llama 2 is much more interesting and will always have priority if I have to choose. Falcon is only usable as GGML, and although the GGMLs are great they aren't supported by any UIs or Python libraries (I did ask marella to look into ctransformers support, but I don't think he's been able to look at it yet). So they are not going to be usable for nearly as many people.

The one thing that made Falcon really interesting was the commercial support. But with Llama 2, that's no longer the case. Though I can see that Falcon 40 still has a place given the current lack of Llama 2 34B.

If I've missed any notable Falcon 40B models then create a request for them on my Discord #model-requests channel and I'll try to get GGMLs done. I've given up bothering about Falcon GPTQs as they're so slow and there's no sign that's ever going to be fixed in AutoGPTQ. (Although now I say that, I just remembered that Text Generation Inference supports them - maybe that's quicker? I'm not sure.)

maddes8cht commented 1 year ago

Hello @TheBloke, thanks for the answer. I will be able to test again in a few hours, but it seems that somehow on Windows it doesn't work with tokenizer.json in the same directory as the file. In my opinion, Falcon is still the best truly free base model available, with Llama 2 only being "free as in beer". I think it should get more attention; it's not on the radar of a lot of people. Here is the problem: there is hardly any other software to run Falcon because there is not much going on (for example, not so many fine-tuned models), so there are not as many people interested in providing fine-tuned models or converting the available ones.

For me, this is also about building up attention.

jploski commented 1 year ago

The way I do it for 7B (haven't checked for 40B, but the same approach should work; a concrete instance follows the steps below):

  1. Copy .py and .json from the original 7b/40b model's distribution to the directory containing pytorch* of the model that I wish to convert.
  2. falcon_convert.py path/to/directory/with/model path/to/directory/with/model 1
  3. bin/falcon_quantize path/to/file.ggml path/to/quantized_file.ggml 9
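
A concrete (hypothetical) instance of those steps for one of the fine-tuned 7B models mentioned above; paths and the output file name are placeholders, and the trailing number is the quantization type from step 3:

cp falcon-7b-original/*.py falcon-7b-original/*.json samantha-falcon-7b/
python3 falcon_convert.py samantha-falcon-7b samantha-falcon-7b 1
bin/falcon_quantize samantha-falcon-7b/ggml-model-f32.bin samantha-falcon-7b-quantized.ggml 9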

I also agree that the Llama 2 license is not really free. Meta may terminate it if they decide you breached its terms, which among other things include a prohibition against "improving another language model". Further, you must agree to indemnify (defend in court) Meta against any sort of copyright problems Meta created in training the original model.

The trouble is that Meta has successfully captured the academic community and (as evidenced by comments above) made people believe they are the good guys of LLMs. They are not. They want to improve their models (and only their models) for free while retaining control over how they are used.

maddes8cht commented 1 year ago

In fact, I have found a bug in the code that causes falcon_quantize to actually look for the tokenizer.json file in a subdirectory with the same name as the file to convert. This happens when you run it from the directory where the file to convert is located; under Windows it is not possible to have a directory with the same name as a file in the same parent directory. I have fixed the bug locally and will send a PR after I have tested some different call options.