LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

How to load quantized gpt-j based models #171

Closed: shimaowo closed this issue 1 year ago

shimaowo commented 1 year ago


Expected Behavior

Using ggml to convert and quantize a gpt-j model should produce a file that loads properly in koboldcpp.

Current Behavior

koboldcpp crashes on startup with the following output:

Welcome to KoboldCpp - Version 1.21.3
Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas.dll
==========
Loading model: D:\ai\models\ptest\ggml-model-q5_1.bin
[Threads: 11, BlasThreads: 11, SmartContext: False]

---
Identified as GPT-J model: (ver 102)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
gptj_model_load: loading model from 'D:\ai\models\ptest\ggml-model-q5_1.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: ftype   = 1009
GGML_ASSERT: ggml.c:3450: wtype != GGML_TYPE_COUNT

Environment and Context

Windows 10, RTX 3080, 64 GB RAM

Other info

It seems like there may be some kind of mismatch between this repo's copy of the ggml files and the upstream ggml repo. To get a gpt-j model to convert and quantize properly, I had to use the tools from the upstream repo, under the relevant example folder.

It's worth noting that the versions in the llamacpp repo don't support these models either, since not all of the ggml formats have made it over there yet.

koboldcpp has worked correctly on other models I have converted to q5_1 and tried; it failed on two gpt-j models, at which point I stopped trying. The quantized models themselves also work when run with the gpt-j example application from ggml.

LostRuins commented 1 year ago

What do you mean by a mismatch? Are you saying that a model quantized with the official repo does not work with mine?

shimaowo commented 1 year ago

Yes, in some cases. If you take the tip of the ggml repo, build it, and use the gpt-j folder's convert-h5-to-ggml.py followed by gpt-j-quantize (that repo builds a separate quantize executable per model family), the resulting model works with ggml's own gpt-j inference executable, but crashes koboldcpp as detailed above.
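For reference, here is a rough sketch of that workflow in Python, assuming the ggml repo's gpt-j example layout. The directory paths, the argument order of convert-h5-to-ggml.py, and the numeric type code for q5_1 are assumptions for illustration, not details confirmed in this thread:

# Sketch of the convert-then-quantize workflow described above.
# Paths and the q5_1 type code (9) are assumptions; check each tool's usage text.
import subprocess

MODEL_DIR = "D:/ai/models/ptest"   # HF checkpoint directory (assumed)
GGML_REPO = "D:/ai/ggml"           # checkout of the upstream ggml repo (assumed)

# Step 1: convert the HuggingFace checkpoint to a ggml f16 file.
subprocess.run(
    ["python", f"{GGML_REPO}/examples/gpt-j/convert-h5-to-ggml.py", MODEL_DIR, "1"],
    check=True,
)

# Step 2: quantize the f16 file to q5_1 with the gpt-j specific quantizer.
subprocess.run(
    [
        f"{GGML_REPO}/build/bin/gpt-j-quantize",
        f"{MODEL_DIR}/ggml-model-f16.bin",
        f"{MODEL_DIR}/ggml-model-q5_1.bin",
        "9",  # assumed ggml type code for q5_1
    ],
    check=True,
)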

I assumed from the error that there is a source mismatch between this repo's embedded ggml files and the original repo, but I didn't look into it very far.
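One plausible reading of the assert itself (a guess based on the ftype value in the log, not something confirmed here): newer ggml conversion tools fold a quantization-format version into ftype as base + 1000 * version (GGML_QNT_VERSION_FACTOR in ggml.h), so a q5_1 file reports ftype = 1009 rather than 9. A loader that only knows the old base values then resolves the weight type to GGML_TYPE_COUNT, which is exactly the assertion that fires. A minimal decode sketch in Python:

# How a newer loader might split ftype into a quantization version and a base type.
# GGML_QNT_VERSION_FACTOR mirrors the constant in ggml.h; treat this as an
# illustrative assumption, not koboldcpp's actual code.
GGML_QNT_VERSION_FACTOR = 1000

def decode_ftype(ftype: int) -> tuple[int, int]:
    """Return (quantization_version, base_ftype)."""
    return ftype // GGML_QNT_VERSION_FACTOR, ftype % GGML_QNT_VERSION_FACTOR

qntvr, base = decode_ftype(1009)  # the value printed in the crash log
print(qntvr, base)                # 1 9 -> quantization version 1, base ftype 9 (q5_1)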

ggml is fairly problematic in general with this sort of hidden incompatibility (it has multiple convert-h5-to-ggml.py scripts that do different things, for example, and it and llamacpp constantly handle the same inputs differently). So this may be less a bug and more a heads-up that this doesn't work, and that this sort of problem will likely keep cropping up.

shimaowo commented 1 year ago

This may also be because I'm running the tip of ggml against the latest (non-cuda) release exe of koboldcpp, so unreleased updates could be a factor. I'll try with a source build, but it probably won't be for a day or two.

LostRuins commented 1 year ago

Please try again with the latest version of my repo.

shimaowo commented 1 year ago

Awesome, a quick test with 1.23 looks like it works. I'll try a couple of other models later, but they all had the same issue, so I expect this fixed them as well.