ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Shape Error When Running Inference after Converting OpenLlama 3B to GGML #1709

Closed Emm9625 closed 1 year ago

Emm9625 commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

Model loads successfully and inference can be run.

Current Behavior

Model fails to load with shape error.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

$ lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              8
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 5315Y CPU @ 3.20GHz
Stepping:                        6
CPU MHz:                         3200.011
BogoMIPS:                        6400.05
Hypervisor vendor:               Xen
Virtualization type:             full
L1d cache:                       384 KiB
L1i cache:                       256 KiB
L2 cache:                        10 MiB
L3 cache:                        96 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdt
                                 scp lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand h
                                 ypervisor lahf_lm abm 3dnowprefetch cpuid_fault pti ibpb fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_n
                                 i xsaveopt xsavec xgetbv1 xsaves umip rdpid

$ uname -a

Linux nwlujxf2ho 5.4.0-122-generic #138~18.04.1-Ubuntu SMP Fri Jun 24 14:14:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Python 3.9.16

 GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Conversion is run with: python convert.py ../open_llama_3b_600bt_preview/
  2. Inference: ./main -m ../open_llama_3b_600bt_preview/ggml-model-f16.bin

Failure Logs

CONVERSION:

python convert.py ../open_llama_3b_600bt_preview/
Loading model file ../open_llama_3b_600bt_preview/pytorch_model.bin
Loading vocab file ../open_llama_3b_600bt_preview/tokenizer.model
Writing vocab...
INFERENCE:

root@nwlujxf2ho:/notebooks/llama.cpp# ./main -m ../open_llama_3b_600bt_preview/ggml-model-f16.bin
main: build = 1 (f4c55d3)
main: seed  = 1686014332
llama.cpp: loading model from ../open_llama_3b_600bt_preview/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 3200
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 25
llama_model_load_internal: n_layer    = 26
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 8704
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size =    0.06 MB
error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected  3200 x  8704, got  3200 x  8640
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../open_llama_3b_600bt_preview/ggml-model-f16.bin'
main: error: unable to load model
ThioJoe commented 1 year ago

Also seeing this error after trying to use the latest WizardLM-7B-uncensored.ggml.q8_0.bin

Actually, I realize I was loading the wrong model, which was using the old format. I downloaded one in "ggjt v3" format and the error went away, though I'm now getting a different error: https://github.com/ggerganov/llama.cpp/issues/1732
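
In case it helps anyone else hitting this: a quick way to see which container format a .bin file uses is to look at its first eight bytes. This is just a sketch, not an official tool; the magic values are the ones llama.cpp's loader recognized around this time, and I'm assuming the usual layout of a 4-byte magic followed by a 4-byte version for the versioned formats.

import struct, sys

# Magic values used by llama.cpp's loader (ggml = old unversioned format,
# ggmf/ggjt = versioned formats; "ggjt v3" is the one current builds expect).
MAGICS = {
    0x67676d6c: "ggml (old, unversioned)",
    0x67676d66: "ggmf (versioned)",
    0x67676a74: "ggjt (versioned, mmap-able)",
}

with open(sys.argv[1], "rb") as f:
    magic, version = struct.unpack("<II", f.read(8))

print(f"magic: {magic:#x} ({MAGICS.get(magic, 'unknown')})")
if magic != 0x67676d6c:  # the old 'ggml' format has no version field
    print(f"version: {version}")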

BrickBee commented 1 year ago

error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected 3200 x 8704, got 3200 x 8640

Same for me. It is also broken in the original commit (ffb06a345e3a9e30d39aaa5b46a23201a74be6de), tested with the 600bt version.

The error can be fixed by applying the hack from #1588; quantized models then work fine as well. It looks like neither the original hack nor a suitable replacement was merged with that PR, @SlyEcho.
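
(Not the actual diff from #1588, just a sketch of the idea as I understand it: llama.cpp re-derives n_ff from n_embd and n_mult in the header, so one way to make the shape check pass is to write an n_mult for which that formula lands on the model's real feed-forward size of 8640.)

# Hypothetical convert-time helper in the spirit of the hack -- not the actual
# patch from #1588. It searches for an n_mult that makes llama.cpp's n_ff
# formula reproduce the model's real feed-forward size.
def find_n_mult(n_ff: int, n_embd: int) -> int:
    for n_mult in range(256, 1, -1):
        if ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult == n_ff:
            return n_mult
    raise ValueError(f"no n_mult reproduces n_ff={n_ff} for n_embd={n_embd}")

print(find_n_mult(8640, 3200))  # 240 -- with this in the header the loader
                                # derives n_ff = 8640 instead of 8704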

Something else has broken since then, though: quantized models output garbage with the current version (92f20d9942c86daeb78637bdad7296a572f4da28). The converted fp16 model still works fine (with the hack).

SlyEcho commented 1 year ago

@BrickBee, which quantization format is broken for you? I can confirm that 3B Q4_0 and Q5_1 are working with the current master, build = 701 (4f9c43e).

I have the files up at https://huggingface.co/SlyEcho/open_llama_3b_ggml, and if you want to build them yourself, the Makefile and diff file there can produce all the models and checksums from scratch.

BrickBee commented 1 year ago

I can confirm that the quantized files you linked work fine with the release version you linked. My own quantized versions, created at the time of the PR, also still work correctly with the current version. Yet when I use the current version to convert (with the patch) and quantize the source model again, the quantized result outputs garbage. The resulting files also differ in size:

Yours: open-llama-3b-q4_0.bin: 1,928,446,208 bytes
Mine: open_llama_3b_q4_0.ggml: 1,954,846,208 bytes

SlyEcho commented 1 year ago

OK, I can trace it back to PR #1807, which for some reason quantizes a single tensor with Q6_K regardless of the user's chosen format, making those models broken when k-quants are not compiled in (they are optional) or otherwise not supported.
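
That would also line up with the file-size gap reported above, assuming output.weight (32,000 × 3,200) ended up as Q6_K in the larger file and Q4_0 in the smaller one; back-of-the-envelope:

# Rough size check (my numbers): Q4_0 stores 32 weights in an 18-byte block,
# Q6_K stores 256 weights in a 210-byte super-block.
n = 32000 * 3200                      # elements in output.weight: 102,400,000

q4_0_bytes = n // 32 * 18             # 57,600,000
q6_k_bytes = n // 256 * 210           # 84,000,000

print(q6_k_bytes - q4_0_bytes)        # 26,400,000
print(1_954_846_208 - 1_928_446_208)  # 26,400,000 -- exactly the gap between the two files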

This was actually reverted temporarily in #1711, but added back in.

What was the thinking behind this change, @ikawrakow?

ikawrakow commented 1 year ago

What was the thinking behind this change, @ikawrakow?

Clearly, there wasn't enough thinking here ;-)

More seriously, the decision to bring it back was based on a discussion with @ggerganov: we should use the more accurate Q6_K quantization for the output weights once k-quants are implemented for all ggml-supported architectures (CPU, GPU via CUDA and OpenCL, and Metal for the Apple GPU). Using Q6_K for output.weight does improve generation quality at a nearly negligible increase in model size. What we missed in the decision-making process is that in the meantime models other than Meta's LLaMA are being used, with tensor sizes that are not a multiple of the k-quants super-block size of 256. This is now taken care of by the last missing check in PR #1932, so llama.cpp can be built without the hassle of explicitly disabling k-quants at compile time.
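
For concreteness, a quick check against the sizes discussed in this thread (the 7B LLaMA sizes are mine, added for comparison):

QK_K = 256  # k-quants super-block size

for name, dim in [("OpenLLaMA 3B n_embd", 3200), ("OpenLLaMA 3B n_ff", 8640),
                  ("LLaMA 7B n_embd", 4096), ("LLaMA 7B n_ff", 11008)]:
    print(f"{name}: {dim} % {QK_K} = {dim % QK_K}")

# 3200 % 256 = 128 and 8640 % 256 = 192, so the 3B rows do not split into whole
# 256-weight super-blocks, unlike the 7B model where both remainders are 0.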

On that note, I wonder how the OpenLLaMA 3B model is being used. I downloaded the fp16 model from Hugging Face and used the convert.py script to convert it to ggml format. But the model wouldn't load, because the feed-forward network size is mispredicted as 8704 instead of the actual 8640 by this line in llama.cpp:

 const uint32_t n_ff = ((2*(4*hparams.n_embd)/3 + hparams.n_mult - 1)/hparams.n_mult)*hparams.n_mult;
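
Plugging the 3B hyperparameters into that expression (a quick Python mirror of the integer arithmetic) shows where 8704 comes from:

n_embd, n_mult = 3200, 256
n_ff_pred = ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult
print(n_ff_pred)  # 8704: 2*(4*3200)//3 = 8533, rounded up to the next multiple of 256

# The real OpenLLaMA 3B feed-forward size is 8640, which is not a multiple of 256,
# so with n_mult = 256 this rounding can never reach it.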

If I fix this so that I'm able to load the model and run a perplexity calculation, I get wild values in excess of 2000. What am I missing? Is it because the tokenization is different and, if so, how do you use the 3B model? I would like to use it to work on adapting k-quants to model sizes that are not divisible by 256, so any help is appreciated.

BrickBee commented 1 year ago

Conversion and fp16 inference work after applying this diff. That was, by the way, the original point of this issue: the 3B model can't be used with the current code unless a pre-converted version is available (or the code is patched).

SlyEcho commented 1 year ago

On that note, I wonder how the OpenLLaMA 3B model is being used. I downloaded the fp16 model from Hugginface and used the convert.py script to convert to ggml format.

convert.py is still broken and we didn't want to commit the crude hacks. But since the model has a free license, the files are up for download.

Check my HF repo for the converted files and also the full Makefile to run it yourself.