Closed Emm9625 closed 1 year ago
Also seeing this error after trying to use the latest WizardLM-7B-uncensored.ggml.q8_0.bin
Actually, I realized I was loading the wrong model, which was using the old format. I downloaded one in "ggjt v3" format and the error went away. Though I'm getting a different error now, which is this one: https://github.com/ggerganov/llama.cpp/issues/1732
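As an aside, one way to tell an old-format file from a "ggjt" one is to inspect the leading 4-byte magic. This is a rough sketch, not part of the thread; the magic constants are my assumption based on the ASCII tags `ggml`/`ggjt` used by llama.cpp file headers at the time:

```python
import struct

# Hypothetical helper: map a model file's leading uint32 magic to a
# format name. The constants are assumed, not quoted from this thread.
KNOWN_MAGICS = {
    0x67676D6C: "ggml (old, unversioned)",  # ASCII 'ggml' as a big-endian number
    0x67676A74: "ggjt (versioned)",          # ASCII 'ggjt'
}

def read_magic(path: str) -> str:
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
    return KNOWN_MAGICS.get(magic, f"unknown (0x{magic:08x})")
```

A file reporting the old unversioned format would need to be re-converted before it can load in newer builds.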
error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected 3200 x 8704, got 3200 x 8640
Same for me. It is also broken in the original commit (ffb06a345e3a9e30d39aaa5b46a23201a74be6de), tested with the 600bt version.
The error can be fixed by applying the hack in #1588; quantized models will then work fine as well. I don't see either the original hack or a suitable replacement having been merged with the original PR, @SlyEcho.
Something else broke since it was added though, as quantized models will output garbage in the current version (92f20d9942c86daeb78637bdad7296a572f4da28). The converted fp16 model still works fine (with the hack).
@BrickBee, which quantization format is broken for you? I can confirm that 3B Q4_0 and Q5_1 are working with the current master build = 701 (4f9c43e)
I have the files up on https://huggingface.co/SlyEcho/open_llama_3b_ggml and if you want to create them yourself, the Makefile and diff file can create all the models and checksums from scratch.
I can confirm that the quantized files that you've linked work fine with the release version that you have linked. My quantized versions that I created at the time of the PR also still work correctly with the current version. Yet when I use the current version to convert (using the patch) and quantize the source model again, the quantized version outputs garbage. The resulting files also differ in size:
Yours: open-llama-3b-q4_0.bin: 1,928,446,208 bytes
Mine: open_llama_3b_q4_0.ggml: 1,954,846,208 bytes
OK, I can trace it back to PR #1807, which for some reason quantizes a single tensor using Q6_K regardless of the user's chosen format, breaking those models when k-quants are not compiled in (they are optional) or not supported.
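The behavior described amounts to overriding the requested type for one tensor. A minimal sketch of that logic, purely illustrative and not llama.cpp's actual code (function and parameter names are hypothetical):

```python
def pick_quant_type(tensor_name: str, requested: str, have_k_quants: bool) -> str:
    # Illustration of the reported behavior: the output tensor is forced
    # to Q6_K regardless of the user's requested format, which fails when
    # k-quants support was not compiled in.
    if tensor_name == "output.weight":
        if not have_k_quants:
            raise RuntimeError("Q6_K selected but k-quants not compiled in")
        return "Q6_K"
    return requested
```

For example, `pick_quant_type("output.weight", "Q4_0", True)` yields `"Q6_K"` even though the user asked for Q4_0, while every other tensor keeps the requested type.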
This was actually reverted temporarily in #1711, but added back in.
What was the thinking behind this change, @ikawrakow?
> What was the thinking behind this change, @ikawrakow?
Clearly, there wasn't enough thinking here ;-)
More seriously, the decision to bring it back was based on a discussion with @ggerganov that we should use the more accurate `Q6_K` quantization for the output weights once k-quants are implemented for all ggml-supported architectures (CPU, GPU via CUDA and OpenCL, and Metal for the Apple GPU). Using `Q6_K` for `output.weight` does improve generation quality at a nearly negligible increase in model size. What we missed in the decision-making process is that in the meantime there are models other than Meta LLaMA being used, which have tensor sizes that are not a multiple of the k-quants super-block size of 256. This is now taken care of with the last missing check in PR #1932, so `llama.cpp` can be built without the hassle of explicitly disabling k-quants at compile time.
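The 256-divisibility constraint can be checked directly. A small sketch (the constant name mirrors k-quants' super-block size; 11008 as LLaMA 7B's feed-forward size is my assumption, not stated in this thread, while 8640 is OpenLLaMA 3B's size from the error above):

```python
QK_K = 256  # k-quants super-block size

def k_quants_ok(n: int) -> bool:
    """A row can be k-quantized only if its size is a multiple of the super-block."""
    return n % QK_K == 0

print(k_quants_ok(11008))  # True:  LLaMA 7B feed-forward size (assumed)
print(k_quants_ok(8640))   # False: OpenLLaMA 3B feed-forward size
```

This is why the 3B model trips over the forced Q6_K tensor while the original LLaMA sizes never did.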
On that note, I wonder how the OpenLLaMA 3B model is being used. I downloaded the fp16 model from Hugging Face and used the `convert.py` script to convert it to `ggml` format. But the model wouldn't load, because the feed-forward network size is mispredicted as 8704 instead of the actual size of 8640 by this line in `llama.cpp`:

```cpp
const uint32_t n_ff = ((2*(4*hparams.n_embd)/3 + hparams.n_mult - 1)/hparams.n_mult)*hparams.n_mult;
```
If I fix this so I'm able to load the model and run a perplexity calculation, I get wild values in excess of 2000. What am I missing? Is it because the tokenization is different and, if so, how do you use the 3B model? I would like to use it to work on adapting k-quants to model sizes that are not divisible by 256, so any help is appreciated.
Conversion and fp16 inference work after applying this diff. That was, by the way, the original point of this issue: the 3B model can't be used with the current code unless a pre-converted version is available or the code is patched.
> On that note, I wonder how the OpenLLaMA 3B model is being used. I downloaded the fp16 model from Hugging Face and used the `convert.py` script to convert to `ggml` format.
`convert.py` is still broken and we didn't want to commit the crude hacks. But since the model has a free license, the files are up for download.
Check my HF repo for the converted files and also the full Makefile to run it yourself.
Prerequisites
Expected Behavior
Model loads successfully and inference can be run.
Current Behavior
Model fails to load with shape error.
Environment and Context
$ lscpu
$ uname -a
Steps to Reproduce
Failure Logs