ggerganov / llama.cpp

LLM inference in C/C++
MIT License

"Inference Issue with llama.cpp Using Custom Converted qwen1.5 Weights" #5563

Closed: wanshichenguang closed this issue 3 months ago

wanshichenguang commented 6 months ago

I need to report an issue with running Qwen1.5 inference in llama.cpp.

When using the official GGUF weights provided by Qwen, inference works fine:

(llama.cpp) root@8411db7a5b9f:~/llama.cpp-master# ./main -m /root/model/qwen/Qwen1.5-0.5B-Chat-GGUF/qwen1_5-0_5b-chat-q2_k.gguf -n 512 --color -i -cml -f prompts/chat-with-qwen.txt
Log start
main: build = 0 (unknown)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1708251273
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/model/qwen/Qwen1.5-0.5B-Chat-GGUF/qwen1_5-0_5b-chat-q2_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = qwen2
llama_model_loader: - kv  1: general.name str = Qwen1.5-0.5B-Chat-AWQ-fp16
llama_model_loader: - kv  2: qwen2.block_count u32 = 24
llama_model_loader: - kv  3: qwen2.context_length u32 = 32768
llama_model_loader: - kv  4: qwen2.embedding_length u32 = 1024
llama_model_loader: - kv  5: qwen2.feed_forward_length u32 = 2816
llama_model_loader: - kv  6: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv  7: qwen2.attention.head_count_kv u32 = 16
llama_model_loader: - kv  8: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv  9: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 10: qwen2.use_parallel_residual bool = true
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 13: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 14: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 15: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 16: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.chat_template str = {% for message in messages %}{{'<|im...
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - kv 20: general.file_type u32 = 10
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q2_K: 97 tensors
llama_model_loader: - type q3_K: 72 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 1024
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 2816
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 0.5B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 619.57 M
llm_load_print_meta: model size = 278.92 MiB (3.78 BPW)
llm_load_print_meta: general.name = Qwen1.5-0.5B-Chat-AWQ-fp16
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: CPU buffer size = 278.92 MiB
.........................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 48.00 MiB
llama_new_context_with_model: KV self size = 48.00 MiB, K (f16): 24.00 MiB, V (f16): 24.00 MiB
llama_new_context_with_model: CPU input buffer size = 4.01 MiB
llama_new_context_with_model: CPU compute buffer size = 298.75 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
main: interactive mode on.
Reverse prompt: '<|im_start|>user '
sampling: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 10

== Running in interactive mode. ==

system
You are a helpful assistant.
user

你好啊
我也很高兴能为您提供帮助。您有什么问题或需要解答的,请告诉我,我会尽力为您解答。

你是谁?
我是阿里云开发的一款超大规模语言模型,基于预训练的大量文本数据,可以回答问题、创作文字,也可以表达观点、撰写代码、玩乐等。如果您有任何问题或需要帮助,请随时告诉我,我会尽力提供解答。

好的
好的,我将尽我最大的能力为您解答。您有什么需要帮助的,请随时告诉我!

However, when I try to convert the model myself using convert.py or convert-hf-to-gguf.py:

python convert.py /root/model/qwen/Qwen15-05B-Chat/ --vocab-type bpe --pad-vocab
./quantize /root/model/qwen/Qwen15-05B-Chat/ggml-model-f16.gguf /root/model/qwen/Qwen15-05B-Chat/ggml-model-Q4_K_M.gguf Q4_K_M

The inference does not work properly:

(llama.cpp) root@8411db7a5b9f:~/llama.cpp-master# ./main -m /root/model/qwen/Qwen15-05B-Chat/ggml-model-Q4_K_M.gguf -n 512 --color -i -cml -f prompts/chat-with-qwen.txt
Log start
main: build = 0 (unknown)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1708251318
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/model/qwen/Qwen15-05B-Chat/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = llama
llama_model_loader: - kv  1: general.name str = qwen
llama_model_loader: - kv  2: llama.context_length u32 = 32768
llama_model_loader: - kv  3: llama.embedding_length u32 = 1024
llama_model_loader: - kv  4: llama.block_count u32 = 24
llama_model_loader: - kv  5: llama.feed_forward_length u32 = 2816
llama_model_loader: - kv  6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv  7: llama.attention.head_count u32 = 16
llama_model_loader: - kv  8: llama.attention.head_count_kv u32 = 16
llama_model_loader: - kv  9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,151936] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{{'<|im...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q4_K: 145 tensors
llama_model_loader: - type q6_K: 25 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 1024
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 2816
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 619.57 M
llm_load_print_meta: model size = 382.62 MiB (5.18 BPW)
llm_load_print_meta: general.name = qwen
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: CPU buffer size = 382.62 MiB
.................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 48.00 MiB
llama_new_context_with_model: KV self size = 48.00 MiB, K (f16): 24.00 MiB, V (f16): 24.00 MiB
llama_new_context_with_model: CPU input buffer size = 4.01 MiB
llama_new_context_with_model: CPU compute buffer size = 298.75 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
main: interactive mode on.
Reverse prompt: '<|im_start|>user '
sampling: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 10

== Running in interactive mode. ==

system
You are a helpful assistant.

你好啊 pwd事业单位事业单位.alibaba.hadoop构造包一解知.mybatis 如('=''}} 任职满意度族自治州任职监事eurs职位:length事业单位了解事业单位口事业单位事业单位.mapper事业单位事业单位不能�.squareup.alibaba.runtime.aws口 不知.linspace事业单位事业单位事业单位.oauth急剧选择.hadoop.custom包事业单位单项知.aws问题事业单位.jsoup.radians�人权依据.Cascade��.animation.flink详细事业单位事业单位知.poi.mybatis不知知无法.closePath任职�avl了解.fillRect.recyclerview口.flink.alibaba.alibaba口.getObject.login事业单位对你.hadoopuch.linspace.hadoop旅游知.alibaba.hadoopメント一.poi.fetchone急剧导导可�.jdbc事业单位.mybatis事业单位事业单位.jsoup知事业单位事业单位事业单位.mybatis.GL多少.alibaba不知事业单位不能..hadoop选择.auto

你是谁? .mybatis事业单位知.poi� �.mybatis事业单位.一软'}} 依据不.radians�.hadoop了解.mybatis事业单位事业单位知.alibaba口人才.layout事业单位事业单位了解事业单位 如.object导.linspace�事业单位多少国有企业由于知事业单位知事业单位包 stoiavl?"; (prompt.oauth无法完全�上传了解大型多人旅游.component事业单位,ep事业单位.aws事业单位知事业单位enie.alibaba.runtime事业单位事业单位.NoSuch知.annotations.recyclerview.不能一年一度肘xls对我们.customités 不床垫该KHTML.getObject.oauth事业单位知.jsoup了解事业单位.mybatis.GL'}} .mybatis.poiissance导了解.alibaba�有可能知.jsoup解执行分析.closePath由于.alibaba人权.aws族自治县.alibaba回答知.fillRect� Interpreter.hadoop�软족.Cascade融化.login事业单位事业单位.runtime事业单位事业单位国有企业pwd.mybatis知.radians.mybatis事业单位知知.findAll了解.flink了解知事业单位.getObject知.animation知知.uint知.closePath旅游了解.hadoop'}} 一有.removeEventListener对我们.jsoupuch.hadoop多少导.alibaba人权.runtime.custom..linspace知聊xlsnioarchical项目知.annotations了解;',一了解事业单位知.mybatis事业单位事业单位.closePath� �족事业单位.poi:length事业单位知口原选择 erotik.kafka事业单位.poi不.getObject事业单位.flink月,tp知.oauth�说明事业单位.alibaba不能事业单位知.runtime.pol.getObject.auto旅游='')事业单位事业单位'}} 问题朗 Vuex生命번.mybatis.poi了解��メント
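
One visible difference between the two runs is in the loader metadata above: the official GGUF reports general.architecture = qwen2 and EOS token 151645, while my self-converted file reports general.architecture = llama and EOS token 151643. For reference, the architecture key can be read directly with the gguf-py reader shipped in this repo; this is only a rough sketch, assuming the gguf Python package is installed, with the file paths from this issue:

```python
# Rough sketch: print the general.architecture key of a GGUF file.
# Assumes the gguf-py package from this repository is installed (e.g. pip install gguf).
from gguf import GGUFReader

def architecture(path: str) -> str:
    reader = GGUFReader(path)
    field = reader.fields["general.architecture"]
    # For a string field, field.data[0] indexes the part holding the raw UTF-8 bytes.
    return bytes(field.parts[field.data[0]]).decode("utf-8")

# Official file prints "qwen2"; the self-converted file prints "llama".
print(architecture("/root/model/qwen/Qwen1.5-0.5B-Chat-GGUF/qwen1_5-0_5b-chat-q2_k.gguf"))
print(architecture("/root/model/qwen/Qwen15-05B-Chat/ggml-model-Q4_K_M.gguf"))
```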

Here are my system and compiler versions:

wanshichenguang commented 6 months ago

(llama.cpp) root@8411db7a5b9f:~/llama.cpp-master# make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

(llama.cpp) root@8411db7a5b9f:~/llama.cpp-master# ldd --version
ldd (Ubuntu GLIBC 2.35-0ubuntu3.6) 2.35
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

(llama.cpp) root@8411db7a5b9f:~/llama.cpp-master# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.4.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

sorasoras commented 6 months ago

Qwen1.5 inference on current llama.cpp master suffers from incoherent output when the model is converted to GGUF. It works fine in PyTorch.
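
The PyTorch check referred to here might look roughly like the following; this is only a sketch, and the model id, dtype handling, and generation settings are illustrative rather than taken from this thread:

```python
# Sketch: sanity-check the original HF weights in PyTorch/transformers,
# independent of any GGUF conversion. Model id and settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-0.5B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "你好啊"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens; coherent text here suggests the HF weights are fine.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```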

anaivebird commented 6 months ago

Qwen1.5 inference on current llama.cpp master suffers from incoherent output when the model is converted to GGUF. It works fine in PyTorch.

Hello, is there any solution for converting Qwen/Qwen1.5-1.8B-Chat to GGUF?

Both python convert.py qwen_merged --vocab-type bpe --pad-vocab and python convert-hf-to-gguf.py ../Qwen1.5-1.8B-Chat/ failed to convert it.

ami-navon commented 5 months ago

+1

tiger-of-shawn commented 5 months ago

+1

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.