ggerganov / llama.cpp

LLM inference in C/C++
MIT License

[CANN] Bug: CANN run error on OrangePi AI Pro #9423

Closed · StudyingLover closed this 2 months ago

StudyingLover commented 2 months ago

What happened?

Running `llama-server` (build 3726) with the CANN backend on an OrangePi AI Pro crashes while loading `glm-4-9b-chat.Q8_0.gguf`. The GGUF metadata loads fine, but the server then aborts in `ggml_cann_init` with CANN error EE1001 (`rtGetDevMsg execute failed, reason=[context pointer null]`) on the call to `aclrtMemGetAllocationGranularity`. The full log is in the "Relevant log output" section below.
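For anyone triaging this: the abort comes from the first VMM-related runtime query in `ggml_cann_init` (ggml/src/ggml-cann.cpp:182). The following standalone program is a minimal sketch of that call, using the same `aclrtPhysicalMemProp` setup that `ggml-cann.cpp` uses; the `main`, error handling, and device index 0 are illustrative additions, not taken from llama.cpp. Running it on the 310B should show whether the API itself is supported, independent of llama.cpp:

```cpp
// Minimal standalone repro sketch of the failing call in ggml_cann_init.
// Build against the CANN toolkit headers and link with -lascendcl.
// Note: we call aclrtSetDevice(0) first so the test isolates whether the
// granularity API itself is supported on this SoC.
#include <acl/acl.h>
#include <cstdio>

int main() {
    if (aclInit(nullptr) != ACL_SUCCESS) {
        fprintf(stderr, "aclInit failed\n");
        return 1;
    }
    if (aclrtSetDevice(0) != ACL_SUCCESS) {
        fprintf(stderr, "aclrtSetDevice(0) failed\n");
        return 1;
    }

    // Same property setup as ggml-cann.cpp uses before the granularity query.
    aclrtPhysicalMemProp prop = {};
    prop.handleType     = ACL_MEM_HANDLE_TYPE_NONE;
    prop.allocationType = ACL_MEM_ALLOCATION_TYPE_PINNED;
    prop.memAttr        = ACL_HBM_MEM_HUGE;
    prop.location.type  = ACL_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id    = 0;
    prop.reserve        = 0;

    size_t granularity = 0;
    aclError err = aclrtMemGetAllocationGranularity(
        &prop, ACL_RT_MEM_ALLOC_GRANULARITY_RECOMMENDED, &granularity);
    printf("aclrtMemGetAllocationGranularity -> %d (granularity = %zu)\n",
           (int) err, granularity);

    aclFinalize();
    return err == ACL_SUCCESS ? 0 : 1;
}
```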

Name and Version

version: 3726 (b34e0234) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

INFO [                    main] build info | tid="255085751848992" timestamp=1726024154 build=3726 commit="b34e0234"
INFO [                    main] system info | tid="255085751848992" timestamp=1726024154 n_threads=4 n_threads_batch=4 total_threads=4 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
INFO [                    main] HTTP server is listening | tid="255085751848992" timestamp=1726024154 n_threads_http="3" port="8000" hostname="127.0.0.1"
INFO [                    main] loading model | tid="255085751848992" timestamp=1726024154 n_threads_http="3" port="8000" hostname="127.0.0.1"
llama_model_loader: loaded meta data with 24 key-value pairs and 283 tensors from /root/model/glm-4-9b-chat.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = chatglm
llama_model_loader: - kv   1:                               general.name str              = glm-4-9b-chat
llama_model_loader: - kv   2:                     chatglm.context_length u32              = 131072
llama_model_loader: - kv   3:                   chatglm.embedding_length u32              = 4096
llama_model_loader: - kv   4:                chatglm.feed_forward_length u32              = 13696
llama_model_loader: - kv   5:                        chatglm.block_count u32              = 40
llama_model_loader: - kv   6:               chatglm.attention.head_count u32              = 32
llama_model_loader: - kv   7:            chatglm.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:   chatglm.attention.layer_norm_rms_epsilon f32              = 0.000000
llama_model_loader: - kv   9:                          general.file_type u32              = 7
llama_model_loader: - kv  10:               chatglm.rope.dimension_count u32              = 64
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = chatglm-bpe
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,151073]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151329
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  20:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = ChatGLM4
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q8_0:  162 tensors
llm_load_vocab: special tokens cache size = 223
llm_load_vocab: token to piece cache size = 0.9732 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = chatglm
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151552
llm_load_print_meta: n_merges         = 151073
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.6e-07
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 9.40 B
llm_load_print_meta: model size       = 9.30 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = glm-4-9b-chat
llm_load_print_meta: BOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151329 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 151336 '<|user|>'
llm_load_print_meta: max token length = 1024
CANN error: EE1001: [PID: 3156] 2024-09-11-03:09:15.204.889 The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        Get Allocation Granularity failed, runtime result = 207000[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:5244]
        The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]

  current device: 0, in function ggml_cann_init at /root/llama.cpp/ggml/src/ggml-cann.cpp:182
  aclrtMemGetAllocationGranularity( &prop, ACL_RT_MEM_ALLOC_GRANULARITY_RECOMMENDED, &info.devices[id].vmm_granularity)
/root/llama.cpp/ggml/src/ggml-cann.cpp:123: CANN error
/root/llama.cpp/build/ggml/src/libggml.so(+0x40464)[0xe7ffce710464]
/root/llama.cpp/build/ggml/src/libggml.so(ggml_abort+0x140)[0xe7ffce711630]
/root/llama.cpp/build/ggml/src/libggml.so(+0xc026c)[0xe7ffce79026c]
/root/llama.cpp/build/ggml/src/libggml.so(_Z14ggml_cann_infov+0x160)[0xe7ffce790fb0]
/root/llama.cpp/build/ggml/src/libggml.so(ggml_backend_cann_get_device_count+0xc)[0xe7ffce7914fc]
/root/llama.cpp/build/src/libllama.so(+0x77d44)[0xe7ffce917d44]
/root/llama.cpp/build/src/libllama.so(llama_load_model_from_file+0xe00)[0xe7ffce9599f0]
./build/bin/llama-server(+0xa7f78)[0xaaaad07e7f78]
./build/bin/llama-server(+0x6f298)[0xaaaad07af298]
./build/bin/llama-server(+0x18d44)[0xaaaad0758d44]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xe7ffce2473fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xe7ffce2474cc]
./build/bin/llama-server(+0x1b470)[0xaaaad075b470]
Aborted (core dumped)
hipudding commented 2 months ago

Thank you for this issue. llama.cpp does not currently support the 310B (the NPU used in the OrangePi). We warmly welcome an adaptation for the OrangePi.

Please file this as a feature request instead of a bug report.

StudyingLover commented 2 months ago

Okay, do I need to open a new issue? I couldn't find where to change the label. BTW, I can provide SSH access to an Orange Pi to assist with community development. We really need help with running large-model inference on the Orange Pi AI Pro.

hipudding commented 2 months ago

You can create another issue and close this one. Supporting the 310B chip involves a fair amount of work, and we currently don't have enough manpower for it. If you're interested, you're welcome to contribute 310B support.
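For anyone who picks this up, one plausible starting point is sketched below. It is a sketch only: `cann_soc_supports_vmm` is a hypothetical helper, and the `"Ascend310"` prefix check is an assumption about SoC naming, not verified against a 310B. The idea is to detect the SoC in `ggml_cann_init` and skip the VMM granularity query where it is unsupported:

```cpp
// Sketch: guard the VMM granularity query on SoCs that may not support it.
// aclrtGetSocName() is part of the ACL runtime API; the prefix check below
// is an assumption about 310-series naming, not a verified contract.
#include <acl/acl.h>
#include <cstring>

static bool cann_soc_supports_vmm() {
    const char * soc = aclrtGetSocName();   // e.g. "Ascend910B1"
    if (soc == nullptr) {
        return false;                       // unknown SoC: be conservative
    }
    return strncmp(soc, "Ascend310", strlen("Ascend310")) != 0;
}

// In ggml_cann_init, the granularity query could then be made conditional,
// with a hypothetical fallback to a plain aclrtMalloc-based pool:
//
//   if (cann_soc_supports_vmm()) {
//       ACL_CHECK(aclrtMemGetAllocationGranularity(&prop,
//           ACL_RT_MEM_ALLOC_GRANULARITY_RECOMMENDED,
//           &info.devices[id].vmm_granularity));
//   } else {
//       info.devices[id].vmm_granularity = 0; // signal: no VMM on this SoC
//   }
```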

StudyingLover commented 2 months ago

ok thanks