Closed. taegeonum closed this issue 1 week ago.
It seems that the problem occurs when constructing the QNN computing graphs. The program might crash due to a lack of memory; it's hard to tell the reason from the message you provided. Could you turn on the 'DEBUG' option in cmake and give us more logs, along with the information of your test device?
@oreomaker Thanks for your help. I'm using an S24U (12 GB RAM). This is the tail of the log. I'm wondering why it consumes so much memory even when running a small model (quantized qwen-1.8b).
Memory Usage: 2069 MB(22435) at: before graph finilize
Memory Usage: 2086 MB(22490) at: after graph finilize
input tensors num:2
output tensors num:3
qnn backend setup tensors
graph name: Prompt_Graph.42
cpu backend
model.layers.10.self_attn.qkv_split reshape:
|| Input outtensor-model.layers.10.self_attn.qkv_merge-00 shape: 1 1024 16 128 (2097152) |
|| Output outtensor-model.layers.10.self_attn.qkv_split-00 shape: 1 64 16 128 (131072) |Output outtensor-model.layers.10.self_attn.qkv_split-01 shape: 1 64 16 128 (131072) |Output outtensor-model.layers.10.self_attn.qkv_split-02 shape: 1 64 16 128 (131072) |Output outtensor-model.layers.10.self_attn.qkv_split-03 shape: 1 64 16 128 (131072) |
model.layers.10.self_attn.q_rope reshape:
|| Input outtensor-model.layers.10.self_attn.qkv_split-00 shape: 1 64 16 128 (131072) |
|| Output outtensor-model.layers.10.self_attn.q_rope-00 shape: 1 64 16 128 (131072) |
model.layers.10.self_attn.k_rope reshape:
|| Input outtensor-model.layers.10.self_attn.qkv_split-01 shape: 1 64 16 128 (131072) |
|| Output outtensor-model.layers.10.self_attn.k_rope-00 shape: 1 64 16 128 (131072) |
model.layers.10.self_attn.k_cache reshape:
|| Input outtensor-model.layers.10.self_attn.k_rope-00 shape: 1 64 16 128 (131072) |
|| Output outtensor-model.layers.10.self_attn.k_cache-00 shape: 1 64 16 128 (131072) |
model.layers.10.self_attn.v_cache reshape:
|| Input outtensor-model.layers.10.self_attn.qkv_split-02 shape: 1 16 128 64 (131072) |
|| Output outtensor-model.layers.10.self_attn.v_cache-00 shape: 1 64 16 128 (131072) |
model.layers.10.self_attn.qk reshape:
|| Input outtensor-model.layers.10.self_attn.q_rope-00 shape: 1 64 16 128 (131072) |Input outtensor-model.layers.10.self_attn.k_cache-00 shape: 1 64 16 128 (131072) |
|| Output outtensor-model.layers.10.self_attn.qk-00 shape: 1 64 16 64 (65536) |
model.layers.10.self_attn.softmax reshape:
|| Input outtensor-model.layers.10.self_attn.qk-00 shape: 1 64 16 64 (65536) |
|| Output outtensor-model.layers.10.self_attn.softmax-00 shape: 1 64 16 64 (65536) |
model.layers.10.self_attn.qkv reshape:
|| Input outtensor-model.layers.10.self_attn.softmax-00 shape: 1 64 16 64 (65536) |Input outtensor-model.layers.10.self_attn.v_cache-00 shape: 1 16 128 64 (131072) |
|| Output outtensor-model.layers.10.self_attn.qkv-00 shape: 1 64 16 128 (131072) |
model.layers.10.self_attn.o_proj.quantize reshape:
|| Input outtensor-model.layers.10.self_attn.qkv-00 shape: 1 64 16 128 (131072) |
|| Output outtensor-model.layers.10.self_attn.o_proj.quantize-00 shape: 1 64 16 128 (131072) |
model.layers.10.self_attn.or_merge reshape:
|| Input outtensor-model.layers.10.self_attn.o_proj.quantize-00 shape: 1 64 16 128 (131072) |Input outtensor-model.layers.10.self_attn.qkv_split-03 shape: 1 64 16 128 (131072) |
|| Output outtensor-model.layers.10.self_attn.or_merge-00 shape: 1 320 16 128 (655360) |
---------QNN alloc
cpu backend - reshape, setup tensors
graph name: Prompt_Graph.43
qnn backend
qnn backend cast
model.layers.10.self_attn.or_split reshape:
|| Input outtensor-model.layers.10.self_attn.or_merge-00 shape: 1 320 16 128 (655360) |
|| Output outtensor-model.layers.10.self_attn.or_split-00 shape: 1 64 16 128 (131072) |Output outtensor-model.layers.10.self_attn.or_split-01 shape: 1 64 16 128 (131072) |
model.layers.10.self_attn.or_split-00_view_ reshape:
|| Input outtensor-model.layers.10.self_attn.or_split-00 shape: 1 64 16 128 (131072) |
|| Output outtensor-model.layers.10.self_attn.or_split-00_view_-00 shape: 1 32 2 2048 (131072) |
model.layers.10.self_attn.or_split-01_view_ reshape:
|| Input outtensor-model.layers.10.self_attn.or_split-01 shape: 1 64 16 128 (131072) |
|| Output outtensor-model.layers.10.self_attn.or_split-01_view_-00 shape: 1 64 1 2048 (131072) |
model.layers.10.self_attn.o_proj reshape:
|| Input outtensor-model.layers.10.self_attn.or_split-00_view_-00 shape: 1 32 2 2048 (131072) |
|| Output outtensor-model.layers.10.self_attn.o_proj-00 shape: 1 32 2 2048 (131072) |
model.layers.10.self_attn.o_proj.dequantize reshape:
|| Input outtensor-model.layers.10.self_attn.o_proj-00 shape: 1 32 2 2048 (131072) |
|| Output outtensor-model.layers.10.self_attn.o_proj.dequantize-00 shape: 1 32 2 2048 (131072) |
model.layers.10.self_attn.o_proj.dequantize-00_view_ reshape:
|| Input outtensor-model.layers.10.self_attn.o_proj.dequantize-00 shape: 1 32 2 2048 (131072) |
|| Output outtensor-model.layers.10.self_attn.o_proj.dequantize-00_view_-00 shape: 1 64 1 2048 (131072) |
model.layers.10.self_attn.o_proj.dequantize-00_view_-00_add_ reshape:
|| Input outtensor-model.layers.10.self_attn.o_proj.dequantize-00_view_-00 shape: 1 64 1 2048 (131072) |Input outtensor-model.layers.10.self_attn.or_split-01_view_-00 shape: 1 64 1 2048 (131072) |
|| Output outtensor-model.layers.10.self_attn.o_proj.dequantize-00_view_-00_add_-00 shape: 1 64 1 2048 (131072) |
model.layers.10.post_attention_layernorm reshape:
|| Input outtensor-model.layers.10.self_attn.o_proj.dequantize-00_view_-00_add_-00 shape: 1 64 1 2048 (131072) |
|| Output outtensor-model.layers.10.post_attention_layernorm-00 shape: 1 64 1 2048 (131072) |
model.layers.10.mlp.up_proj.quantize reshape:
|| Input outtensor-model.layers.10.post_attention_layernorm-00 shape: 1 64 1 2048 (131072) |
|| Output outtensor-model.layers.10.mlp.up_proj.quantize-00 shape: 1 64 1 2048 (131072) |
model.layers.10.mlp.up_proj.quantize-00_view_ reshape:
|| Input outtensor-model.layers.10.mlp.up_proj.quantize-00 shape: 1 64 1 2048 (131072) |
|| Output outtensor-model.layers.10.mlp.up_proj.quantize-00_view_-00 shape: 1 32 2 2048 (131072) |
model.layers.10.mlp.gate_proj reshape:
|| Input outtensor-model.layers.10.mlp.up_proj.quantize-00_view_-00 shape: 1 32 2 2048 (131072) |
|| Output outtensor-model.layers.10.mlp.gate_proj-00 shape: 1 32 2 5504 (352256) |
model.layers.10.mlp.up_proj reshape:
|| Input outtensor-model.layers.10.mlp.up_proj.quantize-00_view_-00 shape: 1 32 2 2048 (131072) |
|| Output outtensor-model.layers.10.mlp.up_proj-00 shape: 1 32 2 5504 (352256) |
model.layers.10.mlp.gate_proj.dequantize reshape:
|| Input outtensor-model.layers.10.mlp.gate_proj-00 shape: 1 32 2 5504 (352256) |
|| Output outtensor-model.layers.10.mlp.gate_proj.dequantize-00 shape: 1 32 2 5504 (352256) |
model.layers.10.mlp.up_proj.dequantize reshape:
|| Input outtensor-model.layers.10.mlp.up_proj-00 shape: 1 32 2 5504 (352256) |
|| Output outtensor-model.layers.10.mlp.up_proj.dequantize-00 shape: 1 32 2 5504 (352256) |
model.layers.10.mlp.silu reshape:
|| Input outtensor-model.layers.10.mlp.gate_proj.dequantize-00 shape: 1 32 2 5504 (352256) |
|| Output outtensor-model.layers.10.mlp.silu-00 shape: 1 32 2 5504 (352256) |
model.layers.10.mlp.silu-00_mul_ reshape:
|| Input outtensor-model.layers.10.mlp.silu-00 shape: 1 32 2 5504 (352256) |Input outtensor-model.layers.10.mlp.up_proj.dequantize-00 shape: 1 32 2 5504 (352256) |
|| Output outtensor-model.layers.10.mlp.silu-00_mul_-00 shape: 1 32 2 5504 (352256) |
model.layers.10.mlp.down_proj.quantize reshape:
|| Input outtensor-model.layers.10.mlp.silu-00_mul_-00 shape: 1 32 2 5504 (352256) |
|| Output outtensor-model.layers.10.mlp.down_proj.quantize-00 shape: 1 32 2 5504 (352256) |
model.layers.10.mlp.down_proj reshape:
|| Input outtensor-model.layers.10.mlp.down_proj.quantize-00 shape: 1 32 2 5504 (352256) |
|| Output outtensor-model.layers.10.mlp.down_proj-00 shape: 1 32 2 2048 (131072) |
model.layers.10.mlp.down_proj.dequantize reshape:
|| Input outtensor-model.layers.10.mlp.down_proj-00 shape: 1 32 2 2048 (131072) |
|| Output outtensor-model.layers.10.mlp.down_proj.dequantize-00 shape: 1 32 2 2048 (131072) |
model.layers.10.mlp.down_proj.dequantize-00_view_ reshape:
|| Input outtensor-model.layers.10.mlp.down_proj.dequantize-00 shape: 1 32 2 2048 (131072) |
|| Output outtensor-model.layers.10.mlp.down_proj.dequantize-00_view_-00 shape: 1 64 1 2048 (131072) |
model.layers.10.mlp.down_proj.dequantize-00_view_-00_add_ reshape:
|| Input outtensor-model.layers.10.mlp.down_proj.dequantize-00_view_-00 shape: 1 64 1 2048 (131072) |Input outtensor-model.layers.10.self_attn.o_proj.dequantize-00_view_-00_add_-00 shape: 1 64 1 2048 (131072) |
|| Output outtensor-model.layers.10.mlp.down_proj.dequantize-00_view_-00_add_-00 shape: 1 64 1 2048 (131072) |
qnn backend reshape
---------QNN alloc
---------QNN alloc
model.layers.10.self_attn.or_split-00_view_ input type:16
model.layers.10.self_attn.or_split-00_view_ output type:16
model.layers.10.self_attn.or_split-00_view_is QNN INT8 op
model.layers.10.self_attn.or_split-01_view_ input type:0
model.layers.10.self_attn.or_split-01_view_ output type:0
model.layers.10.self_attn.o_proj.dequantize-00_view_ input type:0
model.layers.10.self_attn.o_proj.dequantize-00_view_ output type:0
model.layers.10.mlp.up_proj.quantize-00_view_ input type:16
model.layers.10.mlp.up_proj.quantize-00_view_ output type:16
model.layers.10.mlp.up_proj.quantize-00_view_is QNN INT8 op
model.layers.10.mlp.down_proj.dequantize-00_view_ input type:0
model.layers.10.mlp.down_proj.dequantize-00_view_ output type:0
(crash)
It is a segmentation fault on the OPPO Find X7 Ultra (Snapdragon 8 Gen 3, 16 GB DDR) too!
load time: 1474.77 ms token time: nan ms inference speed: nan tokens/s load time: 2678.93 ms token time: nan ms inference speed: nan tokens/s Segmentation fault
Hi,
Thank you for providing the precise debug log. The issue is most likely caused by insufficient memory. We recommend using a device with 16 GB or more to execute seq = 64 prefilling. The large memory footprint is not due to mllm-NPU itself, but to the Qualcomm QNN framework, which performs NPU graph finalization to optimize performance; you can find the corresponding log lines and code around the QNN graph-finalization step. You may try reducing the sequence length to 32 to save memory. I will try to reproduce the bug on a 12 GB smartphone.
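For a rough sense of why halving the sequence helps: if the 64 in the activation shapes in the log above tracks the sequence length, every per-chunk intermediate tensor scales linearly with it, e.g. the 2,097,152-element qkv_merge output at seq = 64 would drop to about 1,048,576 elements at seq = 32, and likewise for every other activation the graph keeps alive during finalization.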
Besides, when the crash occurred, were there any other QNN logs, such as:
[ ERROR ] <E> Failed to map weights buffer to device!
If such logs are present, it would more clearly indicate a memory insufficiency issue.
On the other hand, if no such logs are available, but the device shows a black screen, the adb shell terminal stops responding (frozen), and the device then reboots, that likely also indicates a memory insufficiency issue.
Additionally, does this memory bug only occur when -c = 0? If that's the case, I might need to further investigate the single-chunk prefilling. In my experience, seq = 64 typically requires around 10 GB of memory, so a total of 12 GB is not enough once OS memory usage is accounted for.
It is a segmentation fault on the OPPO Find X7 Ultra (Snapdragon 8 Gen 3, 16 GB DDR) too!
Hi,
Thank you for your bug report. Could you please provide us with a detailed log by doing the following: turn on the 'DEBUG' option in cmake and give us more logs, along with the information of your test device?
I think your bug is not in the graph-building stage but in the prefilling and decoding stage, which is different from @taegeonum's case, since your run has already printed the inference-timing log.
If more detailed logs are available, they would greatly assist us in identifying and resolving the issue. Thank you very much for your willingness to help.
How about this issue?
@liang1232018 Thanks for your support and explanation.
When s=64, this error happens regardless of the c value, and it crashes silently without any exception; the following situation happens, as you mentioned:
On the other hand, if such logs are not available, but the device experiences a black screen, no responses from the adb shell terminal (frozen), and then rebooting the device, it likely also indicates a memory insufficiency issue.
When s=32, the exception is different. In s=32 and c=1, the following exception occurs. It happens in the prefill phase while executing npuExe.run:
[Q] <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant
[A] 5311.7ms [ ERROR ] Number of input elements 65536 does not match number of output elements 0.
5311.7ms [ ERROR ] Op specific validation failed.
0.0ms [ ERROR ] <E> validateNativeOps master op validator model.layers.0.self_attn.ires_split-00_view_:qti.aisw:Reshape failed 3110
0.0ms [ ERROR ] <E> QnnBackend_validateOpConfig failed 3110
0.0ms [ ERROR ] <E> Failed to validate op model.layers.0.self_attn.ires_split-00_view_ with error 0xc26
[ ERROR ] QnnModel::addNode() validating node model.layers.0.self_attn.ires_split-00_view_ failed.
[ ERROR ] qnnModels_[qnnModelIndex_].addNode( QNN_OPCONFIG_VERSION_1, name.c_str(), packageName.c_str(), nodeType.c_str(), paramsPtr, params.size(), inputTensorNames, inputTensorNames.size(), outputTensors.data(), outputTensors.size() ) expected MODEL_NO_ERROR, got MODEL_GRAPH_ERROR
0.0ms [ ERROR ] <E> Cannot destroy HexNN graph as PrepreLib is not loaded
0.0ms [ ERROR ] <E> Failed to destroy hexNNGraphHandle 0xaea28280
0.0ms [ ERROR ] <E> Failed to clean up hexNNGraph in htpGraph 0x1 with error 1000
0.0ms [WARNING] <W> Final cleanup: failed to clean up backend.
0.0ms [WARNING] <W> sg_stubPtr is not null, skip loadRemoteSymbols
In s=32 and c=0, it crashes silently, like the s=64 setting.
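For context on that validator error: a QNN Reshape is only accepted when the input and output hold the same number of elements, and the "number of output elements 0" in the log suggests the view's output dimensions were never populated for this s/c combination. A minimal sketch of the invariant being checked (hypothetical helper, not QNN's actual validator code):

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical check mirroring the validator message above: a Reshape
// must preserve the total element count, and an output of 0 elements
// (e.g. a dimension left at 0) fails op-specific validation.
bool reshapeIsValid(const std::vector<uint32_t>& inDims,
                    const std::vector<uint32_t>& outDims) {
    auto count = [](const std::vector<uint32_t>& dims) {
        return std::accumulate(dims.begin(), dims.end(), uint64_t{1},
                               std::multiplies<uint64_t>());
    };
    return count(outDims) != 0 && count(inDims) == count(outDims);
}
```

With the numbers from the log, the input has 65536 elements while the output reports 0, so this check fails before the node is ever added to the graph.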
Select Snapdragon SoCs support multiple cDSP process domains (PDs), and each process domain supports a virtual address space of 3.75 GB. The qwen-1.5-1.8b-chat-int8.mllm model is 3.4 GB, so there is not much address space left; I guess this may be associated with the crash.
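If both figures are accurate, the headroom per process domain is roughly 3.75 GB - 3.4 GB = 0.35 GB for activations, scratch buffers, and the QNN runtime itself, so exhausting the cDSP address space during graph setup or finalization seems plausible.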
Could you turn on the 'DEBUG' option in cmake and give more logs and the information of your test device?
OK, I created a new issue at https://github.com/UbiquitousLearning/mllm/issues/117; the detailed log is shown there.
Hello, I've executed main_qwen_npu following the guideline. In fact, there were minor bugs, so I've fixed them manually (e.g., a missing adb push ../vocab/qwen_merges.txt ...).
When I ran main_qwen_npu, Android crashed and forcibly rebooted. I've logged where it crashes with memory-usage checkpoints in QNNExecutor.cpp.
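The snippet itself isn't reproduced here; below is a minimal sketch of what such checkpoint logging might look like, assuming Linux/Android /proc is readable from the process. The output format imitates the "Memory Usage: ... at: ..." lines earlier in this thread, but the exact fields printed there are an assumption:

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical checkpoint logger: reads resident (VmRSS) and virtual
// (VmSize) memory from /proc/self/status and prints them with a tag,
// so the last printed tag shows roughly where a crash occurred.
static void logMemoryUsage(const char* tag) {
    long rssKb = 0, vmKb = 0;
    FILE* f = fopen("/proc/self/status", "r");
    if (!f) return;
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0)
            sscanf(line + 6, "%ld", &rssKb);
        else if (strncmp(line, "VmSize:", 7) == 0)
            sscanf(line + 7, "%ld", &vmKb);
    }
    fclose(f);
    printf("Memory Usage: %ld MB(%ld) at: %s\n",
           rssKb / 1024, vmKb / 1024, tag);
}

// e.g. logMemoryUsage("before graph finalize");
```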
When I set -c 0, Android crashes while casting Prompt_Graph.43 and performing qnn_graph->setUpTensors(name). May I ask what the root cause of this problem is?