cmp-nct / ggllm.cpp

Falcon LLM ggml framework with CPU and GPU support

Debug Timings No Longer Working #55

Closed boricuapab closed 1 year ago

boricuapab commented 1 year ago

It seems the --debug-timings flag no longer works with the new ggcc models: the run below prints the graph header and node count but no per-node timings.

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -m h2ogpt-gm-oasst1-en-2048-falcon-7b-v2.ggccv1.q5_1.bin --color -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952 --gpu-reserve-mb-main 100 --debug-timings 1
main: build = 860 (1d6e234)
falcon.cpp: loading model from h2ogpt-gm-oasst1-en-2048-falcon-7b-v2.ggccv1.q5_1.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65024 |   64784 |  2048 |   4544 |      71 ;   1 |      32 |  7; 7B |     9 |  18176 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7129.00 MB  of 8191.00 MB (in use: 1062.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (211 MB)
falcon_model_load_internal: mem required  =  370.24 MB (+   32.00 MB per state)
falcon_model_load_internal: offloading 32 of 32 layers to GPU, weights offloaded 5117.03 MB
falcon_model_load_internal: estimated VRAM usage: 5150 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_context_prepare: Context falcon_main RAM buffers - key_val =   32.00 MB, Compute =  160.00 MB, Scratch 0 =  124.00 MB, Scratch 1 =   40.14 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   1885 MB |   6306 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  4/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prom. |          Seed |             Finetune | Stop |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
|            |  2048 |     1 |     0 |    16 |    1686779952 |          UNSPECIFIED | #  1 |
+------------+-------+-------+-------+-------+---------------+----------------------+------+

Tell=== GRAPH ===
n_nodes = 1061

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .

cmp-nct commented 1 year ago

I just pushed a bugfix. I wonder whether it still works in llama.cpp? The performance counter handling changed when they started skipping the useless threaded FINALIZE and INIT phases on most operations, but that change also caused the perf counters to be skipped.
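To make the failure mode described above concrete, here is a toy Python model of it. It is only a sketch under stated assumptions: `Node`, `run_node`, and the two boolean flags are hypothetical names invented for illustration, and the assumption that the counters were updated alongside the FINALIZE phase is a guess at the shape of the bug, not a reading of the actual ggml or ggllm.cpp source.

```python
import time
from dataclasses import dataclass


@dataclass
class Node:
    """Toy stand-in for a graph node; not ggml's real tensor struct."""
    name: str
    needs_init: bool = False
    needs_finalize: bool = False
    perf_runs: int = 0
    perf_time_us: float = 0.0


def run_node(node, skip_unused_phases, counters_in_finalize):
    """Run one node and return the list of phases that executed."""
    phases = []
    start = time.perf_counter()

    # INIT runs only if the op needs it, or if nothing is being skipped.
    if node.needs_init or not skip_unused_phases:
        phases.append("INIT")

    # COMPUTE always runs; the sum is a placeholder for real work.
    phases.append("COMPUTE")
    sum(range(1000))

    finalize_ran = node.needs_finalize or not skip_unused_phases
    if finalize_ran:
        phases.append("FINALIZE")

    elapsed_us = (time.perf_counter() - start) * 1e6

    # Assumed bug shape: counters are only updated when FINALIZE ran.
    # Decoupling them from FINALIZE restores the timings.
    if finalize_ran or not counters_in_finalize:
        node.perf_runs += 1
        node.perf_time_us += elapsed_us
    return phases


before = Node("MUL_MAT")
run_node(before, skip_unused_phases=False, counters_in_finalize=True)

regressed = Node("MUL_MAT")
run_node(regressed, skip_unused_phases=True, counters_in_finalize=True)

fixed = Node("MUL_MAT")
run_node(fixed, skip_unused_phases=True, counters_in_finalize=False)

print(before.perf_runs, regressed.perf_runs, fixed.perf_runs)  # 1 0 1
```

In this model the regressed node still computes its result but records zero runs, which would match a timing report that has nothing to print.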

boricuapab commented 1 year ago

Working, thanks
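The per-node rows that --debug-timings prints in the log below follow a regular layout, so they can be aggregated with a short script. The following is a minimal sketch, assuming the row format stays exactly as shown in this issue; `ROW`, `parse_rows`, and `wall_by_op` are names made up for this example and are not part of ggllm.cpp. Judging by the header (`Time / Avg_Time`), the second number in each pair appears to be the first divided by the Runs count, for example 1.038 / 4 ≈ 0.260.

```python
import re
from collections import defaultdict

# Matches one timing row; the header row is skipped because "Idx" is not a number.
ROW = re.compile(
    r"^\s*-\s*(?P<idx>\d+):\s.*?\s(?P<op>[A-Z][A-Z_0-9]*)\s+"
    r"\(\s*(?P<runs>\d+)\)\s+"
    r"cpu =\s*(?P<cpu>[\d.]+)\s*/\s*(?P<cpu_avg>[\d.]+) ms, "
    r"wall =\s*(?P<wall>[\d.]+)\s*/\s*(?P<wall_avg>[\d.]+) ms "
    r"\[\s*(?P<layer>\d+) (?P<name>.*)\] \[(?P<device>[^\]]+)\]"
)


def parse_rows(lines):
    """Return one dict per timing row, ignoring lines that do not match."""
    rows = []
    for line in lines:
        m = ROW.match(line)
        if m is None:
            continue
        rows.append({
            "idx": int(m["idx"]),
            "op": m["op"],
            "runs": int(m["runs"]),
            "wall_ms": float(m["wall"]),
            "wall_avg_ms": float(m["wall_avg"]),
            "layer": int(m["layer"]),
            "name": m["name"],
            "device": m["device"],
        })
    return rows


def wall_by_op(rows):
    """Sum total wall time per operation, largest first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["op"]] += row["wall_ms"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Rows copied from the log in this issue.
SAMPLE = [
    " - Idx: [   S00,   S01, S02]x[   S10,   S11, S12]=[    D0,    D1,  D2]         Op_Label Gx (Runs) cpu =  Cycles / Avg_Cyc ms, wall =    Time / Avg_Time ms [Layer Name] Device  Impact",
    " -   2: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  4) cpu =   1.000 /   0.250 ms, wall =   1.038 /   0.260 ms [  1 node_2] [CPU]",
    " -   4: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  4) cpu =   2.000 /   0.500 ms, wall =   1.811 /   0.453 ms [  1 node_4] [GPUxQ]  (Slow)",
]

if __name__ == "__main__":
    for op, ms in wall_by_op(parse_rows(SAMPLE)).items():
        print(f"{op:>10} {ms:8.3f} ms")
```

Feeding the full captured output to `parse_rows` instead of `SAMPLE` would give a per-operation total, which makes it easier to see that the MUL_MAT nodes dominate the wall time.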

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -m falcon-7b-instruct.ggccv1.q5_1.bin --color -c 16384 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins -r "User" --gpu-reserve-mb-main 100 --debug-timings 1
main: info: context size is large (16384), reducing default temperature to 0.7.
main: build = 878 (60f82ca)
falcon.cpp: loading model from falcon-7b-instruct.ggccv1.q5_1.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65024 |   64784 | 16384 |   4544 |      71 ;   1 |      32 |  7; 7B |     9 |  18176 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7125.00 MB  of 8191.00 MB (in use: 1066.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (211 MB)
falcon_model_load_internal: mem required  =  374.25 MB (+  384.00 MB per state)
falcon_model_load_internal: offloading 32 of 32 layers to GPU, weights offloaded 5117.03 MB
falcon_model_load_internal: estimated VRAM usage: 5150 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_context_prepare: Context falcon_main RAM buffers - key_val =  128.00 MB, Compute =  160.00 MB, Scratch 0 =  128.00 MB, Scratch 1 =   40.14 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   1885 MB |   6306 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  4/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

main: interactive mode on.
Reverse prompt: 'User'
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.70 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prom. |          Seed |             Finetune | Stop |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
|            | 16384 |     1 |     0 |     0 |    1689497194 |       FALCONINSTRUCT | #  1 |
+------------+-------+-------+-------+-------+---------------+----------------------+------+

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to ggLLM.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

> Tell me about bumble bees
=== GRAPH ===
n_nodes = 997
 - Idx: [   S00,   S01, S02]x[   S10,   S11, S12]=[    D0,    D1,  D2]         Op_Label Gx (Runs) cpu =  Cycles / Avg_Cyc ms, wall =    Time / Avg_Time ms [Layer Name] Device  Impact
 -   0: [  4544, 65024,   1]x[     1,     1,   1]=[  4544,     1,   1]         GET_ROWS   (  1) cpu =   0.000 /   0.000 ms, wall =   0.018 /   0.018 ms [  0 node_0] [CPU]
 -   1: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  3) cpu =   0.000 /   0.000 ms, wall =   0.013 /   0.004 ms [  1 node_1] [CPU]
 -   2: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  4) cpu =   1.000 /   0.250 ms, wall =   1.038 /   0.260 ms [  1 node_2] [CPU]
 -   3: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.002 ms [  1 inpFF] [CPU]
 -   4: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  4) cpu =   2.000 /   0.500 ms, wall =   1.811 /   0.453 ms [  1 node_4] [GPUxQ]  (Slow)
 -   5: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 Kcur] [CPU]
 -   6: [    64,     1,   1]x[     4,     1,   1]=[    64,     1,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.002 ms [  1 Kcur (view)] [CPU]
 -   7: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 k] [GPU]
 -   8: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [  1 k (copy of Kcur (view))] [CPU]
 -   9: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 Vcur] [CPU]
 -  10: [    64,     1,   1]x[    47,    47,  47]=[     1,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 Vcur (permuted)] [CPU]
 -  11: [33554432,     1,   1]x[    47,    47,  47]=[     1,    64,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 V] [GPU]
 -  12: [     1,    64,   1]x[     1,    64,   1]=[     1,    64,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 V_new (copy of Vcur (permuted))] [CPU]
 -  13: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  4) cpu =   6.000 /   1.500 ms, wall =   6.216 /   1.554 ms [  1 inpFF*ff_up] [GPUxQ]  (Slow)
 -  14: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  4) cpu =   0.000 /   0.000 ms, wall =   0.087 /   0.022 ms [  1 inpFF*ff_up (view)] [CPU]
 -  15: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  3) cpu =   8.000 /   2.667 ms, wall =   8.058 /   2.686 ms [  1 gelu_cur*ff_down] [GPUxQ]  (Slow)
 -  16: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 cache_k (view)] [GPU]
 -  17: [    64,     1,   1]x[    47,    47,  47]=[    64,     1,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 K] [CPU]
 -  18: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 Qcur] [CPU]
 -  19: [    64,    71,   1]x[     4,     1,   1]=[    64,    71,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.020 /   0.005 ms [  1 Qcur (view)] [CPU]
 -  20: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 Q] [CPU]
 -  21: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.003 ms [  1 KQ] [CPU]
 -  22: [     1,     1,  71]x[     1,     1,   1]=[     1,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 KQ_scaled] [CPU]
 -  23: [     1,     1,  71]x[     2,     1,   1]=[     1,     1,  71]    DIAG_MASK_INF   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [  1 KQ_masked] [CPU]
 -  24: [     1,     1,  71]x[    47,    47,  47]=[     1,     1,  71]         SOFT_MAX   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [  1 KQ_soft_max] [CPU]
 -  25: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.039 /   0.010 ms [  1 KQV] [CPU]
 -  26: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 KQV_merged] [CPU]
 -  27: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.002 ms [  1 KQV_merged (copy)] [CPU]
 -  28: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =   2.000 /   0.500 ms, wall =   1.684 /   0.421 ms [  1 result_wo] [GPUxQ]  (Slow)
 -  29: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.002 ms [  1 attn_out] [CPU]
 -  30: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.001 ms [  1 node_30] [CPU]
 -  31: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.001 ms [  1 inpFF_+_result_attn_out] [CPU]
 -  32: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.003 ms [  2 node_32] [CPU]
 -  33: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  4) cpu =   0.000 /   0.000 ms, wall =   0.127 /   0.032 ms [  2 node_33] [CPU]
 -  34: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [  2 inpFF] [CPU]
 -  35: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  4) cpu =   2.000 /   0.500 ms, wall =   1.733 /   0.433 ms [  2 node_35] [GPUxQ]  (Slow)
 -  36: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 Kcur] [CPU]
 -  37: [    64,     1,   1]x[     4,     1,   1]=[    64,     1,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [  2 Kcur (view)] [CPU]
 -  38: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 k] [GPU]
 -  39: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [  2 k (copy of Kcur (view))] [CPU]
 -  40: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 Vcur] [CPU]
 -  41: [    64,     1,   1]x[    47,    47,  47]=[     1,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 Vcur (permuted)] [CPU]
 -  42: [33554432,     1,   1]x[    47,    47,  47]=[     1,    64,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 V] [GPU]
 -  43: [     1,    64,   1]x[     1,    64,   1]=[     1,    64,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [  2 V_new (copy of Vcur (permuted))] [CPU]
 -  44: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  4) cpu =   6.000 /   1.500 ms, wall =   6.191 /   1.548 ms [  2 inpFF*ff_up] [GPUxQ]  (Slow)
 -  45: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  4) cpu =   0.000 /   0.000 ms, wall =   0.035 /   0.009 ms [  2 inpFF*ff_up (view)] [CPU]
 -  46: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =  11.000 /   2.750 ms, wall =  10.298 /   2.575 ms [  2 gelu_cur*ff_down] [GPUxQ]  (Slow)
 -  47: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 cache_k (view)] [GPU]
 -  48: [    64,     1,   1]x[    47,    47,  47]=[    64,     1,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 K] [CPU]
 -  49: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 Qcur] [CPU]
 -  50: [    64,    71,   1]x[     4,     1,   1]=[    64,    71,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.020 /   0.005 ms [  2 Qcur (view)] [CPU]
 -  51: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 Q] [CPU]
 -  52: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.002 ms [  2 KQ] [CPU]
 -  53: [     1,     1,  71]x[     1,     1,   1]=[     1,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 KQ_scaled] [CPU]
 -  54: [     1,     1,  71]x[     2,     1,   1]=[     1,     1,  71]    DIAG_MASK_INF   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [  2 KQ_masked] [CPU]
 -  55: [     1,     1,  71]x[    47,    47,  47]=[     1,     1,  71]         SOFT_MAX   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [  2 KQ_soft_max] [CPU]
 -  56: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.033 /   0.008 ms [  2 KQV] [CPU]
 -  57: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 KQV_merged] [CPU]
 -  58: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [  2 KQV_merged (copy)] [CPU]
 -  59: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =   1.000 /   0.250 ms, wall =   1.695 /   0.424 ms [  2 result_wo] [GPUxQ]  (Slow)
 -  60: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.002 ms [  2 attn_out] [CPU]
 -  61: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [  2 node_61] [CPU]
 -  62: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [  2 inpFF_+_result_attn_out] [CPU]
 -  63: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.002 ms [  3 node_63] [CPU]
 -  64: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  4) cpu =   0.000 /   0.000 ms, wall =   0.124 /   0.031 ms [  3 node_64] [CPU]
 -  65: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.001 ms [  3 inpFF] [CPU]
 -  66: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  4) cpu =   2.000 /   0.500 ms, wall =   1.748 /   0.437 ms [  3 node_66] [GPUxQ]  (Slow)
 -  67: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 Kcur] [CPU]
 -  68: [    64,     1,   1]x[     4,     1,   1]=[    64,     1,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [  3 Kcur (view)] [CPU]
 -  69: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 k] [GPU]
 -  70: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 k (copy of Kcur (view))] [CPU]
 -  71: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 Vcur] [CPU]
 -  72: [    64,     1,   1]x[    47,    47,  47]=[     1,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 Vcur (permuted)] [CPU]
 -  73: [33554432,     1,   1]x[    47,    47,  47]=[     1,    64,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 V] [GPU]
 -  74: [     1,    64,   1]x[     1,    64,   1]=[     1,    64,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [  3 V_new (copy of Vcur (permuted))] [CPU]
 -  75: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  4) cpu =   6.000 /   1.500 ms, wall =   6.191 /   1.548 ms [  3 inpFF*ff_up] [GPUxQ]  (Slow)
 -  76: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  4) cpu =   0.000 /   0.000 ms, wall =   0.020 /   0.005 ms [  3 inpFF*ff_up (view)] [CPU]
 -  77: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =   7.000 /   1.750 ms, wall =   6.492 /   1.623 ms [  3 gelu_cur*ff_down] [GPUxQ]  (Slow)
 -  78: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 cache_k (view)] [GPU]
 -  79: [    64,     1,   1]x[    47,    47,  47]=[    64,     1,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 K] [CPU]
 -  80: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 Qcur] [CPU]
 -  81: [    64,    71,   1]x[     4,     1,   1]=[    64,    71,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.022 /   0.005 ms [  3 Qcur (view)] [CPU]
 -  82: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 Q] [CPU]
 -  83: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.002 ms [  3 KQ] [CPU]
 -  84: [     1,     1,  71]x[     1,     1,   1]=[     1,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 KQ_scaled] [CPU]
 -  85: [     1,     1,  71]x[     2,     1,   1]=[     1,     1,  71]    DIAG_MASK_INF   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [  3 KQ_masked] [CPU]
 -  86: [     1,     1,  71]x[    47,    47,  47]=[     1,     1,  71]         SOFT_MAX   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [  3 KQ_soft_max] [CPU]
 -  87: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.037 /   0.009 ms [  3 KQV] [CPU]
 -  88: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 KQV_merged] [CPU]
 -  89: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [  3 KQV_merged (copy)] [CPU]
 -  90: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =   2.000 /   0.500 ms, wall =   1.690 /   0.422 ms [  3 result_wo] [GPUxQ]  (Slow)
 -  91: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [  3 attn_out] [CPU]
 -  92: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [  3 node_92] [CPU]
 -  93: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [  3 inpFF_+_result_attn_out] [CPU]

 - 900: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.002 ms [ 30 node_900] [CPU]
 - 901: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  4) cpu =   0.000 /   0.000 ms, wall =   0.102 /   0.025 ms [ 30 node_901] [CPU]
 - 902: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 30 inpFF] [CPU]
 - 903: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.279 /   0.070 ms [ 30 node_903] [GPUxQ]
 - 904: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 Kcur] [CPU]
 - 905: [    64,     1,   1]x[     4,     1,   1]=[    64,     1,   1]             ROPE   (  3) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [ 30 Kcur (view)] [CPU]
 - 906: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 k] [GPU]
 - 907: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 30 k (copy of Kcur (view))] [CPU]
 - 908: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 Vcur] [CPU]
 - 909: [    64,     1,   1]x[    47,    47,  47]=[     1,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 Vcur (permuted)] [CPU]
 - 910: [33554432,     1,   1]x[    47,    47,  47]=[     1,    64,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 V] [GPU]
 - 911: [     1,    64,   1]x[     1,    64,   1]=[     1,    64,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 30 V_new (copy of Vcur (permuted))] [CPU]
 - 912: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  4) cpu =   1.000 /   0.250 ms, wall =   0.815 /   0.204 ms [ 30 inpFF*ff_up] [GPUxQ]
 - 913: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  4) cpu =   0.000 /   0.000 ms, wall =   0.018 /   0.004 ms [ 30 inpFF*ff_up (view)] [CPU]
 - 914: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =   1.000 /   0.250 ms, wall =   0.801 /   0.200 ms [ 30 gelu_cur*ff_down] [GPUxQ]
 - 915: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 cache_k (view)] [GPU]
 - 916: [    64,     1,   1]x[    47,    47,  47]=[    64,     1,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 K] [CPU]
 - 917: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 Qcur] [CPU]
 - 918: [    64,    71,   1]x[     4,     1,   1]=[    64,    71,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.005 ms [ 30 Qcur (view)] [CPU]
 - 919: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 Q] [CPU]
 - 920: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.012 /   0.003 ms [ 30 KQ] [CPU]
 - 921: [     1,     1,  71]x[     1,     1,   1]=[     1,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 KQ_scaled] [CPU]
 - 922: [     1,     1,  71]x[     2,     1,   1]=[     1,     1,  71]    DIAG_MASK_INF   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 30 KQ_masked] [CPU]
 - 923: [     1,     1,  71]x[    47,    47,  47]=[     1,     1,  71]         SOFT_MAX   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [ 30 KQ_soft_max] [CPU]
 - 924: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.035 /   0.009 ms [ 30 KQV] [CPU]
 - 925: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 KQV_merged] [CPU]
 - 926: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 30 KQV_merged (copy)] [CPU]
 - 927: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.284 /   0.071 ms [ 30 result_wo] [GPUxQ]
 - 928: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 30 attn_out] [CPU]
 - 929: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [ 30 node_929] [CPU]
 - 930: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 30 inpFF_+_result_attn_out] [CPU]
 - 931: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.003 ms [ 31 node_931] [CPU]
 - 932: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  3) cpu =   0.000 /   0.000 ms, wall =   0.104 /   0.035 ms [ 31 node_932] [CPU]
 - 933: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  3) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 31 inpFF] [CPU]
 - 934: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.281 /   0.070 ms [ 31 node_934] [GPUxQ]
 - 935: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 Kcur] [CPU]
 - 936: [    64,     1,   1]x[     4,     1,   1]=[    64,     1,   1]             ROPE   (  3) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 31 Kcur (view)] [CPU]
 - 937: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 k] [GPU]
 - 938: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 31 k (copy of Kcur (view))] [CPU]
 - 939: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 Vcur] [CPU]
 - 940: [    64,     1,   1]x[    47,    47,  47]=[     1,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 Vcur (permuted)] [CPU]
 - 941: [33554432,     1,   1]x[    47,    47,  47]=[     1,    64,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 V] [GPU]
 - 942: [     1,    64,   1]x[     1,    64,   1]=[     1,    64,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 31 V_new (copy of Vcur (permuted))] [CPU]
 - 943: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  4) cpu =   1.000 /   0.250 ms, wall =   0.826 /   0.206 ms [ 31 inpFF*ff_up] [GPUxQ]
 - 944: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  4) cpu =   0.000 /   0.000 ms, wall =   0.015 /   0.004 ms [ 31 inpFF*ff_up (view)] [CPU]
 - 945: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =   1.000 /   0.250 ms, wall =   0.800 /   0.200 ms [ 31 gelu_cur*ff_down] [GPUxQ]
 - 946: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 cache_k (view)] [GPU]
 - 947: [    64,     1,   1]x[    47,    47,  47]=[    64,     1,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 K] [CPU]
 - 948: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 Qcur] [CPU]
 - 949: [    64,    71,   1]x[     4,     1,   1]=[    64,    71,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.005 ms [ 31 Qcur (view)] [CPU]
 - 950: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 Q] [CPU]
 - 951: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.003 ms [ 31 KQ] [CPU]
 - 952: [     1,     1,  71]x[     1,     1,   1]=[     1,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 KQ_scaled] [CPU]
 - 953: [     1,     1,  71]x[     2,     1,   1]=[     1,     1,  71]    DIAG_MASK_INF   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 31 KQ_masked] [CPU]
 - 954: [     1,     1,  71]x[    47,    47,  47]=[     1,     1,  71]         SOFT_MAX   (  3) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [ 31 KQ_soft_max] [CPU]
 - 955: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.035 /   0.009 ms [ 31 KQV] [CPU]
 - 956: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 KQV_merged] [CPU]
 - 957: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 31 KQV_merged (copy)] [CPU]
 - 958: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.280 /   0.070 ms [ 31 result_wo] [GPUxQ]
 - 959: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 31 attn_out] [CPU]
 - 960: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.001 ms [ 31 node_960] [CPU]
 - 961: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 31 inpFF_+_result_attn_out] [CPU]
 - 962: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.002 ms [ 32 node_962] [CPU]
 - 963: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  3) cpu =   1.000 /   0.333 ms, wall =   0.107 /   0.036 ms [ 32 node_963] [CPU]
 - 964: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 32 inpFF] [CPU]
 - 965: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.284 /   0.071 ms [ 32 node_965] [GPUxQ]
 - 966: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 Kcur] [CPU]
 - 967: [    64,     1,   1]x[     4,     1,   1]=[    64,     1,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 32 Kcur (view)] [CPU]
 - 968: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 k] [GPU]
 - 969: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 32 k (copy of Kcur (view))] [CPU]
 - 970: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 Vcur] [CPU]
 - 971: [    64,     1,   1]x[    47,    47,  47]=[     1,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 Vcur (permuted)] [CPU]
 - 972: [33554432,     1,   1]x[    47,    47,  47]=[     1,    64,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 V] [GPU]
 - 973: [     1,    64,   1]x[     1,    64,   1]=[     1,    64,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 32 V_new (copy of Vcur (permuted))] [CPU]
 - 974: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  4) cpu =   1.000 /   0.250 ms, wall =   0.821 /   0.205 ms [ 32 inpFF*ff_up] [GPUxQ]
 - 975: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  4) cpu =   0.000 /   0.000 ms, wall =   0.015 /   0.004 ms [ 32 inpFF*ff_up (view)] [CPU]
 - 976: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.798 /   0.200 ms [ 32 gelu_cur*ff_down] [GPUxQ]
 - 977: [33554432,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 cache_k (view)] [GPU]
 - 978: [    64,     1,   1]x[    47,    47,  47]=[    64,     1,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 K] [CPU]
 - 979: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 Qcur] [CPU]
 - 980: [    64,    71,   1]x[     4,     1,   1]=[    64,    71,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.005 ms [ 32 Qcur (view)] [CPU]
 - 981: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 Q] [CPU]
 - 982: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.003 ms [ 32 KQ] [CPU]
 - 983: [     1,     1,  71]x[     1,     1,   1]=[     1,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 KQ_scaled] [CPU]
 - 984: [     1,     1,  71]x[     2,     1,   1]=[     1,     1,  71]    DIAG_MASK_INF   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 32 KQ_masked] [CPU]
 - 985: [     1,     1,  71]x[    47,    47,  47]=[     1,     1,  71]         SOFT_MAX   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 32 KQ_soft_max] [CPU]
 - 986: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms, wall =   0.035 /   0.009 ms [ 32 KQV] [CPU]
 - 987: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 KQV_merged] [CPU]
 - 988: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 32 KQV_merged (copy)] [CPU]
 - 989: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  4) cpu =   1.000 /   0.250 ms, wall =   0.285 /   0.071 ms [ 32 result_wo] [GPUxQ]
 - 990: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 32 attn_out] [CPU]
 - 991: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 32 node_991] [CPU]
 - 992: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 32 inpFF_+_result_attn_out] [CPU]
 - 993: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.002 ms [  0 norm_cur] [CPU]
 - 994: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [  0 node_994] [CPU]
 - 995: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  3) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [  0 result_norm] [CPU]
 - 996: [  4544, 65024,   1]x[  4544,     1,   1]=[ 65024,     1,   1]          MUL_MAT   (  3) cpu =   2.000 /   0.667 ms, wall =   2.637 /   0.879 ms [  0 result_lm_head] [GPUxQ]  (Slow)
perf_total_per_op_us[             ADD] =   0.362 ms
perf_total_per_op_us[             MUL] =   4.389 ms
perf_total_per_op_us[            GELU] =   0.748 ms
perf_total_per_op_us[            NORM] =   0.258 ms
perf_total_per_op_us[         MUL_MAT] = 349.873 ms
perf_total_per_op_us[           SCALE] =   0.032 ms
perf_total_per_op_us[             CPY] =   0.437 ms
perf_total_per_op_us[            VIEW] =   0.192 ms
perf_total_per_op_us[         PERMUTE] =   0.128 ms
perf_total_per_op_us[        GET_ROWS] =   0.018 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.085 ms
perf_total_per_op_us[        SOFT_MAX] =   0.088 ms
perf_total_per_op_us[            ROPE] =   0.925 ms
========================================
 Bumble bees are a type of bee known for their unique ability to buzz and sting. They are typically rounder than honey bees and have a bright yellow and black stripy pattern on their abdomen. They are important pollinators for many plants, helping to produce fruits and vegetables. Their larvae are herbivorous and feed on nectar and pollen. Bumble bees are social insects and live in large colonies with a queen bee, worker bees, and drone bees. They are often seen collecting nectar from flowers in the early morning and evening.

> Tell me about wasps
 Wasps are a type of bee that are known for their distinctive narrow shape and bright yellow and black stripy pattern on their abdomen. They are important pollinators for many plants and can sting humans. They feed on nectar, honey, and other insects. Wasps are social insects and live in large, organized colonies with a queen bee and worker bees.
User
>
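
The per-node lines in the timing dump above follow a fixed layout, so the wall-clock totals per operation can be recomputed from a saved copy of the log. Below is a minimal sketch, assuming the console output was captured to text and that each node line looks like the ones shown here. The function name `sum_wall_ms_by_op` and the regular expression are illustrative and not part of ggllm.cpp. In this log, the second value of each `a / b` pair appears to be the first divided by the parenthesized count (for example 0.284 / 4 = 0.071), so only the first value is summed.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the log above:
#  - 927: [..]x[..]=[..]   MUL_MAT   (  4) cpu = 0.000 / 0.000 ms, wall = 0.284 / 0.071 ms [ 30 result_wo] [GPUxQ]
NODE_RE = re.compile(
    r"^\s*-\s*(?P<idx>\d+):.*?\]\s+(?P<op>[A-Z_]+)\s+\(\s*(?P<count>\d+)\)\s+"
    r"cpu\s*=\s*(?P<cpu>[\d.]+)\s*/\s*[\d.]+\s*ms,\s*"
    r"wall\s*=\s*(?P<wall>[\d.]+)\s*/\s*[\d.]+\s*ms"
)


def sum_wall_ms_by_op(lines):
    """Sum the first 'wall' value of every node line, grouped by op name.

    Lines that do not match the node layout (banners, perf_total lines,
    model output) are skipped.
    """
    totals = defaultdict(float)
    for line in lines:
        match = NODE_RE.search(line)
        if match:
            totals[match.group("op")] += float(match.group("wall"))
    return dict(totals)


# Three node lines and one non-node line copied from the log above.
SAMPLE = [
    " - 927: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]"
    "          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms,"
    " wall =   0.284 /   0.071 ms [ 30 result_wo] [GPUxQ]",
    " - 929: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]"
    "              ADD   (  4) cpu =   0.000 /   0.000 ms,"
    " wall =   0.003 /   0.001 ms [ 30 node_929] [CPU]",
    " - 958: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]"
    "          MUL_MAT   (  4) cpu =   0.000 /   0.000 ms,"
    " wall =   0.280 /   0.070 ms [ 31 result_wo] [GPUxQ]",
    "perf_total_per_op_us[             ADD] =   0.362 ms",
]

if __name__ == "__main__":
    # Print ops from slowest to fastest by summed wall time.
    for op, ms in sorted(sum_wall_ms_by_op(SAMPLE).items(), key=lambda kv: -kv[1]):
        print(f"{op:>15} {ms:8.3f} ms")
```

Run against a full log, this gives a quick way to compare the summed node timings with the `perf_total_per_op_us` lines and to spot which ops dominate (here `MUL_MAT`).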