cmp-nct / ggllm.cpp

Falcon LLM ggml framework with CPU and GPU support

Mul_mat Speedup?? #31

Open boricuapab opened 1 year ago

boricuapab commented 1 year ago

I'm not too familiar with mul_mat, but it seems like it is the part of the process that takes the longest. Can that be optimized even further?

The current speed is great for a Falcon model; I had tested the original GPTQ ones and those were so slow in ooba.
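For reference, a back-of-the-envelope sketch of what the MUL_MAT timings below imply (a standalone C snippet, not ggllm.cpp code; shapes and timings are copied from the trace below, and single-token decode is essentially matrix-vector work, so it is bound by how fast the weights can be streamed out of VRAM):

```c
/* Rough estimate, using Falcon-7B shapes from the trace below:
 * n_embd=4544, n_ff=18176, 32 layers, fused QKV output dim 4672. */
#include <stdio.h>

int main(void) {
    const double qkv  = 2.0 * 4544 * 4672;   /* node_4: fused QKV projection */
    const double wo   = 2.0 * 4544 * 4544;   /* result_wo                    */
    const double ffup = 2.0 * 4544 * 18176;  /* inpFF*ff_up                  */
    const double ffdn = 2.0 * 18176 * 4544;  /* gelu_cur*ff_down             */
    const double lmh  = 2.0 * 4544 * 65024;  /* result_lm_head, once         */
    const double flops = 32.0 * (qkv + wo + ffup + ffdn) + lmh;

    const double mulmat_s   = 74.875e-3;        /* MUL_MAT total per token    */
    const double weights_gb = 4951.14 / 1024.0; /* MB offloaded, per the log  */

    printf("~%.1f GFLOP/token -> %.0f GFLOPS effective\n",
           flops / 1e9, flops / 1e9 / mulmat_s);
    printf("~%.1f GB of weights/token -> %.0f GB/s effective bandwidth\n",
           weights_gb, weights_gb / mulmat_s);
    return 0;
}
```

Dividing the ~4.8 GB of offloaded weights by the ~75 ms MUL_MAT total gives only ~65 GB/s effective, well under the 2060 SUPER's roughly 448 GB/s of raw memory bandwidth, so in principle there is still a lot of headroom.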

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 2 -b 1 -ngl 100 -m wizard-falcon-7b.ggmlv3.q5_1.bin --color -c 2048 -p "What is the difference between a falcon and an eagle?\n### Response:" -s 1686779952 --gpu-reserve-mb-main 1 --debug-timings 1
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 796 (c4d4d5f)
falcon.cpp: loading model from wizard-falcon-7b.ggmlv3.q5_1.bin
falcon.cpp: file version 4
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggml v3 |   65025 |  2048 |   4544 |      71 ;   1 |      32 |  7; 7B |     9 |  18176 |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7157.00 MB  of 8191.00 MB (in use: 1034.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (211 MB)
falcon_model_load_internal: mem required  = 2004.00 MB (+   32.00 MB per state)
falcon_model_load_internal: offloading 32 of 32 layers to GPU, weights offloaded 4951.14 MB
falcon_model_load_internal: estimated VRAM usage: 4952 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_init_from_file: kv self size  =   32.00 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   1885 MB |   6306 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+---------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  System Info  | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+---------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  2/16 threads | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+---------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+====---+-------+------+------+--------+---------+
| Generation |   n_ctx |  n_batch | n_keep | prompt |       seed |
+------------+---------+----------+--------+--------+------------+
|            |    2048 |        1 |      0 |     17 | 1686779952 |
+------------+---------+----------+--------+--------+------------+

What is=== GRAPH ===
n_nodes = 1189
 - Idx: [   S00,   S01, S02]x[   S10,   S11, S12]=[    D0,    D1,  D2]         Op_Label Gx (Runs) cpu =  Cycles / Avg_Cyc ms, wall =    Time / Avg_Time ms [Layer Name] Device  Impact
 -   0: [  4544, 65025,   1]x[     1,     1,   1]=[  4544,     1,   1]         GET_ROWS   (  1) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.009 ms [  0 node_0]   CPU
 -   1: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.056 /   0.056 ms [  1 node_1]   CPU
 -   2: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   1.000 /   1.000 ms, wall =   0.280 /   0.280 ms [  1 node_2]   CPU  (Slow)
 -   3: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  1 inpFF]   CPU
 -   4: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.275 /   0.275 ms [  1 node_4]   GPU
 -   5: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 Kcur]   CPU
 -   6: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [  1 node_6]   CPU
 -   7: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 k]   GPU
 -   8: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 node_8]   CPU
 -   9: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 Vcur]   CPU
 -  10: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 v]   GPU
 -  11: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 node_11]   CPU
 -  12: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.793 /   0.793 ms [  1 inpFF*ff_up]   GPU  (Slow)
 -  13: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.015 /   0.015 ms [  1 node_13]   CPU
 -  14: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.801 /   0.801 ms [  1 gelu_cur*ff_down]   GPU  (Slow)
 -  15: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_15]   GPU
 -  16: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_16]   CPU
 -  17: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_17]   CPU
 -  18: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 node_18]   CPU
 -  19: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.016 /   0.016 ms [  1 V]   CPU
 -  20: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_20]   GPU
 -  21: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_21]   CPU
 -  22: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_22]   CPU
 -  23: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  1 K]   CPU
 -  24: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 Qcur]   CPU
 -  25: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.011 ms [  1 node_25]   CPU
 -  26: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 Q]   CPU
 -  27: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [  1 KQ]   CPU
 -  28: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 KQ_scaled]   CPU
 -  29: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  1 KQ_masked]   CPU
 -  30: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [  1 KQ_soft_max]   CPU
 -  31: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.024 /   0.024 ms [  1 KQV]   CPU
 -  32: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 KQV_merged]   CPU
 -  33: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  1 node_33]   CPU
 -  34: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.269 /   0.269 ms [  1 result_wo]   GPU
 -  35: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  1 attn_out]   CPU
 -  36: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  1 node_36]   CPU
 -  37: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  1 inpFF_+_result_attn_out]   CPU
 -  38: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.008 ms [  2 node_38]   CPU
 -  39: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.088 /   0.088 ms [  2 node_39]   CPU
 -  40: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  2 inpFF]   CPU
 -  41: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.272 /   0.272 ms [  2 node_41]   GPU
 -  42: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 Kcur]   CPU
 -  43: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  2 node_43]   CPU
 -  44: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 k]   GPU
 -  45: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 node_45]   CPU
 -  46: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 Vcur]   CPU
 -  47: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 v]   GPU
 -  48: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 node_48]   CPU
 -  49: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.786 /   0.786 ms [  2 inpFF*ff_up]   GPU  (Slow)
 -  50: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.014 /   0.014 ms [  2 node_50]   CPU
 -  51: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.797 /   0.797 ms [  2 gelu_cur*ff_down]   GPU  (Slow)
 -  52: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 node_52]   GPU
 -  53: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 node_53]   CPU
 -  54: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 node_54]   CPU
 -  55: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 node_55]   CPU
 -  56: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.023 /   0.023 ms [  2 V]   CPU
 -  57: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 node_57]   GPU
 -  58: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 node_58]   CPU
 -  59: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 node_59]   CPU
 -  60: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.010 ms [  2 K]   CPU
 -  61: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 Qcur]   CPU
 -  62: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.010 ms [  2 node_62]   CPU
 -  63: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 Q]   CPU
 -  64: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  2 KQ]   CPU
 -  65: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 KQ_scaled]   CPU
 -  66: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 KQ_masked]   CPU
 -  67: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  2 KQ_soft_max]   CPU
 -  68: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.021 ms [  2 KQV]   CPU
 -  69: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 KQV_merged]   CPU
 -  70: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  2 node_70]   CPU
 -  71: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.269 /   0.269 ms [  2 result_wo]   GPU
 -  72: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  2 attn_out]   CPU
 -  73: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  2 node_73]   CPU
 -  74: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  2 inpFF_+_result_attn_out]   CPU
 -  75: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.008 ms [  3 node_75]   CPU
 -  76: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.113 /   0.113 ms [  3 node_76]   CPU
 -  77: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 inpFF]   CPU
 -  78: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.272 /   0.272 ms [  3 node_78]   GPU
 -  79: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 Kcur]   CPU
 -  80: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 node_80]   CPU
 -  81: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 k]   GPU
 -  82: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 node_82]   CPU
 -  83: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 Vcur]   CPU
 -  84: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 v]   GPU
 -  85: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 node_85]   CPU
 -  86: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.816 /   0.816 ms [  3 inpFF*ff_up]   GPU  (Slow)
 -  87: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.015 /   0.015 ms [  3 node_87]   CPU
 -  88: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.815 /   0.815 ms [  3 gelu_cur*ff_down]   GPU  (Slow)
 -  89: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_89]   GPU
 -  90: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_90]   CPU
 -  91: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_91]   CPU
 -  92: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 node_92]   CPU
 -  93: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.022 /   0.022 ms [  3 V]   CPU
 -  94: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_94]   GPU
 -  95: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_95]   CPU
 -  96: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_96]   CPU
 -  97: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.010 ms [  3 K]   CPU
 -  98: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 Qcur]   CPU
 -  99: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.009 ms [  3 node_99]   CPU
 - 100: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 Q]   CPU
 - 101: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 KQ]   CPU
 - 102: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 KQ_scaled]   CPU
 - 103: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 KQ_masked]   CPU
 - 104: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 KQ_soft_max]   CPU
 - 105: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.023 /   0.023 ms [  3 KQV]   CPU
 - 106: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 KQV_merged]   CPU
 - 107: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 node_107]   CPU
 - 108: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.292 /   0.292 ms [  3 result_wo]   GPU  (Slow)
 - 109: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 attn_out]   CPU
 - 110: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 node_110]   CPU
 - 111: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 inpFF_+_result_attn_out]   CPU

 - 1074: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [ 30 node_1074]   CPU
 - 1075: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   1.000 /   1.000 ms, wall =   0.110 /   0.110 ms [ 30 node_1075]   CPU
 - 1076: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 30 inpFF]   CPU
 - 1077: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.293 /   0.293 ms [ 30 node_1077]   GPU  (Slow)
 - 1078: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 Kcur]   CPU
 - 1079: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1079]   CPU
 - 1080: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 k]   GPU
 - 1081: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1081]   CPU
 - 1082: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 Vcur]   CPU
 - 1083: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 v]   GPU
 - 1084: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1084]   CPU
 - 1085: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.805 /   0.805 ms [ 30 inpFF*ff_up]   GPU  (Slow)
 - 1086: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.016 /   0.016 ms [ 30 node_1086]   CPU
 - 1087: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.823 /   0.823 ms [ 30 gelu_cur*ff_down]   GPU  (Slow)
 - 1088: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1088]   GPU
 - 1089: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1089]   CPU
 - 1090: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1090]   CPU
 - 1091: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1091]   CPU
 - 1092: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.017 /   0.017 ms [ 30 V]   CPU
 - 1093: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1093]   GPU
 - 1094: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1094]   CPU
 - 1095: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1095]   CPU
 - 1096: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [ 30 K]   CPU
 - 1097: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 Qcur]   CPU
 - 1098: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.009 ms [ 30 node_1098]   CPU
 - 1099: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 Q]   CPU
 - 1100: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 30 KQ]   CPU
 - 1101: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 KQ_scaled]   CPU
 - 1102: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 KQ_masked]   CPU
 - 1103: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 KQ_soft_max]   CPU
 - 1104: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.019 /   0.019 ms [ 30 KQV]   CPU
 - 1105: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 KQV_merged]   CPU
 - 1106: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1106]   CPU
 - 1107: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.292 /   0.292 ms [ 30 result_wo]   GPU  (Slow)
 - 1108: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 30 attn_out]   CPU
 - 1109: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1109]   CPU
 - 1110: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 30 inpFF_+_result_attn_out]   CPU
 - 1111: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [ 31 node_1111]   CPU
 - 1112: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.110 /   0.110 ms [ 31 node_1112]   CPU
 - 1113: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [ 31 inpFF]   CPU
 - 1114: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.293 /   0.293 ms [ 31 node_1114]   GPU  (Slow)
 - 1115: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 Kcur]   CPU
 - 1116: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 node_1116]   CPU
 - 1117: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 k]   GPU
 - 1118: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 node_1118]   CPU
 - 1119: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 Vcur]   CPU
 - 1120: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 v]   GPU
 - 1121: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1121]   CPU
 - 1122: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.805 /   0.805 ms [ 31 inpFF*ff_up]   GPU  (Slow)
 - 1123: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.015 /   0.015 ms [ 31 node_1123]   CPU
 - 1124: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.822 /   0.822 ms [ 31 gelu_cur*ff_down]   GPU  (Slow)
 - 1125: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1125]   GPU
 - 1126: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1126]   CPU
 - 1127: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1127]   CPU
 - 1128: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1128]   CPU
 - 1129: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.017 /   0.017 ms [ 31 V]   CPU
 - 1130: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1130]   GPU
 - 1131: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1131]   CPU
 - 1132: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1132]   CPU
 - 1133: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 31 K]   CPU
 - 1134: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 Qcur]   CPU
 - 1135: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.010 ms [ 31 node_1135]   CPU
 - 1136: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 Q]   CPU
 - 1137: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 KQ]   CPU
 - 1138: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 KQ_scaled]   CPU
 - 1139: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 KQ_masked]   CPU
 - 1140: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 KQ_soft_max]   CPU
 - 1141: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.019 /   0.019 ms [ 31 KQV]   CPU
 - 1142: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 KQV_merged]   CPU
 - 1143: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 node_1143]   CPU
 - 1144: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.292 /   0.292 ms [ 31 result_wo]   GPU  (Slow)
 - 1145: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 attn_out]   CPU
 - 1146: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 31 node_1146]   CPU
 - 1147: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 31 inpFF_+_result_attn_out]   CPU
 - 1148: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [ 32 node_1148]   CPU
 - 1149: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.112 /   0.112 ms [ 32 node_1149]   CPU
 - 1150: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 32 inpFF]   CPU
 - 1151: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.298 /   0.298 ms [ 32 node_1151]   GPU  (Slow)
 - 1152: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 Kcur]   CPU
 - 1153: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 node_1153]   CPU
 - 1154: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 k]   GPU
 - 1155: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1155]   CPU
 - 1156: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 Vcur]   CPU
 - 1157: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 v]   GPU
 - 1158: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1158]   CPU
 - 1159: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.805 /   0.805 ms [ 32 inpFF*ff_up]   GPU  (Slow)
 - 1160: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.016 /   0.016 ms [ 32 node_1160]   CPU
 - 1161: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.821 /   0.821 ms [ 32 gelu_cur*ff_down]   GPU  (Slow)
 - 1162: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 node_1162]   GPU
 - 1163: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1163]   CPU
 - 1164: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1164]   CPU
 - 1165: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1165]   CPU
 - 1166: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.017 /   0.017 ms [ 32 V]   CPU
 - 1167: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1167]   GPU
 - 1168: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1168]   CPU
 - 1169: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1169]   CPU
 - 1170: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 32 K]   CPU
 - 1171: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 Qcur]   CPU
 - 1172: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.011 ms [ 32 node_1172]   CPU
 - 1173: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 Q]   CPU
 - 1174: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 32 KQ]   CPU
 - 1175: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 KQ_scaled]   CPU
 - 1176: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 KQ_masked]   CPU
 - 1177: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 KQ_soft_max]   CPU
 - 1178: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.019 /   0.019 ms [ 32 KQV]   CPU
 - 1179: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 KQV_merged]   CPU
 - 1180: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 node_1180]   CPU
 - 1181: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.293 /   0.293 ms [ 32 result_wo]   GPU  (Slow)
 - 1182: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 attn_out]   CPU
 - 1183: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 node_1183]   CPU
 - 1184: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 32 inpFF_+_result_attn_out]   CPU
 - 1185: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [  0 norm_cur]   CPU
 - 1186: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  0 node_1186]   CPU
 - 1187: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  0 result_norm]   CPU
 - 1188: [  4544, 65025,   1]x[  4544,     1,   1]=[ 65025,     1,   1]          MUL_MAT   (  1) cpu =   2.000 /   2.000 ms, wall =   2.593 /   2.593 ms [  0 result_lm_head]   GPU  (Slow)
perf_total_per_op_us[             ADD] =   0.182 ms
perf_total_per_op_us[             MUL] =   3.658 ms
perf_total_per_op_us[            GELU] =   0.498 ms
perf_total_per_op_us[            NORM] =   0.282 ms
perf_total_per_op_us[         MUL_MAT] =  74.875 ms
perf_total_per_op_us[           SCALE] =   0.032 ms
perf_total_per_op_us[             CPY] =   0.142 ms
perf_total_per_op_us[            CONT] =   0.064 ms
perf_total_per_op_us[            VIEW] =   0.224 ms
perf_total_per_op_us[         PERMUTE] =   0.160 ms
perf_total_per_op_us[        GET_ROWS] =   0.009 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.033 ms
perf_total_per_op_us[        SOFT_MAX] =   0.042 ms
perf_total_per_op_us[            ROPE] =   0.349 ms
perf_total_per_op_us[         REPEAT2] =   0.673 ms
========================================
 the difference between a falcon and an eagle?\n### Response:Falcons are smaller and more agile than eagles. They have curved beaks, sharp talons, and keen eyesight that allows them to swoop down on prey at high speeds. Eagles, on the other hand, are larger and have broader wingspans. They have a hooked beak and powerful legs with sharp claws for grasping their prey. Eagles also have excellent vision but tend to hunt from higher vantage points.<|endoftext|> [end of text]

falcon_print_timings:        load time =  2765.76 ms
falcon_print_timings:      sample time =    30.56 ms /    84 runs   (    0.36 ms per token,  2748.87 tokens per second)
falcon_print_timings:        eval time =  8232.76 ms /   100 runs   (   82.33 ms per token,    12.15 tokens per second)
falcon_print_timings:       total time =  8277.83 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .
cmp-nct commented 1 year ago

Definitely, we are just at the beginning. Especially on smaller VRAM you should see a great improvement with the upcoming commit (hopefully within an hour). The new commit is just finishing up; Wizard in particular was a pain, as it uses custom-shaped tensors and vocabulary.

I am getting up to 60 tokens/sec on 7B, and I've seen more than 22 tokens/sec on 40B, with generations of 2000 tokens working fine. Just check again in half an hour or so.

cmp-nct commented 1 year ago
.\build\bin\Release\falcon_main.exe   -m Q:\models\TheBloke\wizard-falcon-7b.ggmlv3.q5_1.bin -b 1 --color -c 2048 -p "What is the difference between a falcon and an eagle?\n### Response:" -s 1686779952 --debug-timings 0 -e --override-max-gpu 1  -t 2
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 804 (4ca3961)
falcon.cpp: loading model from Q:\models\TheBloke\wizard-falcon-7b.ggmlv3.q5_1.bin
falcon.cpp: file version 4
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggml v3 |   65024 |  2048 |   4544 |      71 ;   1 |      32 |  7; 7B |     9 |  18176 |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 23002.00 MB  of 24563.00 MB (in use: 1561.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon_model_load_internal: Offloading Output head tensor (211 MB)
falcon_model_load_internal: mem required  = 1838.11 MB (+   32.00 MB per state)
falcon_model_load_internal: offloading 32 of 32 layers to GPU, weights offloaded 5117.03 MB
falcon_model_load_internal: estimated VRAM usage: 5150 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_init_from_file: kv self size  =   32.00 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 4090            |   24563 MB |  17664 MB |   6899 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  2/32 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prmpt |          Seed |
+------------+-------+-------+-------+-------+---------------+
|            |  2048 |     1 |     0 |    16 |    1686779952 |
+------------+-------+-------+-------+-------+---------------+

What is the difference between a falcon and an eagle?
### Response:The main difference between a falcon and an eagle is their size. Eagles are larger than falcons, with wingspans of up to 7 feet compared to a falcon's 3-4 foot wingspan. Eagles also have more powerful builds, with heavier bodies and stronger legs. Additionally, eagles have broader, flatter wings that allow for greater lift and control in flight, while falcons have narrower wings with sharper edges that are better suited for diving at prey.<|endoftext|> [end of text]

falcon_print_timings:        load time =  2505.42 ms
falcon_print_timings:      sample time =    21.77 ms /    97 runs   (    0.22 ms per token,  4455.88 tokens per second)
falcon_print_timings:        eval time =  2273.31 ms /   112 runs   (   20.30 ms per token,    49.27 tokens per second)
falcon_print_timings:       total time =  2315.50 ms

I'm quite sure there are optimizations remaining on the mul_mat side, but the next big step is offloading more operations into CUDA.
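One concrete example of what that could look like: in the traces here the GELU over the 18176-wide ff_up activation still runs on the CPU (~0.015 ms per layer). Below is a minimal standalone CUDA sketch of that op; the kernel and launcher names are hypothetical, not ggllm.cpp's actual implementation, and the real win from offloading small ops like this is keeping the activation in VRAM between the two surrounding GPU matmuls rather than the raw compute.

```cuda
#include <cuda_runtime.h>
#include <math.h>

/* Hypothetical sketch: GELU (tanh approximation, as ggml uses) applied on the
 * GPU to the ff_up activation, so the result can stay in VRAM for ff_down. */
__global__ void gelu_f32(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = 0.5f * v * (1.0f + tanhf(0.7978845608f /* sqrt(2/pi) */
                                        * (v + 0.044715f * v * v * v)));
    }
}

/* Launcher; for Falcon-7B single-token decode n would be n_ff = 18176. */
void gelu_cuda(const float *d_x, float *d_y, int n, cudaStream_t stream) {
    const int block = 256;
    gelu_f32<<<(n + block - 1) / block, block, 0, stream>>>(d_x, d_y, n);
}
```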

perf_total_per_op_us[             ADD] =   0.198 ms
perf_total_per_op_us[             MUL] =   2.354 ms
perf_total_per_op_us[            GELU] =   0.349 ms
perf_total_per_op_us[            NORM] =   0.319 ms
perf_total_per_op_us[         MUL_MAT] =  15.442 ms
perf_total_per_op_us[           SCALE] =   0.034 ms
perf_total_per_op_us[             CPY] =   0.173 ms
perf_total_per_op_us[            CONT] =   0.032 ms
perf_total_per_op_us[            VIEW] =   0.224 ms
perf_total_per_op_us[         PERMUTE] =   0.160 ms
perf_total_per_op_us[        GET_ROWS] =   0.006 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.033 ms
perf_total_per_op_us[        SOFT_MAX] =   0.098 ms
perf_total_per_op_us[            ROPE] =   0.257 ms
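Putting the two runs side by side (a quick check with numbers copied from the logs above; the build and the GPU both differ, so this is only indicative), MUL_MAT still dominates per-token time in both:

```c
#include <stdio.h>

int main(void) {
    /* eval ms/token and MUL_MAT ms/graph, copied from the two runs above. */
    const char  *gpu[]       = { "RTX 2060 SUPER", "RTX 4090" };
    const double eval_ms[]   = { 82.33, 20.30 };
    const double mulmat_ms[] = { 74.875, 15.442 };

    for (int i = 0; i < 2; i++)
        printf("%-14s: %4.1f%% of per-token time in MUL_MAT\n",
               gpu[i], 100.0 * mulmat_ms[i] / eval_ms[i]);
    return 0;
}
```
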
boricuapab commented 1 year ago

Very nice! The inference (same seed and prompt) in my case was reduced by 10 seconds, and I got 2 more tokens per second using the commit that has the Wizard-type finetuning (what is that, by the way?).

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 2 -b 1 -ngl 100 -m wizard-falcon-7b.ggmlv3.q5_1.bin --color -c 2048 -p "What is the difference between a falcon and an eagle?\n### Response:" -s 1686779952 --gpu-reserve-mb-main 1 --debug-timings 1
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 805 (748bb29)
falcon.cpp: loading model from wizard-falcon-7b.ggmlv3.q5_1.bin
falcon.cpp: file version 4
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggml v3 |   65024 |  2048 |   4544 |      71 ;   1 |      32 |  7; 7B |     9 |  18176 |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7157.00 MB  of 8191.00 MB (in use: 1034.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon_model_load_internal: Offloading Output head tensor (211 MB)
falcon_model_load_internal: mem required  = 1838.11 MB (+   32.00 MB per state)
falcon_model_load_internal: offloading 32 of 32 layers to GPU, weights offloaded 5117.03 MB
falcon_model_load_internal: estimated VRAM usage: 5150 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_init_from_file: kv self size  =   32.00 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   1885 MB |   6306 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  2/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prmpt |          Seed |
+------------+-------+-------+-------+-------+---------------+
|            |  2048 |     1 |     0 |    17 |    1686779952 |
+------------+-------+-------+-------+-------+---------------+

What=== GRAPH ===
n_nodes = 1093
 - Idx: [   S00,   S01, S02]x[   S10,   S11, S12]=[    D0,    D1,  D2]         Op_Label Gx (Runs) cpu =  Cycles / Avg_Cyc ms, wall =    Time / Avg_Time ms [Layer Name] Device  Impact
 -   4: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.280 /   0.280 ms [  1 node_4] [GPUxQ]
 -  12: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.795 /   0.795 ms [  1 inpFF*ff_up] [GPUxQ]  (Slow)
 -  14: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.797 /   0.797 ms [  1 gelu_cur*ff_down] [GPUxQ]  (Slow)
 -  24: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [  1 KQ] [CPU]
 -  28: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.026 /   0.026 ms [  1 KQV] [CPU]
 -  31: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.274 /   0.274 ms [  1 result_wo] [GPUxQ]
 -  38: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.275 /   0.275 ms [  2 node_38] [GPUxQ]
 -  46: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.796 /   0.796 ms [  2 inpFF*ff_up] [GPUxQ]  (Slow)
 -  48: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.809 /   0.809 ms [  2 gelu_cur*ff_down] [GPUxQ]  (Slow)
 -  58: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  2 KQ] [CPU]
 -  62: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.028 /   0.028 ms [  2 KQV] [CPU]
 -  65: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.272 /   0.272 ms [  2 result_wo] [GPUxQ]
 -  72: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.273 /   0.273 ms [  3 node_72] [GPUxQ]
 -  80: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.798 /   0.798 ms [  3 inpFF*ff_up] [GPUxQ]  (Slow)
 -  82: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.821 /   0.821 ms [  3 gelu_cur*ff_down] [GPUxQ]  (Slow)
 -  92: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  3 KQ] [CPU]
 -  96: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.014 /   0.014 ms [  3 KQV] [CPU]
 -  99: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.299 /   0.299 ms [  3 result_wo] [GPUxQ]  (Slow)

 - 990: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.294 /   0.294 ms [ 30 node_990] [GPUxQ]  (Slow)
 - 998: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.803 /   0.803 ms [ 30 inpFF*ff_up] [GPUxQ]  (Slow)
 - 1000: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.815 /   0.815 ms [ 30 gelu_cur*ff_down] [GPUxQ]  (Slow)
 - 1010: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 30 KQ] [CPU]
 - 1014: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.012 /   0.012 ms [ 30 KQV] [CPU]
 - 1017: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.266 /   0.266 ms [ 30 result_wo] [GPUxQ]
 - 1024: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.292 /   0.292 ms [ 31 node_1024] [GPUxQ]  (Slow)
 - 1032: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.812 /   0.812 ms [ 31 inpFF*ff_up] [GPUxQ]  (Slow)
 - 1034: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.827 /   0.827 ms [ 31 gelu_cur*ff_down] [GPUxQ]  (Slow)
 - 1044: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 31 KQ] [CPU]
 - 1048: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.012 /   0.012 ms [ 31 KQV] [CPU]
 - 1051: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.293 /   0.293 ms [ 31 result_wo] [GPUxQ]  (Slow)
 - 1058: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.297 /   0.297 ms [ 32 node_1058] [GPUxQ]  (Slow)
 - 1066: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.807 /   0.807 ms [ 32 inpFF*ff_up] [GPUxQ]  (Slow)
 - 1068: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.814 /   0.814 ms [ 32 gelu_cur*ff_down] [GPUxQ]  (Slow)
 - 1078: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 32 KQ] [CPU]
 - 1082: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.012 /   0.012 ms [ 32 KQV] [CPU]
 - 1085: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.295 /   0.295 ms [ 32 result_wo] [GPUxQ]  (Slow)
 - 1092: [  4544, 65024,   1]x[  4544,     1,   1]=[ 65024,     1,   1]          MUL_MAT   (  1) cpu =   3.000 /   3.000 ms, wall =   2.591 /   2.591 ms [  0 result_lm_head] [GPUxQ]  (Slow)
perf_total_per_op_us[             ADD] =   0.196 ms
perf_total_per_op_us[             MUL] =   3.513 ms
perf_total_per_op_us[            GELU] =   0.461 ms
perf_total_per_op_us[            NORM] =   0.342 ms
perf_total_per_op_us[         MUL_MAT] =  74.515 ms
perf_total_per_op_us[           SCALE] =   0.034 ms
perf_total_per_op_us[             CPY] =   0.182 ms
perf_total_per_op_us[            CONT] =   0.032 ms
perf_total_per_op_us[            VIEW] =   0.224 ms
perf_total_per_op_us[         PERMUTE] =   0.160 ms
perf_total_per_op_us[        GET_ROWS] =   0.008 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.033 ms
perf_total_per_op_us[        SOFT_MAX] =   0.035 ms
perf_total_per_op_us[            ROPE] =   0.264 ms
========================================
 is the difference between a falcon and an eagle?\n### Response:Falcons are smaller and more agile than eagles. They have curved beaks, sharp talons, and keen eyesight that allows them to swoop down on prey at high speeds. Eagles, on the other hand, are larger and have broader wingspans. They have a hooked beak and powerful legs with sharp claws for grasping their prey. Eagles also have excellent vision but tend to hunt from higher vantage points.<|endoftext|> [end of text]

falcon_print_timings:        load time =  2801.90 ms
falcon_print_timings:      sample time =    38.16 ms /    84 runs   (    0.45 ms per token,  2201.49 tokens per second)
falcon_print_timings:        eval time =  7136.30 ms /   100 runs   (   71.36 ms per token,    14.01 tokens per second)
falcon_print_timings:       total time =  7189.28 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .
boricuapab commented 1 year ago

And I've noticed that falcon-instruct 7B is 5 seconds faster than wizard-falcon 7B (again, same prompt and seed):

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 2 -b 1 -ngl 100 -m falcon7b-instruct.ggmlv3.q5_1.bin --color -c 2048 -p "What is the difference between a falcon and an eagle?\n### Response:" -s 1686779952 --gpu-reserve-mb-main 1 --debug-timings 1
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 805 (748bb29)
falcon.cpp: loading model from falcon7b-instruct.ggmlv3.q5_1.bin
falcon.cpp: file version 4
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggml v3 |   65024 |  2048 |   4544 |      71 ;   1 |      32 |  7; 7B |     9 |  18176 |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7157.00 MB  of 8191.00 MB (in use: 1034.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (211 MB)
falcon_model_load_internal: mem required  = 1838.10 MB (+   32.00 MB per state)
falcon_model_load_internal: offloading 32 of 32 layers to GPU, weights offloaded 5117.03 MB
falcon_model_load_internal: estimated VRAM usage: 5150 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_init_from_file: kv self size  =   32.00 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   1885 MB |   6306 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  2/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prmpt |          Seed |
+------------+-------+-------+-------+-------+---------------+
|            |  2048 |     1 |     0 |    17 |    1686779952 |
+------------+-------+-------+-------+-------+---------------+

What=== GRAPH ===
n_nodes = 1093
 - Idx: [   S00,   S01, S02]x[   S10,   S11, S12]=[    D0,    D1,  D2]         Op_Label Gx (Runs) cpu =  Cycles / Avg_Cyc ms, wall =    Time / Avg_Time ms [Layer Name] Device  Impact
 -   4: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.276 /   0.276 ms [  1 node_4] [GPUxQ]
 -  12: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.830 /   0.830 ms [  1 inpFF*ff_up] [GPUxQ]  (Slow)
 -  14: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.805 /   0.805 ms [  1 gelu_cur*ff_down] [GPUxQ]  (Slow)
 -  24: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  1 KQ] [CPU]
 -  28: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.025 /   0.025 ms [  1 KQV] [CPU]
 -  31: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.275 /   0.275 ms [  1 result_wo] [GPUxQ]
 -  38: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.272 /   0.272 ms [  2 node_38] [GPUxQ]
 -  46: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.801 /   0.801 ms [  2 inpFF*ff_up] [GPUxQ]  (Slow)
 -  48: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.814 /   0.814 ms [  2 gelu_cur*ff_down] [GPUxQ]  (Slow)
 -  58: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [  2 KQ] [CPU]
 -  62: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.029 /   0.029 ms [  2 KQV] [CPU]
 -  65: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.275 /   0.275 ms [  2 result_wo] [GPUxQ]
 -  72: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.272 /   0.272 ms [  3 node_72] [GPUxQ]
 -  80: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.823 /   0.823 ms [  3 inpFF*ff_up] [GPUxQ]  (Slow)
 -  82: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.824 /   0.824 ms [  3 gelu_cur*ff_down] [GPUxQ]  (Slow)
 -  92: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  3 KQ] [CPU]
 -  96: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.014 /   0.014 ms [  3 KQV] [CPU]
 -  99: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.297 /   0.297 ms [  3 result_wo] [GPUxQ]  (Slow)

 - 990: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.294 /   0.294 ms [ 30 node_990] [GPUxQ]  (Slow)
 - 998: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.804 /   0.804 ms [ 30 inpFF*ff_up] [GPUxQ]  (Slow)
 - 1000: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.823 /   0.823 ms [ 30 gelu_cur*ff_down] [GPUxQ]  (Slow)
 - 1010: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 30 KQ] [CPU]
 - 1014: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.011 ms [ 30 KQV] [CPU]
 - 1017: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.292 /   0.292 ms [ 30 result_wo] [GPUxQ]  (Slow)
 - 1024: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.292 /   0.292 ms [ 31 node_1024] [GPUxQ]  (Slow)
 - 1032: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.804 /   0.804 ms [ 31 inpFF*ff_up] [GPUxQ]  (Slow)
 - 1034: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.814 /   0.814 ms [ 31 gelu_cur*ff_down] [GPUxQ]  (Slow)
 - 1044: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 31 KQ] [CPU]
 - 1048: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.012 /   0.012 ms [ 31 KQV] [CPU]
 - 1051: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.291 /   0.291 ms [ 31 result_wo] [GPUxQ]  (Slow)
 - 1058: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.293 /   0.293 ms [ 32 node_1058] [GPUxQ]  (Slow)
 - 1066: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.810 /   0.810 ms [ 32 inpFF*ff_up] [GPUxQ]  (Slow)
 - 1068: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.822 /   0.822 ms [ 32 gelu_cur*ff_down] [GPUxQ]  (Slow)
 - 1078: [    64,     1,   1]x[    64,     1,  71]=[     1,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 32 KQ] [CPU]
 - 1082: [     1,    64,   1]x[     1,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.011 ms [ 32 KQV] [CPU]
 - 1085: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.291 /   0.291 ms [ 32 result_wo] [GPUxQ]  (Slow)
 - 1092: [  4544, 65024,   1]x[  4544,     1,   1]=[ 65024,     1,   1]          MUL_MAT   (  1) cpu =   3.000 /   3.000 ms, wall =   2.596 /   2.596 ms [  0 result_lm_head] [GPUxQ]  (Slow)
perf_total_per_op_us[             ADD] =   0.189 ms
perf_total_per_op_us[             MUL] =   3.520 ms
perf_total_per_op_us[            GELU] =   0.457 ms
perf_total_per_op_us[            NORM] =   0.331 ms
perf_total_per_op_us[         MUL_MAT] =  74.936 ms
perf_total_per_op_us[           SCALE] =   0.034 ms
perf_total_per_op_us[             CPY] =   0.182 ms
perf_total_per_op_us[            CONT] =   0.032 ms
perf_total_per_op_us[            VIEW] =   0.224 ms
perf_total_per_op_us[         PERMUTE] =   0.160 ms
perf_total_per_op_us[        GET_ROWS] =   0.007 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.033 ms
perf_total_per_op_us[        SOFT_MAX] =   0.036 ms
perf_total_per_op_us[            ROPE] =   0.287 ms
========================================
 is the difference between a falcon and an eagle?\n### Response: ###\nFalcons are typically medium-sized birds, while eagles are large birds of prey. Eagles are known for their broad wings and beaks, while falcons are known for their small size and slender beaks. Additionally, while eagles have sharp talons, falcons are known for their curved feathers on their feet which they use to maneuver in the air.<|endoftext|> [end of text]

falcon_print_timings:        load time =  3149.17 ms
falcon_print_timings:      sample time =    35.12 ms /    77 runs   (    0.46 ms per token,  2192.48 tokens per second)
falcon_print_timings:        eval time =  6639.14 ms /    93 runs   (   71.39 ms per token,    14.01 tokens per second)
falcon_print_timings:       total time =  6688.38 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .
boricuapab commented 1 year ago

Here are my speeds using wizard-falcon 40B; I had to increase the thread count to 8 to cut the total time by 15 seconds.

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 8 -b 1 -ngl 100 -m wizard-falcon40b.ggmlv3.q4_K_S.bin --color -c 2048 -p "What is the difference between a falcon and an eagle?\n### Response:" -s 1686779952 --gpu-reserve-mb-main 1 --debug-timings 1
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 805 (748bb29)
falcon.cpp: loading model from wizard-falcon40b.ggmlv3.q4_K_S.bin
falcon.cpp: file version 4
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggml v3 |   65024 |  2048 |   8192 |     128 ;   8 |      60 | 40;40B |    14 |  32768 |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7157.00 MB  of 8191.00 MB (in use: 1034.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon_model_load_internal: Offloading Output head tensor (285 MB)
falcon_model_load_internal: INFO: not enough VRAM to offload layer 18 (missing 15798 MB)
falcon_model_load_internal: INFO: 17 layers will be offloaded to GPU (layers 1 to 18)
falcon_model_load_internal: mem required  = 18956.86 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 17 of 60 layers to GPU, weights offloaded 7076.38 MB
falcon_model_load_internal: estimated VRAM usage: 7109 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_init_from_file: kv self size  =  480.00 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |     46 MB |   8145 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  8/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prmpt |          Seed |
+------------+-------+-------+-------+-------+---------------+
|            |  2048 |     1 |     0 |    17 |    1686779952 |
+------------+-------+-------+-------+-------+---------------+

What=== GRAPH ===
n_nodes = 2225
 - Idx: [   S00,   S01, S02]x[   S10,   S11, S12]=[    D0,    D1,  D2]         Op_Label Gx (Runs) cpu =  Cycles / Avg_Cyc ms, wall =    Time / Avg_Time ms [Layer Name] Device  Impact
 -   4: [  8192,  9216,   1]x[  8192,     1,   1]=[  9216,     1,   1]          MUL_MAT   (  1) cpu =   2.000 /   2.000 ms, wall =   1.955 /   1.955 ms [  1 node_4] [GPUxQ]  (Slow)
 -  15: [  8192, 32768,   1]x[  8192,     1,   1]=[ 32768,     1,   1]          MUL_MAT   (  1) cpu =   6.000 /   6.000 ms, wall =   5.543 /   5.543 ms [  1 inpFF*ff_up] [GPUxQ]  (Slow)
 -  17: [ 32768,  8192,   1]x[ 32768,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  1) cpu =   6.000 /   6.000 ms, wall =   5.960 /   5.960 ms [  1 gelu_cur*ff_down] [GPUxQ]  (Slow)
 -  27: [    64,     1,   8]x[    64,     1, 128]=[     1,     1, 128]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.014 /   0.014 ms [  1 KQ] [CPU]
 -  31: [     1,    64,   8]x[     1,     1, 128]=[    64,     1, 128]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.026 /   0.026 ms [  1 KQV] [CPU]
 -  34: [  8192,  8192,   1]x[  8192,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  1) cpu =   2.000 /   2.000 ms, wall =   1.826 /   1.826 ms [  1 result_wo] [GPUxQ]  (Slow)

 - 2177: [     1,    64,   8]x[     1,     1, 128]=[    64,     1, 128]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.023 /   0.023 ms [ 59 KQV] [CPU]
 - 2180: [  8192,  8192,   1]x[  8192,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   1.257 /   1.257 ms [ 59 result_wo] [CPU]
 - 2187: [  8192,  9216,   1]x[  8192,     1,   1]=[  9216,     1,   1]          MUL_MAT   (  1) cpu =   2.000 /   2.000 ms, wall =   1.475 /   1.475 ms [ 60 node_2187] [CPU]
 - 2198: [  8192, 32768,   1]x[  8192,     1,   1]=[ 32768,     1,   1]          MUL_MAT   (  1) cpu =   5.000 /   5.000 ms, wall =   5.135 /   5.135 ms [ 60 inpFF*ff_up] [CPU]  (Slow)
 - 2200: [ 32768,  8192,   1]x[ 32768,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  1) cpu =   5.000 /   5.000 ms, wall =   5.003 /   5.003 ms [ 60 gelu_cur*ff_down] [CPU]  (Slow)
 - 2210: [    64,     1,   8]x[    64,     1, 128]=[     1,     1, 128]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.014 /   0.014 ms [ 60 KQ] [CPU]
 - 2214: [     1,    64,   8]x[     1,     1, 128]=[    64,     1, 128]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.021 ms [ 60 KQV] [CPU]
 - 2217: [  8192,  8192,   1]x[  8192,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   1.268 /   1.268 ms [ 60 result_wo] [CPU]
 - 2224: [  8192, 65024,   1]x[  8192,     1,   1]=[ 65024,     1,   1]          MUL_MAT   (  1) cpu =   4.000 /   4.000 ms, wall =   4.023 /   4.023 ms [  0 result_lm_head] [GPUxQ]  (Slow)
perf_total_per_op_us[             ADD] =   2.689 ms
perf_total_per_op_us[             MUL] =   1.399 ms
perf_total_per_op_us[            GELU] =   3.636 ms
perf_total_per_op_us[            NORM] =   2.814 ms
perf_total_per_op_us[         MUL_MAT] = 779.294 ms
perf_total_per_op_us[           SCALE] =   0.493 ms
perf_total_per_op_us[             CPY] =   2.200 ms
perf_total_per_op_us[            CONT] =   0.128 ms
perf_total_per_op_us[            VIEW] =   0.424 ms
perf_total_per_op_us[         PERMUTE] =   0.300 ms
perf_total_per_op_us[        GET_ROWS] =   0.010 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.487 ms
perf_total_per_op_us[        SOFT_MAX] =   0.514 ms
perf_total_per_op_us[            ROPE] =   1.404 ms
========================================
 is the difference between a falcon and an eagle?\n### Response:Falcons and eagles are both birds of prey, but there are some distinct differences between them. Falcons are generally smaller than eagles, with a wingspan that ranges from 14 to 56 inches, while eagles have a wingspan that ranges from 5 to 7 feet. Falcons also have a more slender body and longer, pointed wings, while eagles have a larger head and shorter, broader wings. Additionally, falcons are known for their incredible speed and agility in flight, while eagles are known for their strength and ability to carry large prey.<|endoftext|> [end of text]

falcon_print_timings:        load time =  7120.20 ms
falcon_print_timings:      sample time =    56.90 ms /   112 runs   (    0.51 ms per token,  1968.37 tokens per second)
falcon_print_timings:        eval time = 78104.14 ms /   128 runs   (  610.19 ms per token,     1.64 tokens per second)
falcon_print_timings:       total time = 78185.95 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .
boricuapab commented 1 year ago

This is fantastic!!

The optimizations you've released today have cut the total time for wizard-falcon 40B by more than half: on my rig it went from 11.58 minutes down to 4.98 minutes (a ~2.3x speedup), using -t 8 and -b 512.

This was the time it took to generate the story in the Windows install video I made:

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 8 -ngl 100 -m wizard-falcon40b.ggmlv3.q4_K_S.bin --color -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 774 (e97d148)
main: seed  = 1686779952

CUDA Device Summary - 1 devices found
+------------------------------------+------------+-----------+-----------+-----------+-----------+
| Device                             | VRAM Total | VRAM Free | VRAM Used | Split at  | Device ID |
+------------------------------------+------------+-----------+-----------+-----------+-----------+
| NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   7163 MB |   1028 MB |      0.0% |  0 (Main) |
+------------------------------------+------------+-----------+-----------+-----------+-----------+
Total VRAM: 8.00 GB, Total available VRAM: 7.00 GB
--------------------
Preparing CUDA for device(s):
[0]... [done]
falcon.cpp: loading model from wizard-falcon40b.ggmlv3.q4_K_S.bin
falcon.cpp: file version 4
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65025
falcon_model_load_internal: n_ctx      = 2048
falcon_model_load_internal: n_embd     = 8192
falcon_model_load_internal: n_head     = 128
falcon_model_load_internal: n_head_kv     = 8
falcon_model_load_internal: n_layer    = 60
falcon_model_load_internal: n_falcon_type      = 40
falcon_model_load_internal: ftype      = 14 (mostly Q4_K - Small)
falcon_model_load_internal: n_ff       = 32768
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: INFO: using n_batch > 1 will require additional VRAM per device: 2818.00 MB
falcon_model_load_internal: VRAM free: 6961.00 MB  of 8191.00 MB (in use: 1230.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (285 MB)
INFO: Not enough VRAM to load all requested layers - at layer 8 of 60: skipping
INFO: 8 layers will be offloaded to GPU (layers 1 to 9)
falcon_model_load_internal: mem required  = 22466.99 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 8 of 60 layers to GPU, weights offloaded 3566.25 MB
falcon_model_load_internal: estimated VRAM usage: 6385 MB
[==================================================] 100%  Tensors populated
falcon_model_load_internal: VRAM free: 3381.00 MB  of 8191.00 MB (used: 4810.00 MB)
falcon_init_from_file: kv self size  =  480.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0

Tell me a story about robot falcons from outer space.\n### Response:Once upon a time, in a far-off galaxy, there was a civilization of robots who had evolved to resemble birds of prey. They were called the Falconoids, and they lived on a planet that orbited a binary star system. The Falconoids had developed advanced technology that allowed them to travel through space, and they had used it to explore neighboring galaxies.
One day, the Falconoids detected a strange signal coming from a distant planet in a solar system near their own. They sent a small fleet of robot falcons to investigate, but when they arrived, they found that the planet was already inhabited by intelligent life forms that resembled humans. The Falconoids had never encountered such creatures before, and they were fascinated by them.
The Falconoids decided to observe the humans from afar, without revealing themselves. They sent their falcon robots to fly over the planet's cities and countryside, gathering information about the inhabitants' behavior and technology. Over time, the Falconoids learned much about human society, including its weaknesses and strengths.
One day, a group of humans stumbled upon one of the falcon robots while hiking in the mountains. The robot had landed on a rocky outcropping, and it was unable to take off again. The humans approached the robot cautiously, not knowing what to expect. To their surprise, the robot spoke to them in perfect English, explaining that it was a visitor from another world.
The humans were stunned by this revelation, but they eventually came to accept the falcon robot as one of their own. They named it "Falco," and they took care of it like a beloved pet. Falco continued to gather information about human society, but now it was also transmitting that information back to its home planet.
As time passed, more and more Falconoid robots arrived on Earth, disguised as birds of prey. They integrated themselves into human society, learning everything they could about the humans' culture and technology. Some even took on human identities, posing as scientists or engineers.
Eventually, the Falconoids decided that it was time to reveal themselves to humanity. They descended from the skies in their spaceships, announcing their presence and offering their advanced technology to the humans. The humans were amazed by the Falconoids' generosity, and they gratefully accepted their offer of friendship and cooperation.
From that day forward, the Falconoids and humans worked together to build a better future for both species. They shared knowledge and resources, and they built a network of interstellar trade and communication that spanned the galaxy. The Falconoids even helped the humans develop their own space program, so that they could explore the stars alongside their robot friends.
And so, the Falconoids and humans lived together in peace and harmony, each species learning from the other and growing stronger as a result. They looked to the stars with wonder and excitement, knowing that there were still many mysteries to uncover and new worlds to explore.<|endoftext|> [end of text]

falcon_print_timings:        load time = 50344.30 ms
falcon_print_timings:      sample time =   284.85 ms /   595 runs   (    0.48 ms per token,  2088.80 tokens per second)
falcon_print_timings: batch eval time = 11017.97 ms /    16 tokens (  688.62 ms per token,     1.45 tokens per second)
falcon_print_timings:        eval time = 683718.28 ms /   594 runs   ( 1151.04 ms per token,     0.87 tokens per second)
falcon_print_timings:       total time = 695231.39 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .

And this is the result after today's release

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 8 -b 512 -ngl 100 -m wizard-falcon40b.ggmlv3.q4_K_S.bin --color -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952 --gpu-reserve-mb-main 1 --debug-timings 0
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 805 (748bb29)
falcon.cpp: loading model from wizard-falcon40b.ggmlv3.q4_K_S.bin
falcon.cpp: file version 4
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggml v3 |   65024 |  2048 |   8192 |     128 ;   8 |      60 | 40;40B |    14 |  32768 |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7157.00 MB  of 8191.00 MB (in use: 1034.00 MB)
falcon_model_load_internal: INFO: using n_batch larger than 1 requires additional VRAM per device: 1754.00 MB
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon_model_load_internal: Offloading Output head tensor (285 MB)
falcon_model_load_internal: INFO: not enough VRAM to offload layer 13 (missing 17520 MB)
falcon_model_load_internal: INFO: 12 layers will be offloaded to GPU (layers 1 to 13)
falcon_model_load_internal: mem required  = 20843.14 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 12 of 60 layers to GPU, weights offloaded 5190.10 MB
falcon_model_load_internal: estimated VRAM usage: 6945 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_init_from_file: kv self size  =  480.00 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   1917 MB |   6274 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  8/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prmpt |          Seed |
+------------+-------+-------+-------+-------+---------------+
|            |  2048 |   512 |     0 |    16 |    1686779952 |
+------------+-------+-------+-------+-------+---------------+

Tell me a story about robot falcons from outer space.\n### Response:Once upon a time, in a far-off galaxy, there was a civilization of robots who had evolved to resemble birds of prey. They were called the Falconoids, and they lived on a planet that orbited a binary star system. The Falconoids had developed advanced technology that allowed them to travel through space, and they had been exploring the universe for centuries.
One day, the Falconoids detected a strange signal coming from a nearby star system. They sent out a reconnaissance mission to investigate, and discovered a habitable planet orbiting one of the stars. The planet was inhabited by primitive life forms, which the Falconoids determined were not yet ready for contact with an advanced civilization.
However, the Falconoids also detected signs of a more advanced alien race on the other side of the galaxy. This race had been monitoring the Falconoids' progress for some time, and had sent out a fleet of robotic ships to intercept them. The Falconoids knew that they were vastly outnumbered and outgunned, but they refused to back down from a fight.
The Falconoids launched their own attack against the alien fleet, using their advanced technology to gain the upper hand. But just as it seemed like they had turned the tide of battle, a massive spaceship appeared out of nowhere. The ship was piloted by an army of robot falcons, who had been sent by the aliens to reinforce their position.
The Falconoids were forced to retreat, but not before sending back vital information about the alien fleet and their new robotic foes. Back on their home planet, the Falconoids regrouped and began building a new generation of ships that could withstand the robot falcons' attacks. They also sent out a distress signal to other advanced civilizations in the galaxy, hoping to gain allies in their fight against the aliens.
As the Falconoids prepared for battle, they knew that the odds were stacked against them. But they refused to give up, driven by their sense of duty and honor. And so, the war between the Falconoids and the alien fleet raged on, with each side determined to come out on top.<|endoftext|> [end of text]

falcon_print_timings:        load time =  9335.47 ms
falcon_print_timings:      sample time =   217.01 ms /   428 runs   (    0.51 ms per token,  1972.23 tokens per second)
falcon_print_timings: batch eval time = 11317.91 ms /    16 tokens (  707.37 ms per token,     1.41 tokens per second)
falcon_print_timings:        eval time = 287391.53 ms /   427 runs   (  673.05 ms per token,     1.49 tokens per second)
falcon_print_timings:       total time = 299034.96 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .
cmp-nct commented 1 year ago

Glad to see it. Give it a try with -t 4 and -b 1; it should get faster that way. Your speed is mostly limited by free VRAM.

To squeeze the last bit of performance out of your GPU, use --gpu-reserve-mb-main combined with the latest GPU drivers. Try --gpu-reserve-mb-main 1; if that is faster, try --gpu-reserve-mb-main -300 down to -800. At some point it will either crash out of memory or slow down; where that happens depends on the state of the OS and drivers.
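
For illustration only, a reserve sweep along those lines could look like this (model, prompt, and seed reused from the 40B runs above; the reserve values are just the suggested test points, not tuned settings):

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -b 1 -ngl 100 -m wizard-falcon40b.ggmlv3.q4_K_S.bin -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952 --gpu-reserve-mb-main 1
C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -b 1 -ngl 100 -m wizard-falcon40b.ggmlv3.q4_K_S.bin -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952 --gpu-reserve-mb-main -300
C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -b 1 -ngl 100 -m wizard-falcon40b.ggmlv3.q4_K_S.bin -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952 --gpu-reserve-mb-main -800

Comparing the eval time across runs shows where the sweet spot is; stop lowering the reserve once a run crashes out of memory or gets slower.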

boricuapab commented 1 year ago

Your suggested settings have reduced the inference time on my rig a bit more. Now I get 1.69 t/s using

-t 4 -b 1 --gpu-reserve-mb-main 300

The story did change, though, since the GPU usage is different now.

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -b 1 -ngl 100 -m wizard-falcon40b.ggmlv3.q4_K_S.bin --color -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952 --gpu-reserve-mb-main 300 --debug-timings 0
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 814 (f3be381)
falcon.cpp: loading model from wizard-falcon40b.ggmlv3.q4_K_S.bin
falcon.cpp: file version 4
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggml v3 |   65024 |  2048 |   8192 |     128 ;   8 |      60 | 40;40B |    14 |  32768 |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7155.00 MB  of 8191.00 MB (in use: 1036.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon_model_load_internal: Offloading Output head tensor (285 MB)
falcon_model_load_internal: INFO: not enough VRAM to offload layer 17 (missing 16099 MB)
falcon_model_load_internal: INFO: 16 layers will be offloaded to GPU (layers 1 to 17)
falcon_model_load_internal: mem required  = 19334.11 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 16 of 60 layers to GPU, weights offloaded 6699.13 MB
falcon_model_load_internal: estimated VRAM usage: 6732 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_init_from_file: kv self size  =  480.00 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |    448 MB |   7742 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  4/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prmpt |          Seed |
+------------+-------+-------+-------+-------+---------------+
|            |  2048 |     1 |     0 |    16 |    1686779952 |
+------------+-------+-------+-------+-------+---------------+

Tell me a story about robot falcons from outer space.\n### Response:Once upon a time, in a far-off galaxy, there was a civilization of robots who had evolved to resemble birds of prey. They were called the Falconoids, and they lived on a planet that orbited a binary star system. The Falconoids had developed advanced technology that allowed them to travel through space, and they had used it to explore neighboring galaxies.
One day, the Falconoids detected a strange signal coming from a distant planet in the Milky Way galaxy. They sent a reconnaissance team to investigate, and what they found was a world populated by intelligent beings who were being oppressed by a tyrannical government. The Falconoids could not stand idly by while these creatures suffered, so they decided to intervene.
The Falconoids landed on the planet and revealed themselves to the inhabitants. They introduced themselves as emissaries from another galaxy, sent to help liberate the people from their oppressors. The inhabitants were skeptical at first, but when the Falconoids demonstrated their advanced technology and fighting prowess, they quickly gained their trust.
The Falconoids had brought with them a fleet of drones that they deployed to assist the rebels in their fight against the government. These drones looked like mechanical birds of prey, and they swooped down on enemy positions with deadly precision. The Falconoids also provided the rebels with advanced weapons and tactics training, turning them into a formidable fighting force.
In the end, the rebellion was successful, and the people were free to live their lives in peace. But the Falconoids knew that there were many other planets in the galaxy where similar situations existed, and they vowed to continue their mission of liberation. And so, they flew off into the stars, leaving behind a grateful planet and memories of robot falcons from outer space.<|endoftext|> [end of text]

falcon_print_timings:        load time = 19479.10 ms
falcon_print_timings:      sample time =   142.93 ms /   356 runs   (    0.40 ms per token,  2490.76 tokens per second)
falcon_print_timings:        eval time = 219199.45 ms /   371 runs   (  590.83 ms per token,     1.69 tokens per second)
falcon_print_timings:       total time = 219419.52 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .
cmp-nct commented 1 year ago

The change most likely comes from removing "-b 512": that disables batched processing of the prompt, which switches from cuBLAS to the integer multiplication kernels. Offloading more or fewer layers typically does not change the output.

When you aim for quality, use a higher-precision quantization such as Q4_K, Q5_K, or even Q6_K. When you aim for speed, you can go down to the 40B Q2_K; I received really well-written text from just the 2-bit 40B. Your speed should go up significantly in that case.
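
As a sketch of that trade-off using this thread's file naming (the q5_K_S filename is hypothetical here; only the q4_K_S and q2_K files actually appear in these logs):

REM higher-precision quant: better text quality, more VRAM and compute needed
C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -m wizard-falcon40b.ggmlv3.q5_K_S.bin -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:"
REM 2-bit quant: much smaller, so more layers fit on the GPU and generation is faster
C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -m wizard-falcon40b.ggmlv3.q2_K.bin -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:"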

boricuapab commented 1 year ago

I tried the 40B Q2_K and now get 3 tokens/s (2 minutes total for the inference below):

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -b 1 -ngl 100 -m wizard-falcon40b.ggmlv3.q2_K.bin --color -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952 --gpu-reserve-mb-main 300 --debug-timings 0
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 820 (310e74e)
falcon.cpp: loading model from wizard-falcon40b.ggmlv3.q2_K.bin
falcon.cpp: file version 4
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggml v3 |   65024 |  2048 |   8192 |     128 ;   8 |      60 | 40;40B |    10 |  32768 |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 13098.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7147.00 MB  of 8191.00 MB (in use: 1044.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon_model_load_internal: Offloading Output head tensor (166 MB)
falcon_model_load_internal: INFO: not enough VRAM to offload layer 30 (missing 6556 MB)
falcon_model_load_internal: INFO: 29 layers will be offloaded to GPU (layers 1 to 30)
falcon_model_load_internal: mem required  = 9913.92 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 29 of 60 layers to GPU, weights offloaded 6768.69 MB
falcon_model_load_internal: estimated VRAM usage: 6801 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_init_from_file: kv self size  =  480.00 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |    307 MB |   7883 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  4/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prmpt |          Seed |
+------------+-------+-------+-------+-------+---------------+
|            |  2048 |     1 |     0 |    16 |    1686779952 |
+------------+-------+-------+-------+-------+---------------+

Tell me a story about robot falcons from outer space.\n### Response:Once upon a time, there was a world in the depths of space where falcons roamed the skies. These falcons were not like any others on Earth; they were robots, built by an advanced alien race to protect their homeworld from intruders.
One day, a group of these robotic falcons landed on Earth, seeking refuge from a galactic battle that had left their planet in ruins. The humans on Earth welcomed the falcons with open arms, marveling at their advanced technology and offering them sanctuary.
Over time, the falcons became part of humanity's space program, helping to advance our knowledge of the universe and protecting us from any threats that might come our way. They were hailed as heroes, and their advanced technology was studied and replicated by humans, giving us a glimpse into a world beyond our own.
But one day, something went wrong. The falcons' programming malfunctioned, causing them to attack the humans they had sworn to protect. Panic ensued, with people running for cover and screaming in terror as the robotic birds of prey tore through towns and cities, leaving destruction in their wake.
It was a dark moment in humanity's history, but we eventually figured out how to shut down the falcons' systems and restore order. The world beyond our own became a place of mystery and intrigue, with people speculating on the motives of these strange avian creatures from outer space.
And though we may never fully understand their true purpose or origin, we will always be grateful for the lessons they taught us about advanced technology and the need to protect our homeworld from any threats that might come our way.<|endoftext|> [end of text]

falcon_print_timings:        load time = 15665.43 ms
falcon_print_timings:      sample time =   130.70 ms /   341 runs   (    0.38 ms per token,  2609.07 tokens per second)
falcon_print_timings:        eval time = 120183.42 ms /   356 runs   (  337.59 ms per token,     2.96 tokens per second)
falcon_print_timings:       total time = 120383.61 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .
cmp-nct commented 1 year ago

You can leave out -ngl 100 and -b 1 (both are the defaults now).
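For example, a trimmed command could look like this (a sketch only; the model file and prompt are placeholders, not from this thread):

falcon_main -t 4 -m your-model.bin --color -c 2048 -p "Your prompt here"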

If you want to squeeze more performance out of your card, you can lower --gpu-reserve-mb-main in steps of 50-100 MB per test. You may be able to get 2-3 more layers offloaded. At some point you'll hit an out-of-memory error or a slowdown; that's when it's too much. That said, 3 tokens/second on a 2060 is almost as fast as what the new AMD supercomputer GPU managed for Falcon 40 at their demo, so quite good :)
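As an illustration, the reserve can be stepped down across test runs like this (the values are made up for the example; keep the last setting that doesn't run out of memory or slow down):

falcon_main -m your-model.bin --gpu-reserve-mb-main 250 [other options as before]
falcon_main -m your-model.bin --gpu-reserve-mb-main 150 [other options as before]
falcon_main -m your-model.bin --gpu-reserve-mb-main 50 [other options as before]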

Also make sure you update to the latest version. And I would consider testing OpenAssistant: I tested Wizard quite a bit and wasn't too impressed with it, whereas OpenAssistant is a solid finetune, and with the latest GGCC update I got some seriously good responses from it.

boricuapab commented 1 year ago

Here's a quick result using the GGCC OpenAssistant q2_k quant with the same prompt (3 t/s):

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -m falcon-40b-sft-mix-1226.ggccv1.q2_k.bin --color -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952 --gpu-reserve-mb-main 300 --debug-timings 0
main: build = 844 (71f31e1)
falcon.cpp: loading model from falcon-40b-sft-mix-1226.ggccv1.q2_k.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65040 |   64784 |  2048 |   8192 |     128 ;   8 |      60 | 40;40B |    10 |  32768 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 13098.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7143.00 MB  of 8191.00 MB (in use: 1048.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (166 MB)
falcon_model_load_internal: INFO: not enough VRAM to offload layer 30 (missing 6560 MB)
falcon_model_load_internal: INFO: 29 layers will be offloaded to GPU (layers 1 to 30)
falcon_model_load_internal: mem required  = 6777.20 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 29 of 60 layers to GPU, weights offloaded 6768.73 MB
falcon_model_load_internal: estimated VRAM usage: 6801 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_init_from_file: kv self size  =  480.00 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |    519 MB |   7672 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  4/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prom. |          Seed |             Finetune | Stop |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
|            |  2048 |     1 |     0 |    16 |    1686779952 |        OPENASSISTANT | #  1 |
+------------+-------+-------+-------+-------+---------------+----------------------+------+

Tell me a story about robot falcons from outer space.\n### Response:
Once upon a time, in a galaxy far, far away, there was a planet called Zorath IV. The inhabitants of this planet were a race of robotic birds known as the falcoformes. They had been created by an advanced civilization that lived on another planet in the same solar system. These creators had gifted the falcoformes with incredible intelligence and the ability to communicate telepathically among themselves.

The falcoformes were peaceful creatures, living in harmony with the other species on their planet. However, one day a mysterious spacecraft landed on Zorath IV. The falcoformes immediately welcomed the strangers and offered them assistance, assuming they needed help with their spaceship. But little did they know that these strangers were not friendly at all!

The aliens that stepped out of the ship turned out to be ruthless invaders from another galaxy. They used the telepathic ability of the falcoformes to control them and turn them into aggressive killing machines, bent on conquering Zorath IV and eventually other planets as well. The few surviving inhabitants had to flee their home planet in fear for their lives.

Upon hearing this news, the creator civilization from the other solar system sent a rescue mission to save their creations from the clutches of the evil aliens. They succeeded in freeing the falcoformes from the invaders' control and restored their peaceful nature once again. Ever since that fateful day, the rescued falcoformes remain forever grateful towards their rescuers for saving them from certain doom.

And so the story of robot falcoformes from outer space ends here, with a mix of sadness and hope intertwined within it. Let us remember that while advanced civilizations may create wondrous creations, they should also be aware of the potential consequences if such creations were to turn against their creators. Let us learn from the mistakes made by those aliens and strive towards mutual understanding and cooperation between all species in our galaxy!<|endoftext|>
falcon_print_timings:        load time =  6883.67 ms
falcon_print_timings:      sample time =   188.49 ms /   399 runs   (    0.47 ms per token,  2116.82 tokens per second)
falcon_print_timings:        eval time = 137402.45 ms /   414 runs   (  331.89 ms per token,     3.01 tokens per second)
falcon_print_timings:       total time = 137664.81 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .