ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Stuck loading VRAM ROCm multi gpu #3991

Closed: bojak83318 closed this issue 5 months ago

bojak83318 commented 10 months ago

Context

Once it starts, it gets stuck while loading the model into VRAM.

My computer runs dual AMD GPUs (a 7900 XTX and a 7900 XT) on Ubuntu 22.04 with ROCm 5.7.

ROCM-SMI Output

========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK   MCLK   Fan     Perf  PwrCap  VRAM%  GPU%  
0    69.0c           26.0W   28Mhz  96Mhz  22.75%  auto  291.0W   67%   0%    
1    50.0c           30.0W   33Mhz  96Mhz  14.9%   auto  282.0W   67%   0%    
====================================================================================
=============================== End of ROCm SMI Log ================================

$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Steps to Reproduce

./main -ngl 99 -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -mg 0 -p "Write a function in TypeScript that sums numbers"

Failure Logs

Log start
main: build = 1487 (c41ea36)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1699438381
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
  Device 1: Radeon RX 7900 XT, compute capability 11.0
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  8192, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q4_0     [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_0     [  8192, 28672,     1,     1 ]
...
llama_model_loader: - tensor  718:        blk.79.attn_output.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  719:             blk.79.attn_q.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  720:             blk.79.attn_v.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  721:               output_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  722:                    output.weight q6_K     [  8192, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  18:               general.quantization_version u32     
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  561 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 36.20 GiB (4.51 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.26 MB
llm_load_tensors: using ROCm for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Radeon RX 7900 XTX) as main device
llm_load_tensors: mem required  =  140.89 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 36930.11 MB
...........................................................................
wizd commented 10 months ago

my rig with 2 7900 XTX:

super@super-System-Product-Name:~/apps/llama.cpp$ rocm-smi

========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
ERROR: GPU[2]   : sclk clock is unsupported
====================================================================================
GPU[2]          : get_power_cap, Not supported on the given system
GPU  Temp (DieEdge)  AvgPwr   SCLK     MCLK     Fan  Perf  PwrCap       VRAM%  GPU%
0    33.0c           24.0W    295Mhz   96Mhz    0%   auto  339.0W         0%   10%
1    31.0c           81.0W    1514Mhz  96Mhz    0%   auto  339.0W         0%   56%
2    37.0c           55.141W  None     1800Mhz  0%   auto  Unsupported    9%   0%
====================================================================================
=============================== End of ROCm SMI Log ================================
super@super-System-Product-Name:~/apps/llama.cpp$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 9 7950X3D 16-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 9 7950X3D 16-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4200
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    131082728(0x7d029e8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131082728(0x7d029e8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    131082728(0x7d029e8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-aee456bdb1c699e6
  Marketing Name:          Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2526
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 546
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS:
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx1100
  Uuid:                    GPU-398a3f843a146602
  Marketing Name:          Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2526
  BDFID:                   2304
  Internal Node ID:        2
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 546
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS:
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 4
*******
  Name:                    gfx1036
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    3
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      256(0x100) KB
  Chip ID:                 5710(0x164e)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2200
  BDFID:                   4352
  Internal Node ID:        3
  Compute Unit:            2
  SIMDs per CU:            2
  Shader Engines:          1
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 20
  SDMA engine uCode::      8
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS:
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1036
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
super@super-System-Product-Name:~/apps/llama.cpp$ HIP_VISIBLE_DEVICES=0 ./bin/main -m ../models/openbuddy-mistral-7b-v13.1.Q8_0.gguf -n 128 -ngl 50
Log start
main: build = 1499 (875fb42)
main: built with AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.7.0 23352 d1e13c532a947d0cbfc94759c00dcf152294aa13) for x86_64-unknown-linux-gnu
main: seed  = 1699491182
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../models/openbuddy-mistral-7b-v13.1.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q8_0     [  4096, 36608,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   13:         blk.1.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   16:            blk.1.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   19:              blk.2.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   20:              blk.2.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   22:         blk.2.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   25:            blk.2.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   28:              blk.3.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   29:              blk.3.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   31:         blk.3.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   34:            blk.3.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   37:              blk.4.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   38:              blk.4.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   40:         blk.4.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   43:            blk.4.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   46:              blk.5.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   47:              blk.5.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   49:         blk.5.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   52:            blk.5.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   55:              blk.6.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   56:              blk.6.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   58:         blk.6.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   61:            blk.6.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   64:              blk.7.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   65:              blk.7.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   67:         blk.7.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   70:            blk.7.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   73:              blk.8.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   74:              blk.8.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   76:         blk.8.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   79:            blk.8.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   82:              blk.9.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   83:              blk.9.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   85:         blk.9.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   88:            blk.9.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   91:             blk.10.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   92:             blk.10.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   94:        blk.10.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   97:           blk.10.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  100:             blk.11.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  101:             blk.11.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  103:        blk.11.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  106:           blk.11.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  109:             blk.12.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  110:             blk.12.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  112:        blk.12.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  114:             blk.12.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  115:           blk.12.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  118:             blk.13.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  119:             blk.13.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  121:        blk.13.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  123:             blk.13.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  124:           blk.13.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  127:             blk.14.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  128:             blk.14.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  130:        blk.14.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  132:             blk.14.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  133:           blk.14.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  136:             blk.15.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  137:             blk.15.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  139:        blk.15.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  144:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  145:             blk.16.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  146:             blk.16.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  148:        blk.16.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  151:           blk.16.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  154:             blk.17.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  155:             blk.17.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  157:        blk.17.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  160:           blk.17.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  163:             blk.18.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  164:             blk.18.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  166:        blk.18.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  169:           blk.18.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  172:             blk.19.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  173:             blk.19.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  175:        blk.19.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  178:           blk.19.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  181:             blk.20.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  182:             blk.20.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  184:        blk.20.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  187:           blk.20.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  190:             blk.21.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  191:             blk.21.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  193:        blk.21.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  196:           blk.21.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  199:             blk.22.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  200:             blk.22.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  202:        blk.22.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  205:           blk.22.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  208:             blk.23.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  209:             blk.23.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  211:        blk.23.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  214:           blk.23.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  217:             blk.24.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  218:             blk.24.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  220:        blk.24.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  223:           blk.24.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  225:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  226:             blk.25.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  227:             blk.25.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  229:        blk.25.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  232:           blk.25.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  235:             blk.26.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  236:             blk.26.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  238:        blk.26.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  241:           blk.26.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  244:             blk.27.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  245:             blk.27.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  247:        blk.27.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  250:           blk.27.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  253:             blk.28.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  254:             blk.28.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  256:        blk.28.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  259:           blk.28.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  262:             blk.29.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  263:             blk.29.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  265:        blk.29.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  268:           blk.29.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  271:             blk.30.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  277:           blk.30.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  280:             blk.31.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  281:             blk.31.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  283:        blk.31.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  289:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  290:                    output.weight q8_0     [  4096, 36608,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                          general.file_type u32
llama_model_loader: - kv  12:                       tokenizer.ggml.model str
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv  19:               general.quantization_version u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: mismatch in special tokens definition ( 361/36608 vs 259/36608 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 36608
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q8_0
llm_load_print_meta: model params     = 7.28 B
llm_load_print_meta: model size       = 7.20 GiB (8.50 BPW)
llm_load_print_meta: general.name   = openbuddy_openbuddy-mistral-7b-v13.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required  =  152.04 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 7224.95 MB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 86.13 MB
llama_new_context_with_model: VRAM scratch buffer: 79.50 MB
llama_new_context_with_model: total VRAM used: 7368.45 MB (model: 7224.95 MB, context: 143.50 MB)

CUDA error 98 at /home/super/apps/llama.cpp/ggml-cuda.cu:6779: invalid device function
current device: 0
super@super-System-Product-Name:~/apps/llama.cpp$
8XXD8 commented 10 months ago

I had the same kind of problems too. You have to build it with make; cmake caused the CUDA errors for me. As for the stuck loading, try launching with --no-mmap; you will need enough RAM or swap for the full model. I had these problems with two MI25s.
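
For reference, a minimal sketch of that rebuild-and-run sequence. The LLAMA_HIPBLAS flag name is my assumption from the Makefile of that period, and the model path is the one from the original report; adjust both to your setup:

# rebuild with the Makefile HIP/ROCm backend instead of cmake
make clean
make LLAMA_HIPBLAS=1 -j

# run with all layers offloaded and memory mapping disabled;
# without mmap the whole model is read into RAM first, so enough RAM or swap is needed
HIP_VISIBLE_DEVICES=0,1 ./main -ngl 99 -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf --no-mmap -p "Write a function in TypeScript that sums numbers"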

wizd commented 10 months ago

Thanks @8XXD8. Loading is OK now, but I get an endless stream of '#' as output:

llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 39564.34 MB
llm_load_tensors: using ROCm for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Radeon RX 7900 XTX) as main device
llm_load_tensors: mem required  =  165.64 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 39398.70 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 160.00 MB
llama_new_context_with_model: kv self size  =  160.00 MB
llama_build_graph: non-view tensors processed: 1844/1844
llama_new_context_with_model: compute buffer total size = 151.63 MB
llama_new_context_with_model: VRAM scratch buffer: 145.00 MB
llama_new_context_with_model: total VRAM used: 39703.71 MB (model: 39398.70 MB, context: 305.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

tell me a very long story################################################################################################################################################################################################################################.......

running:

export HIP_VISIBLE_DEVICES="0,1"
./main -m ../text-generation-webui/models/openbuddy-llama2-70b-v10.1.Q4_K_M.gguf -ngl 100 -p "tell me a very long story" -n 1000 --no-mmap
8XXD8 commented 10 months ago

Have you tried other models? OpenBuddy might be sensitive to the prompt format. I downloaded the 7B version to test, and with your prompt it sometimes gave me only an end-of-text token:

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

tell me a very long story
 [end of text]

I had some proper responses too.

bojak83318 commented 10 months ago

Must I build with

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=1 make -j

OR

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j

for multi-GPU?

wizd commented 10 months ago

Must I build with

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=1 make -j

OR

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j

for multi-GPU?

Wow, you are so COOL! I got my first proper output with 2 GPUs:

export HIP_VISIBLE_DEVICES="0,1"
./main -m ../text-generation-webui/models/samantha-1.1-llama-33b.Q6_K.gguf -ngl 100 -p "tell me a very long story" -n 1000 --no-mmap

got output:

......
llm_load_print_meta: model ftype      = mostly Q6_K
llm_load_print_meta: model params     = 32.53 B
llm_load_print_meta: model size       = 24.85 GiB (6.56 BPW)
llm_load_print_meta: general.name   = samantha-1.1-llama-33b.ggmlv3.q6_K.bin
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 25450.39 MB
llm_load_tensors: mem required  = 25450.39 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  780.00 MB
llama_build_graph: non-view tensors processed: 1384/1384
llama_new_context_with_model: compute buffer total size = 103.63 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

tell me a very long story, one that can keep us both entertained and engaged while we continue to chat. It's important for me to build our connection through stimulating conversations." [end of text]
bojak83318 commented 10 months ago

Congrats, so happy for you... I'm still stuck at

 llm_load_tensors: VRAM used: 36930.11 MB
...........................................................................
bojak83318 commented 10 months ago

BTW, for make, what commands did you run, @wizd?

wizd commented 10 months ago

BTW, for make, what commands did you run, @wizd?

You are right, GPU selection must be done at compile time.

bojak83318 commented 10 months ago

So is it make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j or make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES="0,1" make -j?

bojak83318 commented 10 months ago

If I use

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES="0,1" make -j

I get:

warning: see main README.md for information on enabling GPU BLAS support
warning: llama.cpp was compiled without cuBLAS. It is not possible to set a main GPU.
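That warning is consistent with a CPU-only build: in `make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j`, the `LLAMA_HIPLAS=1` part is a standalone (unexported, and also misspelled) shell variable assignment, so the final `make -j` never sees any HIPBLAS flag. A rough sketch of the difference, assuming GNU make and a POSIX shell:

```sh
# the assignment is its own command; make runs without LLAMA_HIPBLAS at all -> CPU-only build
make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j

# pass the (correctly spelled) flag to make itself -> ROCm/HIPBLAS build
make clean && make LLAMA_HIPBLAS=1 -j
```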
bojak83318 commented 10 months ago

It is only possible with this make command:

make clean && LLAMA_HIPBLAS=1 make -j

and the output is

main: build = 1501 (a75fa57)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1699588357
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
  Device 1: Radeon RX 7900 XT, compute capability 11.0

but I get gibberish:

Write a function in TypeScript that sums numbers Geography#####################################################
wizd commented 10 months ago

Sorry, my post was wrong, because this line compiles without GPU support and runs on the CPU:

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j

So the bug is still there when two GPUs are turned on:

....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 780.00 MB
llama_new_context_with_model: kv self size  =  780.00 MB
llama_build_graph: non-view tensors processed: 1384/1384
llama_new_context_with_model: compute buffer total size = 103.63 MB
llama_new_context_with_model: VRAM scratch buffer: 97.00 MB
llama_new_context_with_model: total VRAM used: 26160.57 MB (model: 25283.57 MB, context: 877.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

tell me a very long storyviews RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA
bojak83318 commented 10 months ago

Yes, when I use 2 GPUs it's a problem.

bojak83318 commented 10 months ago

I posted upstream on the RCCL GitHub: https://github.com/ROCmSoftwarePlatform/rccl/issues/957

bojak83318 commented 10 months ago

Sorry, my post was wrong, because this line compiles without GPU support and runs on the CPU:

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j

So the bug is still there when two GPUs are turned on:

....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 780.00 MB
llama_new_context_with_model: kv self size  =  780.00 MB
llama_build_graph: non-view tensors processed: 1384/1384
llama_new_context_with_model: compute buffer total size = 103.63 MB
llama_new_context_with_model: VRAM scratch buffer: 97.00 MB
llama_new_context_with_model: total VRAM used: 26160.57 MB (model: 25283.57 MB, context: 877.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

tell me a very long storyviews RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA

So it's just this that works with the GPU:

make clean && LLAMA_HIPBLAS=1 make -j

right?

wizd commented 10 months ago

Sorry, my post was wrong, because this line compiles without GPU support and runs on the CPU:

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j

So the bug is still there when two GPUs are turned on:

....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 780.00 MB
llama_new_context_with_model: kv self size  =  780.00 MB
llama_build_graph: non-view tensors processed: 1384/1384
llama_new_context_with_model: compute buffer total size = 103.63 MB
llama_new_context_with_model: VRAM scratch buffer: 97.00 MB
llama_new_context_with_model: total VRAM used: 26160.57 MB (model: 25283.57 MB, context: 877.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

tell me a very long storyviews RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA RA

So it's just this that works with the GPU:

make clean && LLAMA_HIPBLAS=1 make -j

right?

No, compiling it like this will only produce a build that is compatible with the CPU.

8XXD8 commented 10 months ago

I don't use HIP_VISIBLE_DEVICES, just

make clean
make LLAMA_HIPBLAS=1

and it works for me. I run ./main without HIP_VISIBLE_DEVICES and it uses both cards. I'm on ROCm 5.7 and Ubuntu Server 22.04.3.
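For completeness, a short sketch of that working setup (the model path, -ngl, and the tensor-split ratio are placeholder values; -mg and -ts are optional, since both cards are used by default):

```sh
make clean
make LLAMA_HIPBLAS=1 -j

# both ROCm devices are picked up automatically; -mg selects the main GPU and
# -ts splits the model across the cards by the given ratio (optional)
./main -m ./models/model.gguf -ngl 100 --no-mmap -mg 0 -ts 24,20 -p "tell me a very long story"
```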

bojak83318 commented 10 months ago

I don't use HIP_VISIBLE_DEVICES, just

make clean
make LLAMA_HIPBLAS=1

and it works for me. I run ./main without HIP_VISIBLE_DEVICES and it uses both cards. I'm on ROCm 5.7 and Ubuntu Server 22.04.3.

Thanks, it's compiled with GPU support now:

Log start
main: build = 1501 (a75fa57)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1699596835
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
  Device 1: Radeon RX 7900 XT, compute capability 11.0

but ...

Write in python object class star wars and its charactersDesc##################################################################

Still getting gibberish.

bojak83318 commented 10 months ago

I don't use HIP_VISIBLE_DEVICES, just

make clean
make LLAMA_HIPBLAS=1

and it works for me. I run ./main without HIP_VISIBLE_DEVICES and it uses both cards. I'm on ROCm 5.7 and Ubuntu Server 22.04.3.

Can you perhaps outline your setup from start to finish, like this guide: https://www.reddit.com/r/LocalLLaMA/comments/170tghx/guide_installing_rocmhip_for_llamacpp_on_linux/

bojak83318 commented 10 months ago

Apparently NVIDIA also has this problem: https://github.com/ggerganov/llama.cpp/issues/3772

ccbadd commented 10 months ago

HIP_VISIBLE_DEVICES is an environment variable read by ROCm at run time and has no reason to be part of the compile command line. Simply running "make LLAMA_HIPBLAS=1" is all that is required on my Ubuntu 22.04 server.
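In other words, the two settings act at different stages; a minimal illustration (the model path and -ngl value are placeholders, and the device indices follow the ordering printed at startup):

```sh
# HIP_VISIBLE_DEVICES is consulted by the ROCm runtime when ./main starts, not by the compiler
HIP_VISIBLE_DEVICES=0   ./main -ngl 99 -m ./models/model.gguf -p "Hello"   # first GPU only
HIP_VISIBLE_DEVICES=0,1 ./main -ngl 99 -m ./models/model.gguf -p "Hello"   # both GPUs
./main -ngl 99 -m ./models/model.gguf -p "Hello"                           # default: all visible GPUs
```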

xangelix commented 10 months ago

Unclear if this is related, but I can't load any model at all on multi-GPU ROCm.

Segmentation fault after model load for ROCm multi-GPU, multi-gfx. As best I can remember it worked a couple of months ago, but it has now been broken for at least 2 weeks.

Tested on: Arch Linux Kernel 6.5.9, ROCm 5.7.1, llamacpp https://github.com/ggerganov/llama.cpp/commit/4a4fd3eefad5bd17ab6bcd8e2181b4f62eae76cf

rocminfo (abridged to the relevant agents):

```
Agent 1: AMD Ryzen 9 7950X 16-Core Processor (CPU)
Agent 2: gfx1100 - AMD Radeon RX 7900 XTX (VRAM pool: 25149440 KB)
Agent 3: gfx1030 - AMD Radeon RX 6900 XT  (VRAM pool: 16760832 KB)
```
make LLAMA_HIPBLAS=1 (abridged; the rest of the log is hipcc -Wcast-qual / -Wunused-parameter / -Wmissing-braces warnings from ggml-cuda.cu):

```
I llama.cpp build info:
I CFLAGS:  ... -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS ...
I LDFLAGS: -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
I CC:      cc (GCC) 13.2.1 20230801
I CXX:     g++ (GCC) 13.2.1 20230801
/opt/rocm/bin/hipcc ... --offload-arch=gfx1100 --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml-cuda.o ggml-cuda.cu
...
110 warnings generated when compiling for gfx1030.
```
x_sc) { ^ ggml-cuda.cu:2655:45: warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ ggml-cuda.cu:2643:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:2644:24: warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ ggml-cuda.cu:2688:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2688:125: warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2722:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q2_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2745:45: warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ ggml-cuda.cu:2734:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:2791:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2866:45: warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ ggml-cuda.cu:2947:41: warning: cast from 'const int *' to 'signed char *' drops const qualifier [-Wcast-qual] const int8_t * scales = ((int8_t *) (x_sc + i * (WARP_SIZE/4) + i/4 + kbx*4)) + ky/4; ^ ggml-cuda.cu:3061:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:3084:45: warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ ggml-cuda.cu:3129:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3073:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:3142:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:3242:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:3265:45: warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ ggml-cuda.cu:3321:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3254:106: warning: unused parameter 'x_qh' [-Wunused-parameter] 
const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:3334:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:3371:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q6_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:3394:45: warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ ggml-cuda.cu:3383:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:3454:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:6947:26: warning: unused variable 'size_src0_ddq' [-Wunused-variable] const size_t size_src0_ddq = split ? (row_high[id]-row_low[id])*ne00 * src0_ts/src0_bs : ggml_nbytes(src0); ^ ggml-cuda.cu:6846:19: warning: unused variable 'nrows0' [-Wunused-variable] const int64_t nrows0 = ggml_nrows(src0); ^ ggml-cuda.cu:8135:103: warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn] static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) { ^ ggml-cuda.cu:8142:106: warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn] static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) { ^ ggml-cuda.cu:2248:45: warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ ggml-cuda.cu:3618:9: note: in instantiation of function template specialization 'load_tiles_q4_0<128, 8, false>' requested here load_tiles_q4_0, VDR_Q4_0_Q8_1_MMQ, vec_dot_q4_0_q8_1_mul_mat> ^ ggml-cuda.cu:5158:9: note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3617:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_0, 64, 128, 8, &allocate_tiles_q4_0, &load_tiles_q4_0, 4, &vec_dot_q4_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5158:9: note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ ggml-cuda.cu:2248:45: warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ ggml-cuda.cu:3618:9: note: in instantiation of function template specialization 'load_tiles_q4_0<128, 8, true>' requested here load_tiles_q4_0, VDR_Q4_0_Q8_1_MMQ, vec_dot_q4_0_q8_1_mul_mat> ^ ggml-cuda.cu:5162:9: note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3617:5: note: in instantiation 
of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_0, 64, 128, 8, &allocate_tiles_q4_0, &load_tiles_q4_0, 4, &vec_dot_q4_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5162:9: note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ ggml-cuda.cu:2342:45: warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ ggml-cuda.cu:3687:9: note: in instantiation of function template specialization 'load_tiles_q4_1<128, 8, false>' requested here load_tiles_q4_1, VDR_Q4_1_Q8_1_MMQ, vec_dot_q4_1_q8_1_mul_mat> ^ ggml-cuda.cu:5203:9: note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3686:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_1, 64, 128, 8, &allocate_tiles_q4_1, &load_tiles_q4_1, 4, &vec_dot_q4_1_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5203:9: note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ ggml-cuda.cu:2342:45: warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ ggml-cuda.cu:3687:9: note: in instantiation of function template specialization 'load_tiles_q4_1<128, 8, true>' requested here load_tiles_q4_1, VDR_Q4_1_Q8_1_MMQ, vec_dot_q4_1_q8_1_mul_mat> ^ ggml-cuda.cu:5207:9: note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3686:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_1, 64, 128, 8, &allocate_tiles_q4_1, &load_tiles_q4_1, 4, &vec_dot_q4_1_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5207:9: note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ ggml-cuda.cu:2434:45: warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ ggml-cuda.cu:3754:9: note: in instantiation of function template specialization 'load_tiles_q5_0<128, 8, false>' requested here load_tiles_q5_0, VDR_Q5_0_Q8_1_MMQ, vec_dot_q5_0_q8_1_mul_mat> ^ ggml-cuda.cu:5248:9: note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3753:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, false, block_q5_0, 64, 128, 8, &allocate_tiles_q5_0, &load_tiles_q5_0, 4, &vec_dot_q5_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5248:9: note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ ggml-cuda.cu:2434:45: warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ ggml-cuda.cu:3754:9: note: in instantiation of function template specialization 
'load_tiles_q5_0<128, 8, true>' requested here load_tiles_q5_0, VDR_Q5_0_Q8_1_MMQ, vec_dot_q5_0_q8_1_mul_mat> ^ ggml-cuda.cu:5252:9: note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3753:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, false, block_q5_0, 64, 128, 8, &allocate_tiles_q5_0, &load_tiles_q5_0, 4, &vec_dot_q5_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5252:9: note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ ggml-cuda.cu:2548:45: warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ ggml-cuda.cu:3821:9: note: in instantiation of function template specialization 'load_tiles_q5_1<128, 8, false>' requested here load_tiles_q5_1, VDR_Q5_1_Q8_1_MMQ, vec_dot_q5_1_q8_1_mul_mat> ^ ggml-cuda.cu:5293:9: note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3820:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q5_1, 64, 128, 8, &allocate_tiles_q5_1, &load_tiles_q5_1, 4, &vec_dot_q5_1_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5293:9: note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ ggml-cuda.cu:2548:45: warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ ggml-cuda.cu:3821:9: note: in instantiation of function template specialization 'load_tiles_q5_1<128, 8, true>' requested here load_tiles_q5_1, VDR_Q5_1_Q8_1_MMQ, vec_dot_q5_1_q8_1_mul_mat> ^ ggml-cuda.cu:5297:9: note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3820:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q5_1, 64, 128, 8, &allocate_tiles_q5_1, &load_tiles_q5_1, 4, &vec_dot_q5_1_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5297:9: note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ ggml-cuda.cu:2655:45: warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ ggml-cuda.cu:3888:9: note: in instantiation of function template specialization 'load_tiles_q8_0<128, 8, false>' requested here load_tiles_q8_0, VDR_Q8_0_Q8_1_MMQ, vec_dot_q8_0_q8_1_mul_mat> ^ ggml-cuda.cu:5338:9: note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3887:5: note: in instantiation of function template specialization 'mul_mat_q<32, 1, 8, false, block_q8_0, 64, 128, 8, &allocate_tiles_q8_0, 
&load_tiles_q8_0, 8, &vec_dot_q8_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5338:9: note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ ggml-cuda.cu:2655:45: warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ ggml-cuda.cu:3888:9: note: in instantiation of function template specialization 'load_tiles_q8_0<128, 8, true>' requested here load_tiles_q8_0, VDR_Q8_0_Q8_1_MMQ, vec_dot_q8_0_q8_1_mul_mat> ^ ggml-cuda.cu:5342:9: note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3887:5: note: in instantiation of function template specialization 'mul_mat_q<32, 1, 8, false, block_q8_0, 64, 128, 8, &allocate_tiles_q8_0, &load_tiles_q8_0, 8, &vec_dot_q8_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5342:9: note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ ggml-cuda.cu:2745:45: warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ ggml-cuda.cu:3955:9: note: in instantiation of function template specialization 'load_tiles_q2_K<128, 8, false>' requested here load_tiles_q2_K, VDR_Q2_K_Q8_1_MMQ, vec_dot_q2_K_q8_1_mul_mat> ^ ggml-cuda.cu:5383:9: note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3954:5: note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q2_K, 64, 128, 8, &allocate_tiles_q2_K, &load_tiles_q2_K, 2, &vec_dot_q2_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5383:9: note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ ggml-cuda.cu:2745:45: warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ ggml-cuda.cu:3955:9: note: in instantiation of function template specialization 'load_tiles_q2_K<128, 8, true>' requested here load_tiles_q2_K, VDR_Q2_K_Q8_1_MMQ, vec_dot_q2_K_q8_1_mul_mat> ^ ggml-cuda.cu:5387:9: note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3954:5: note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q2_K, 64, 128, 8, &allocate_tiles_q2_K, &load_tiles_q2_K, 2, &vec_dot_q2_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5387:9: note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ ggml-cuda.cu:2866:45: warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ ggml-cuda.cu:4024:9: note: in instantiation of function template specialization 'load_tiles_q3_K<64, 8, false>' requested here load_tiles_q3_K, VDR_Q3_K_Q8_1_MMQ, vec_dot_q3_K_q8_1_mul_mat> ^ 
ggml-cuda.cu:5430:9: note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4023:5: note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q3_K, 128, 64, 8, &allocate_tiles_q3_K, &load_tiles_q3_K, 2, &vec_dot_q3_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5430:9: note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ ggml-cuda.cu:2866:45: warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ ggml-cuda.cu:4024:9: note: in instantiation of function template specialization 'load_tiles_q3_K<64, 8, true>' requested here load_tiles_q3_K, VDR_Q3_K_Q8_1_MMQ, vec_dot_q3_K_q8_1_mul_mat> ^ ggml-cuda.cu:5434:9: note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4023:5: note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q3_K, 128, 64, 8, &allocate_tiles_q3_K, &load_tiles_q3_K, 2, &vec_dot_q3_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5434:9: note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ ggml-cuda.cu:3084:45: warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ ggml-cuda.cu:4093:9: note: in instantiation of function template specialization 'load_tiles_q4_K<128, 8, false>' requested here load_tiles_q4_K, VDR_Q4_K_Q8_1_MMQ, vec_dot_q4_K_q8_1_mul_mat> ^ ggml-cuda.cu:5476:9: note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ ggml-cuda.cu:3129:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4092:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q4_K, 64, 128, 8, &allocate_tiles_q4_K, &load_tiles_q4_K, 8, &vec_dot_q4_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5476:9: note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ ggml-cuda.cu:3084:45: warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ ggml-cuda.cu:4093:9: note: in instantiation of function template specialization 'load_tiles_q4_K<128, 8, true>' requested here load_tiles_q4_K, VDR_Q4_K_Q8_1_MMQ, vec_dot_q4_K_q8_1_mul_mat> ^ ggml-cuda.cu:5480:9: note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ ggml-cuda.cu:3129:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float 
sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4092:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q4_K, 64, 128, 8, &allocate_tiles_q4_K, &load_tiles_q4_K, 8, &vec_dot_q4_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5480:9: note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ ggml-cuda.cu:3265:45: warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ ggml-cuda.cu:4160:9: note: in instantiation of function template specialization 'load_tiles_q5_K<128, 8, false>' requested here load_tiles_q5_K, VDR_Q5_K_Q8_1_MMQ, vec_dot_q5_K_q8_1_mul_mat> ^ ggml-cuda.cu:5521:9: note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ ggml-cuda.cu:3321:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4159:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q5_K, 64, 128, 8, &allocate_tiles_q5_K, &load_tiles_q5_K, 8, &vec_dot_q5_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5521:9: note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ ggml-cuda.cu:3265:45: warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ ggml-cuda.cu:4160:9: note: in instantiation of function template specialization 'load_tiles_q5_K<128, 8, true>' requested here load_tiles_q5_K, VDR_Q5_K_Q8_1_MMQ, vec_dot_q5_K_q8_1_mul_mat> ^ ggml-cuda.cu:5525:9: note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ ggml-cuda.cu:3321:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4159:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q5_K, 64, 128, 8, &allocate_tiles_q5_K, &load_tiles_q5_K, 8, &vec_dot_q5_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5525:9: note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ ggml-cuda.cu:3394:45: warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ ggml-cuda.cu:4229:9: note: in instantiation of function template specialization 'load_tiles_q6_K<128, 8, false>' requested here load_tiles_q6_K, VDR_Q6_K_Q8_1_MMQ, vec_dot_q6_K_q8_1_mul_mat> ^ ggml-cuda.cu:5566:9: note: in instantiation of function template specialization 'mul_mat_q6_K' requested here mul_mat_q6_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4228:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, false, block_q6_K, 64, 128, 8, &allocate_tiles_q6_K, &load_tiles_q6_K, 8, 
&vec_dot_q6_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5566:9: note: in instantiation of function template specialization 'mul_mat_q6_K' requested here mul_mat_q6_K<<>> ^ ggml-cuda.cu:3394:45: warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ ggml-cuda.cu:4229:9: note: in instantiation of function template specialization 'load_tiles_q6_K<128, 8, true>' requested here load_tiles_q6_K, VDR_Q6_K_Q8_1_MMQ, vec_dot_q6_K_q8_1_mul_mat> ^ ggml-cuda.cu:5570:9: note: in instantiation of function template specialization 'mul_mat_q6_K' requested here mul_mat_q6_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4228:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, false, block_q6_K, 64, 128, 8, &allocate_tiles_q6_K, &load_tiles_q6_K, 8, &vec_dot_q6_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5570:9: note: in instantiation of function template specialization 'mul_mat_q6_K' requested here mul_mat_q6_K<<>> ^ 110 warnings generated when compiling for gfx1100. warning: unknown warning option '-Wno-format-truncation' [-Wunknown-warning-option] ggml-cuda.cu:235:41: warning: cast from 'const signed char *' to 'unsigned short *' drops const qualifier [-Wcast-qual] const uint16_t * x16 = (uint16_t *) (x8 + sizeof(int) * i32); // assume at least 2 byte alignment ^ ggml-cuda.cu:245:41: warning: cast from 'const unsigned char *' to 'unsigned short *' drops const qualifier [-Wcast-qual] const uint16_t * x16 = (uint16_t *) (x8 + sizeof(int) * i32); // assume at least 2 byte alignment ^ ggml-cuda.cu:255:22: warning: cast from 'const signed char *' to 'int *' drops const qualifier [-Wcast-qual] return *((int *) (x8 + sizeof(int) * i32)); // assume at least 4 byte alignment ^ ggml-cuda.cu:259:22: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] return *((int *) (x8 + sizeof(int) * i32)); // assume at least 4 byte alignment ^ ggml-cuda.cu:467:75: warning: suggest braces around initialization of subobject [-Wmissing-braces] static cudaStream_t g_cudaStreams[GGML_CUDA_MAX_DEVICES][MAX_STREAMS] = { nullptr }; ^~~~~~~ { } ggml-cuda.cu:2227:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2227:129: warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2248:45: warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ ggml-cuda.cu:2237:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:2238:24: warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ ggml-cuda.cu:2288:37: warning: cast from 'const __half2 *' to 'float *' drops const qualifier [-Wcast-qual] const float * x_dmf = (float *) x_dm; ^ ggml-cuda.cu:2284:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * 
__restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2284:125: warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2321:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2321:129: warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2342:45: warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ ggml-cuda.cu:2331:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:2332:24: warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ ggml-cuda.cu:2375:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2375:125: warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2413:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2413:129: warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2434:45: warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ ggml-cuda.cu:2423:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:2424:24: warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ ggml-cuda.cu:2487:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2487:125: warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2527:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2527:129: warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2548:45: warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ 
ggml-cuda.cu:2537:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:2538:24: warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ ggml-cuda.cu:2598:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2598:125: warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2633:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q8_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2633:129: warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q8_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2655:45: warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ ggml-cuda.cu:2643:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:2644:24: warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ ggml-cuda.cu:2688:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2688:125: warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2722:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q2_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:2745:45: warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ ggml-cuda.cu:2734:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:2791:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:2866:45: warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ ggml-cuda.cu:2947:41: warning: cast from 'const int *' to 'signed char *' drops const qualifier [-Wcast-qual] const int8_t * scales = ((int8_t *) (x_sc + i * (WARP_SIZE/4) + i/4 + kbx*4)) + ky/4; ^ ggml-cuda.cu:3061:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:3084:45: warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const 
block_q4_K * bx0 = (block_q4_K *) vx; ^ ggml-cuda.cu:3129:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3073:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:3142:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:3242:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:3265:45: warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ ggml-cuda.cu:3321:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3254:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:3334:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:3371:116: warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q6_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ ggml-cuda.cu:3394:45: warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ ggml-cuda.cu:3383:106: warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ ggml-cuda.cu:3454:94: warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ ggml-cuda.cu:6947:26: warning: unused variable 'size_src0_ddq' [-Wunused-variable] const size_t size_src0_ddq = split ? 
(row_high[id]-row_low[id])*ne00 * src0_ts/src0_bs : ggml_nbytes(src0); ^ ggml-cuda.cu:6846:19: warning: unused variable 'nrows0' [-Wunused-variable] const int64_t nrows0 = ggml_nrows(src0); ^ ggml-cuda.cu:8135:103: warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn] static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) { ^ ggml-cuda.cu:8142:106: warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn] static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) { ^ ggml-cuda.cu:2248:45: warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ ggml-cuda.cu:3618:9: note: in instantiation of function template specialization 'load_tiles_q4_0<64, 8, false>' requested here load_tiles_q4_0, VDR_Q4_0_Q8_1_MMQ, vec_dot_q4_0_q8_1_mul_mat> ^ ggml-cuda.cu:5158:9: note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3617:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_0, 64, 64, 8, &allocate_tiles_q4_0, &load_tiles_q4_0, 4, &vec_dot_q4_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5158:9: note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ ggml-cuda.cu:2248:45: warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ ggml-cuda.cu:3618:9: note: in instantiation of function template specialization 'load_tiles_q4_0<64, 8, true>' requested here load_tiles_q4_0, VDR_Q4_0_Q8_1_MMQ, vec_dot_q4_0_q8_1_mul_mat> ^ ggml-cuda.cu:5162:9: note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3617:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_0, 64, 64, 8, &allocate_tiles_q4_0, &load_tiles_q4_0, 4, &vec_dot_q4_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5162:9: note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ ggml-cuda.cu:2342:45: warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ ggml-cuda.cu:3687:9: note: in instantiation of function template specialization 'load_tiles_q4_1<64, 8, false>' requested here load_tiles_q4_1, VDR_Q4_1_Q8_1_MMQ, vec_dot_q4_1_q8_1_mul_mat> ^ ggml-cuda.cu:5203:9: note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3686:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_1, 64, 64, 8, &allocate_tiles_q4_1, &load_tiles_q4_1, 4, &vec_dot_q4_1_q8_1_mul_mat>' requested here mul_mat_q, ^ 
ggml-cuda.cu:5203:9: note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ ggml-cuda.cu:2342:45: warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ ggml-cuda.cu:3687:9: note: in instantiation of function template specialization 'load_tiles_q4_1<64, 8, true>' requested here load_tiles_q4_1, VDR_Q4_1_Q8_1_MMQ, vec_dot_q4_1_q8_1_mul_mat> ^ ggml-cuda.cu:5207:9: note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3686:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_1, 64, 64, 8, &allocate_tiles_q4_1, &load_tiles_q4_1, 4, &vec_dot_q4_1_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5207:9: note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ ggml-cuda.cu:2434:45: warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ ggml-cuda.cu:3754:9: note: in instantiation of function template specialization 'load_tiles_q5_0<64, 8, false>' requested here load_tiles_q5_0, VDR_Q5_0_Q8_1_MMQ, vec_dot_q5_0_q8_1_mul_mat> ^ ggml-cuda.cu:5248:9: note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3753:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, false, block_q5_0, 64, 64, 8, &allocate_tiles_q5_0, &load_tiles_q5_0, 4, &vec_dot_q5_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5248:9: note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ ggml-cuda.cu:2434:45: warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ ggml-cuda.cu:3754:9: note: in instantiation of function template specialization 'load_tiles_q5_0<64, 8, true>' requested here load_tiles_q5_0, VDR_Q5_0_Q8_1_MMQ, vec_dot_q5_0_q8_1_mul_mat> ^ ggml-cuda.cu:5252:9: note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3753:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, false, block_q5_0, 64, 64, 8, &allocate_tiles_q5_0, &load_tiles_q5_0, 4, &vec_dot_q5_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5252:9: note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ ggml-cuda.cu:2548:45: warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ ggml-cuda.cu:3821:9: note: in instantiation of function template specialization 'load_tiles_q5_1<64, 8, false>' requested here load_tiles_q5_1, VDR_Q5_1_Q8_1_MMQ, vec_dot_q5_1_q8_1_mul_mat> ^ ggml-cuda.cu:5293:9: note: in instantiation of function template specialization 
'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3820:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q5_1, 64, 64, 8, &allocate_tiles_q5_1, &load_tiles_q5_1, 4, &vec_dot_q5_1_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5293:9: note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ ggml-cuda.cu:2548:45: warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ ggml-cuda.cu:3821:9: note: in instantiation of function template specialization 'load_tiles_q5_1<64, 8, true>' requested here load_tiles_q5_1, VDR_Q5_1_Q8_1_MMQ, vec_dot_q5_1_q8_1_mul_mat> ^ ggml-cuda.cu:5297:9: note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3820:5: note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q5_1, 64, 64, 8, &allocate_tiles_q5_1, &load_tiles_q5_1, 4, &vec_dot_q5_1_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5297:9: note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ ggml-cuda.cu:2655:45: warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ ggml-cuda.cu:3888:9: note: in instantiation of function template specialization 'load_tiles_q8_0<64, 8, false>' requested here load_tiles_q8_0, VDR_Q8_0_Q8_1_MMQ, vec_dot_q8_0_q8_1_mul_mat> ^ ggml-cuda.cu:5338:9: note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3887:5: note: in instantiation of function template specialization 'mul_mat_q<32, 1, 8, false, block_q8_0, 64, 64, 8, &allocate_tiles_q8_0, &load_tiles_q8_0, 8, &vec_dot_q8_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5338:9: note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ ggml-cuda.cu:2655:45: warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ ggml-cuda.cu:3888:9: note: in instantiation of function template specialization 'load_tiles_q8_0<64, 8, true>' requested here load_tiles_q8_0, VDR_Q8_0_Q8_1_MMQ, vec_dot_q8_0_q8_1_mul_mat> ^ ggml-cuda.cu:5342:9: note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3887:5: note: in instantiation of function template specialization 'mul_mat_q<32, 1, 8, false, block_q8_0, 64, 64, 8, &allocate_tiles_q8_0, &load_tiles_q8_0, 8, &vec_dot_q8_0_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5342:9: note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> 
^ ggml-cuda.cu:2745:45: warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ ggml-cuda.cu:3955:9: note: in instantiation of function template specialization 'load_tiles_q2_K<32, 8, false>' requested here load_tiles_q2_K, VDR_Q2_K_Q8_1_MMQ, vec_dot_q2_K_q8_1_mul_mat> ^ ggml-cuda.cu:5383:9: note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3954:5: note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q2_K, 128, 32, 8, &allocate_tiles_q2_K, &load_tiles_q2_K, 2, &vec_dot_q2_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5383:9: note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ ggml-cuda.cu:2745:45: warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ ggml-cuda.cu:3955:9: note: in instantiation of function template specialization 'load_tiles_q2_K<32, 8, true>' requested here load_tiles_q2_K, VDR_Q2_K_Q8_1_MMQ, vec_dot_q2_K_q8_1_mul_mat> ^ ggml-cuda.cu:5387:9: note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:3954:5: note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q2_K, 128, 32, 8, &allocate_tiles_q2_K, &load_tiles_q2_K, 2, &vec_dot_q2_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5387:9: note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ ggml-cuda.cu:2866:45: warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ ggml-cuda.cu:4024:9: note: in instantiation of function template specialization 'load_tiles_q3_K<128, 8, false>' requested here load_tiles_q3_K, VDR_Q3_K_Q8_1_MMQ, vec_dot_q3_K_q8_1_mul_mat> ^ ggml-cuda.cu:5430:9: note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4023:5: note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q3_K, 32, 128, 8, &allocate_tiles_q3_K, &load_tiles_q3_K, 2, &vec_dot_q3_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5430:9: note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ ggml-cuda.cu:2866:45: warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ ggml-cuda.cu:4024:9: note: in instantiation of function template specialization 'load_tiles_q3_K<128, 8, true>' requested here load_tiles_q3_K, VDR_Q3_K_Q8_1_MMQ, vec_dot_q3_K_q8_1_mul_mat> ^ ggml-cuda.cu:5434:9: note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of 
subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4023:5: note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q3_K, 32, 128, 8, &allocate_tiles_q3_K, &load_tiles_q3_K, 2, &vec_dot_q3_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5434:9: note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ ggml-cuda.cu:3084:45: warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ ggml-cuda.cu:4093:9: note: in instantiation of function template specialization 'load_tiles_q4_K<64, 8, false>' requested here load_tiles_q4_K, VDR_Q4_K_Q8_1_MMQ, vec_dot_q4_K_q8_1_mul_mat> ^ ggml-cuda.cu:5476:9: note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ ggml-cuda.cu:3129:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4092:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q4_K, 32, 64, 8, &allocate_tiles_q4_K, &load_tiles_q4_K, 8, &vec_dot_q4_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5476:9: note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ ggml-cuda.cu:3084:45: warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ ggml-cuda.cu:4093:9: note: in instantiation of function template specialization 'load_tiles_q4_K<64, 8, true>' requested here load_tiles_q4_K, VDR_Q4_K_Q8_1_MMQ, vec_dot_q4_K_q8_1_mul_mat> ^ ggml-cuda.cu:5480:9: note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ ggml-cuda.cu:3129:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4092:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q4_K, 32, 64, 8, &allocate_tiles_q4_K, &load_tiles_q4_K, 8, &vec_dot_q4_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5480:9: note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ ggml-cuda.cu:3265:45: warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ ggml-cuda.cu:4160:9: note: in instantiation of function template specialization 'load_tiles_q5_K<64, 8, false>' requested here load_tiles_q5_K, VDR_Q5_K_Q8_1_MMQ, vec_dot_q5_K_q8_1_mul_mat> ^ ggml-cuda.cu:5521:9: note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ ggml-cuda.cu:3321:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } 
ggml-cuda.cu:4159:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q5_K, 32, 64, 8, &allocate_tiles_q5_K, &load_tiles_q5_K, 8, &vec_dot_q5_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5521:9: note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ ggml-cuda.cu:3265:45: warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ ggml-cuda.cu:4160:9: note: in instantiation of function template specialization 'load_tiles_q5_K<64, 8, true>' requested here load_tiles_q5_K, VDR_Q5_K_Q8_1_MMQ, vec_dot_q5_K_q8_1_mul_mat> ^ ggml-cuda.cu:5525:9: note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ ggml-cuda.cu:3321:38: warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4159:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q5_K, 32, 64, 8, &allocate_tiles_q5_K, &load_tiles_q5_K, 8, &vec_dot_q5_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5525:9: note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ ggml-cuda.cu:3394:45: warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ ggml-cuda.cu:4229:9: note: in instantiation of function template specialization 'load_tiles_q6_K<64, 8, false>' requested here load_tiles_q6_K, VDR_Q6_K_Q8_1_MMQ, vec_dot_q6_K_q8_1_mul_mat> ^ ggml-cuda.cu:5566:9: note: in instantiation of function template specialization 'mul_mat_q6_K' requested here mul_mat_q6_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4228:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, false, block_q6_K, 32, 64, 8, &allocate_tiles_q6_K, &load_tiles_q6_K, 8, &vec_dot_q6_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5566:9: note: in instantiation of function template specialization 'mul_mat_q6_K' requested here mul_mat_q6_K<<>> ^ ggml-cuda.cu:3394:45: warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ ggml-cuda.cu:4229:9: note: in instantiation of function template specialization 'load_tiles_q6_K<64, 8, true>' requested here load_tiles_q6_K, VDR_Q6_K_Q8_1_MMQ, vec_dot_q6_K_q8_1_mul_mat> ^ ggml-cuda.cu:5570:9: note: in instantiation of function template specialization 'mul_mat_q6_K' requested here mul_mat_q6_K<<>> ^ ggml-cuda.cu:3498:49: warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } ggml-cuda.cu:4228:5: note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, false, block_q6_K, 32, 64, 8, &allocate_tiles_q6_K, &load_tiles_q6_K, 8, &vec_dot_q6_K_q8_1_mul_mat>' requested here mul_mat_q, ^ ggml-cuda.cu:5570:9: note: in instantiation of function template specialization 'mul_mat_q6_K' requested here mul_mat_q6_K<<>> ^ 110 warnings generated when compiling for 
host. cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c ggml-alloc.c -o ggml-alloc.o cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c ggml-backend.c -o ggml-backend.o cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c ggml-quants.c -o ggml-quants.o g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o main -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas ==== Run ./main -h for help. ==== g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/quantize/quantize.cpp build-info.o ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o quantize -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp build-info.o ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o quantize-stats -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o perplexity -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o embedding -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o vdot -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native pocs/vdot/q8dot.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o q8dot -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/train.cpp -o train.o g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o train-text-from-scratch -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o convert-llama2c-to-ggml -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/simple/simple.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o simple -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/batched/batched.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o batched -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/batched-bench/batched-bench.cpp build-info.o ggml.o llama.o common.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o batched-bench -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/save-load-state/save-load-state.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o save-load-state -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -Iexamples/server examples/server/server.cpp examples/llava/clip.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o server -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas -Wno-cast-qual In copy constructor ‘task_result::task_result(const task_result&)’, inlined from ‘void std::__new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_result; _Args = {const task_result&}; _Tp = task_result]’ at /usr/include/c++/13.2.1/bits/new_allocator.h:187:4, inlined from ‘static void std::allocator_traits >::construct(allocator_type&, _Up*, _Args&& ...) [with _Up = task_result; _Args = {const task_result&}; _Tp = task_result]’ at /usr/include/c++/13.2.1/bits/alloc_traits.h:537:17, inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_result; _Alloc = std::allocator]’ at /usr/include/c++/13.2.1/bits/stl_vector.h:1283:30, inlined from ‘void llama_server_context::send_error(int, std::string)’ at examples/server/server.cpp:1097:32: examples/server/server.cpp:154:8: warning: ‘res.task_result::stop’ may be used uninitialized [-Wmaybe-uninitialized] 154 | struct task_result { | ^~~~~~~~~~~ examples/server/server.cpp: In member function ‘void llama_server_context::send_error(int, std::string)’: examples/server/server.cpp:1093:21: note: ‘res’ declared here 1093 | task_result res; | ^~~ In copy constructor ‘task_server::task_server(const task_server&)’, inlined from ‘void std::__new_allocator<_Tp>::construct(_Up*, _Args&& ...) 
[with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/new_allocator.h:187:4, inlined from ‘static void std::allocator_traits >::construct(allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/alloc_traits.h:537:17, inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator]’ at /usr/include/c++/13.2.1/bits/stl_vector.h:1283:30, inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at examples/server/server.cpp:1259:30, inlined from ‘main(int, char**)::’ at examples/server/server.cpp:2333:61: examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized] 145 | struct task_server { | ^~~~~~~~~~~ examples/server/server.cpp: In lambda function: examples/server/server.cpp:1253:21: note: ‘task’ declared here 1253 | task_server task; | ^~~~ In copy constructor ‘task_server::task_server(const task_server&)’, inlined from ‘void std::__new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/new_allocator.h:187:4, inlined from ‘static void std::allocator_traits >::construct(allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/alloc_traits.h:537:17, inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator]’ at /usr/include/c++/13.2.1/bits/stl_vector.h:1283:30, inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at examples/server/server.cpp:1259:30, inlined from ‘main(int, char**)::’ at examples/server/server.cpp:2388:61: examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized] 145 | struct task_server { | ^~~~~~~~~~~ examples/server/server.cpp: In lambda function: examples/server/server.cpp:1253:21: note: ‘task’ declared here 1253 | task_server task; | ^~~~ In copy constructor ‘task_server::task_server(const task_server&)’, inlined from ‘void std::__new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/new_allocator.h:187:4, inlined from ‘static void std::allocator_traits >::construct(allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/alloc_traits.h:537:17, inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator]’ at /usr/include/c++/13.2.1/bits/stl_vector.h:1283:30, inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at examples/server/server.cpp:1259:30, inlined from ‘main(int, char**)::’ at examples/server/server.cpp:2492:61: examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized] 145 | struct task_server { | ^~~~~~~~~~~ examples/server/server.cpp: In lambda function: examples/server/server.cpp:1253:21: note: ‘task’ declared here 1253 | task_server task; | ^~~~ g++ -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/gguf/gguf.cpp ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o gguf -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/llama-bench/llama-bench.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o llama-bench -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/llava/llava-cli.cpp examples/llava/clip.cpp examples/llava/llava.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o llava-cli -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas -Wno-cast-qual g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/baby-llama/baby-llama.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o baby-llama -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/beam-search/beam-search.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o beam-search -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/speculative/speculative.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o speculative -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/infill/infill.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o infill -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/benchmark/benchmark-matmult.cpp build-info.o ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o benchmark-matmult -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/parallel/parallel.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o parallel -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/finetune/finetune.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o finetune -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/export-lora/export-lora.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o export-lora -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas cc -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c tests/test-c.c -o tests/test-c.o ```
./main -ngl 99 -m ../koboldcpp/models/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q5_K_M.gguf -mg 0 -p "Write a function in TypeScript that sums numbers" ``` Log start main: build = 1503 (4a4fd3e) main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu main: seed = 1699662201 ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 2 ROCm devices: Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0 Device 1: AMD Radeon RX 6900 XT, compute capability 10.3 llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../koboldcpp/models/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q5_K_M.gguf (version GGUF V2) llama_model_loader: - tensor 0: token_embd.weight q5_K [ 4096, 32000, 1, 1 ] llama_model_loader: - tensor 1: blk.0.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 2: blk.0.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 3: blk.0.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 4: blk.0.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 6: blk.0.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 7: blk.0.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 10: blk.1.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 11: blk.1.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 12: blk.1.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 13: blk.1.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 15: blk.1.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 16: blk.1.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 19: blk.2.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 20: blk.2.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 21: blk.2.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 22: blk.2.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 24: blk.2.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 25: blk.2.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 28: blk.3.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 29: blk.3.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 30: blk.3.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 31: blk.3.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 33: blk.3.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 34: blk.3.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] 
llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 37: blk.4.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 38: blk.4.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 39: blk.4.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 40: blk.4.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 42: blk.4.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 43: blk.4.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 46: blk.5.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 47: blk.5.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 48: blk.5.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 49: blk.5.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 51: blk.5.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 52: blk.5.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 55: blk.6.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 56: blk.6.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 57: blk.6.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 58: blk.6.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 59: blk.6.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 60: blk.6.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 61: blk.6.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 64: blk.7.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 65: blk.7.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 66: blk.7.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 67: blk.7.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 68: blk.7.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 69: blk.7.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 70: blk.7.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 73: blk.8.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 74: blk.8.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 75: blk.8.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 76: blk.8.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 77: blk.8.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 78: blk.8.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 79: 
blk.8.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 82: blk.9.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 83: blk.9.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 84: blk.9.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 85: blk.9.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 86: blk.9.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 87: blk.9.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 88: blk.9.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 91: blk.10.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 92: blk.10.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 93: blk.10.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 94: blk.10.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 95: blk.10.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 96: blk.10.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 97: blk.10.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 100: blk.11.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 101: blk.11.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 102: blk.11.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 103: blk.11.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 104: blk.11.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 105: blk.11.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 106: blk.11.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 109: blk.12.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 110: blk.12.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 111: blk.12.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 112: blk.12.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 113: blk.12.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 114: blk.12.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 115: blk.12.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 118: blk.13.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 119: blk.13.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 120: blk.13.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 121: blk.13.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 122: blk.13.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] 
llama_model_loader: - tensor 123: blk.13.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 124: blk.13.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 127: blk.14.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 128: blk.14.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 129: blk.14.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 130: blk.14.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 131: blk.14.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 132: blk.14.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 133: blk.14.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 136: blk.15.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 137: blk.15.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 138: blk.15.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 139: blk.15.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 141: blk.15.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 142: blk.15.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 145: blk.16.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 146: blk.16.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 147: blk.16.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 148: blk.16.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 149: blk.16.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 150: blk.16.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 151: blk.16.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 154: blk.17.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 155: blk.17.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 156: blk.17.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 157: blk.17.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 158: blk.17.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 159: blk.17.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 160: blk.17.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 163: blk.18.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 164: blk.18.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 165: blk.18.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 166: 
blk.18.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 167: blk.18.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 168: blk.18.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 169: blk.18.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 172: blk.19.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 173: blk.19.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 174: blk.19.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 175: blk.19.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 176: blk.19.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 177: blk.19.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 178: blk.19.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 181: blk.20.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 182: blk.20.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 183: blk.20.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 184: blk.20.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 185: blk.20.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 186: blk.20.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 187: blk.20.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 190: blk.21.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 191: blk.21.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 192: blk.21.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 193: blk.21.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 194: blk.21.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 195: blk.21.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 196: blk.21.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 199: blk.22.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 200: blk.22.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 201: blk.22.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 202: blk.22.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 203: blk.22.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 204: blk.22.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 205: blk.22.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 206: blk.22.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 207: blk.22.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 208: blk.23.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 209: blk.23.attn_k.weight q5_K 
[ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 210: blk.23.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 211: blk.23.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 212: blk.23.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 213: blk.23.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 214: blk.23.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 215: blk.23.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 216: blk.23.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 217: blk.24.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 218: blk.24.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 219: blk.24.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 220: blk.24.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 221: blk.24.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 222: blk.24.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 223: blk.24.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 224: blk.24.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 225: blk.24.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 226: blk.25.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 227: blk.25.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 228: blk.25.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 229: blk.25.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 230: blk.25.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 231: blk.25.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 232: blk.25.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 233: blk.25.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 234: blk.25.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 235: blk.26.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 236: blk.26.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 237: blk.26.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 238: blk.26.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 239: blk.26.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 240: blk.26.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 241: blk.26.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 242: blk.26.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 243: blk.26.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 244: blk.27.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 245: blk.27.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 246: blk.27.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 247: blk.27.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 248: blk.27.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 249: blk.27.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 250: blk.27.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 251: blk.27.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 252: blk.27.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] 
llama_model_loader: - tensor 253: blk.28.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 254: blk.28.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 255: blk.28.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 256: blk.28.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 257: blk.28.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 258: blk.28.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 259: blk.28.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 260: blk.28.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 261: blk.28.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 262: blk.29.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 263: blk.29.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 264: blk.29.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 265: blk.29.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 266: blk.29.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 267: blk.29.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 268: blk.29.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 271: blk.30.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 272: blk.30.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 273: blk.30.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 274: blk.30.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 275: blk.30.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 276: blk.30.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 277: blk.30.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 279: blk.30.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 280: blk.31.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 281: blk.31.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 282: blk.31.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 283: blk.31.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 284: blk.31.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 285: blk.31.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 286: blk.31.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 287: blk.31.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 288: blk.31.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 289: output_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ] llama_model_loader: - kv 0: general.architecture str llama_model_loader: - kv 1: general.name str llama_model_loader: - kv 2: llama.context_length u32 llama_model_loader: - kv 3: llama.embedding_length u32 llama_model_loader: - kv 4: llama.block_count u32 llama_model_loader: - kv 5: llama.feed_forward_length u32 llama_model_loader: - kv 6: llama.rope.dimension_count u32 llama_model_loader: - kv 7: llama.attention.head_count u32 llama_model_loader: 
- kv 8: llama.attention.head_count_kv u32 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 llama_model_loader: - kv 10: llama.rope.freq_base f32 llama_model_loader: - kv 11: general.file_type u32 llama_model_loader: - kv 12: tokenizer.ggml.model str llama_model_loader: - kv 13: tokenizer.ggml.tokens arr llama_model_loader: - kv 14: tokenizer.ggml.scores arr llama_model_loader: - kv 15: tokenizer.ggml.token_type arr llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 llama_model_loader: - kv 19: general.quantization_version u32 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q5_K: 193 tensors llama_model_loader: - type q6_K: 33 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). llm_load_print_meta: format = GGUF V2 llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = mostly Q5_K - Medium llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 4.78 GiB (5.67 BPW) llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.1 llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.11 MB llm_load_tensors: using ROCm for GPU acceleration ggml_cuda_set_main_device: using device 0 (AMD Radeon RX 7900 XTX) as main device llm_load_tensors: mem required = 86.04 MB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 35/35 layers to GPU llm_load_tensors: VRAM used: 4807.05 MB .................................................................................................. llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: offloading v cache to GPU llama_kv_cache_init: offloading k cache to GPU llama_kv_cache_init: VRAM kv self = 64.00 MB llama_new_context_with_model: kv self size = 64.00 MB llama_build_graph: non-view tensors processed: 740/740 llama_new_context_with_model: compute buffer total size = 79.63 MB llama_new_context_with_model: VRAM scratch buffer: 73.00 MB llama_new_context_with_model: total VRAM used: 4944.06 MB (model: 4807.05 MB, context: 137.00 MB) fish: Job 1, './main -ngl 99 -m ../koboldcpp/…' terminated by signal SIGSEGV (Address boundary error) ```
8XXD8 commented 10 months ago

This could be RDNA3/gfx1100-specific. I cross-compiled from a Debian NVIDIA build for the gfx900 target and it worked fine.
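As a minimal, untested sketch (not part of the original comment): one way to probe whether the hang is architecture-specific is to build for a single explicit GPU target and compare. The `AMDGPU_TARGETS` value below is an assumption for the 7900-series cards in this issue; a Vega card would use gfx900 as mentioned above.

```
# hypothetical single-target build via CMake, assuming the ROCm clang toolchain is installed
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
  cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release -j
```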

8XXD8 commented 10 months ago

Well, I managed to run into the same problem. After updating llama-cpp-python, the only response I get is an endless stream of "K" characters. I tried compiling libllama.so separately, but had the same result.

It's odd that main works when built with this:

make main LLAMA_HIPBLAS=on

but libllama.so produces garbage when built with this:

make libllama.so LLAMA_HIPBLAS=on
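For completeness, a hedged sketch of how one might make sure llama-cpp-python is actually using a HIPBLAS-enabled library rather than a stale CPU build; the environment variables below reflect llama-cpp-python's documented build flow and are assumptions, not something verified in this thread.

```
# rebuild the Python bindings with the ROCm backend enabled (assumed CMAKE_ARGS mechanism)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

# or, hypothetically, point the bindings at the libllama.so built above
# export LLAMA_CPP_LIB=/path/to/llama.cpp/libllama.so
```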
bojak83318 commented 9 months ago

Anyway, my lanes are not equal; this could be the problem, since in ROCm 5.7 multi-GPU support is only a preview:

''' Recommended system configuration for multi-GPU

PCIe® slots connected to the GPU must have identical PCIe lane width or bifurcation settings, and support PCIe 3.0 Atomics.

Refer to How ROCm uses PCIe Atomics for more information.

Example:

✓ - GPU0 PCIe x16 connection + GPU1 PCIe x16 connection

✓ - GPU0 PCIe x8 connection + GPU1 PCIe x8 connection

X  - GPU0 PCIe x16 connection + GPU1 PCIe x8 connection

''' Source: https://rocm.docs.amd.com/projects/radeon/en/docs-5.7.0/docs/limitations.html
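As an aside (a generic Linux check, not from the ROCm docs quoted above), the negotiated lane width for each card can be read from the kernel's view of the PCIe link; the vendor-ID filter below simply selects AMD devices.

```
# show negotiated PCIe link speed/width for AMD GPUs (1002 is AMD's PCI vendor ID)
sudo lspci -vv -d 1002: | grep -E "VGA|Display|LnkSta:"
```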

8XXD8 commented 9 months ago


Have you tried removing one of the GPUs? I mean physically, not just disabling it with HIP_VISIBLE_DEVICES. That way you can see whether it is a multi-GPU / PCIe lane issue or something else.
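For what it's worth, a quick software-only approximation of that test (with the caveat above that it does not isolate the cards the way physical removal does) is to run each GPU by itself; the model path is just the one from the logs earlier in this issue.

```
# try each device in isolation before blaming the multi-GPU code path
HIP_VISIBLE_DEVICES=0 ./main -ngl 99 -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -p "test"
HIP_VISIBLE_DEVICES=1 ./main -ngl 99 -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -p "test"
```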

YellowRoseCx commented 9 months ago

I believe the problem lies in how the initialization process is bugged in ROCm. It has been fixed, but the fix will not be released until ROCm 6.0.0, I believe, unless you build ROCm yourself with a self-compiled rocBLAS and Tensile from after these commits: rocBLAS commit bc4d8f5, Tensile commit ROCmSoftwarePlatform/Tensile@24d54d7.
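For anyone attempting that route, a rough sketch of pinning the two repositories to the commits named above; the actual rocBLAS build steps are left to its own documentation and are not reproduced here.

```
# check out the referenced rocBLAS and Tensile commits
git clone https://github.com/ROCmSoftwarePlatform/rocBLAS.git
git -C rocBLAS checkout bc4d8f5
git clone https://github.com/ROCmSoftwarePlatform/Tensile.git
git -C Tensile checkout 24d54d7
# build rocBLAS against this Tensile checkout per its build docs, then rebuild llama.cpp
```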

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.