ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Stuck loading VRAM ROCm multi gpu #3991

Closed bojak83318 closed 5 months ago

bojak83318 commented 10 months ago


When the model loads, it gets stuck loading into VRAM.

My computer is running dual AMD GPUs (7900 XTX and 7900 XT) on Ubuntu 22.04 with ROCm 5.7.


========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK   MCLK   Fan     Perf  PwrCap  VRAM%  GPU%  
0    69.0c           26.0W   28Mhz  96Mhz  22.75%  auto  291.0W   67%   0%    
1    50.0c           30.0W   33Mhz  96Mhz  14.9%   auto  282.0W   67%   0%    
=============================== End of ROCm SMI Log ================================

$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Steps to Reproduce

./main -ngl 99 -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -mg 0 -p "Write a function in TypeScript that sums numbers"

Failure Logs

Log start
main: build = 1487 (c41ea36)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1699438381
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
  Device 1: Radeon RX 7900 XT, compute capability 11.0
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  8192, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q4_0     [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  718:        blk.79.attn_output.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  719:             blk.79.attn_q.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  720:             blk.79.attn_v.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  721:               output_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  722:                    output.weight q6_K     [  8192, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                      str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  18:               general.quantization_version u32     
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  561 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 36.20 GiB (4.51 BPW) 
llm_load_print_meta:   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.26 MB
llm_load_tensors: using ROCm for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Radeon RX 7900 XTX) as main device
llm_load_tensors: mem required  =  140.89 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 36930.11 MB
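
The log stops after reporting `VRAM used: 36930.11 MB`. As a rough sanity check (not part of the original report), the sketch below tests whether that figure fits in the combined memory of the two cards and what a capacity-proportional split would look like. It assumes nominal capacities of 24 GiB for the 7900 XTX and 20 GiB for the 7900 XT, and it treats the log's "MB" as MiB. The helper `proportional_split` is hypothetical and is not a llama.cpp function.

```python
# Sketch only: capacities are nominal figures assumed for illustration,
# not values reported by llama.cpp in this thread.
MIB_PER_GIB = 1024


def proportional_split(required_mib, capacities_mib):
    """Split required_mib across devices in proportion to their capacity."""
    total = sum(capacities_mib)
    if required_mib > total:
        raise ValueError(
            f"need {required_mib:.2f} MiB but only {total} MiB available"
        )
    return [required_mib * cap / total for cap in capacities_mib]


# "VRAM used: 36930.11 MB" from the log above (assumed to mean MiB).
vram_used_mib = 36930.11
# Assumed nominal capacities: 24 GiB (7900 XTX) and 20 GiB (7900 XT).
capacities_mib = [24 * MIB_PER_GIB, 20 * MIB_PER_GIB]

split = proportional_split(vram_used_mib, capacities_mib)
for name, share, cap in zip(["7900 XTX", "7900 XT"], split, capacities_mib):
    print(f"{name}: {share:.0f} MiB of {cap} MiB ({100 * share / cap:.0f}%)")
```

This only checks raw capacity. It ignores the KV cache, scratch buffers, and how llama.cpp actually assigns layers to devices, so it does not by itself explain the hang.
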

wizd commented 10 months ago

My rig with two 7900 XTX cards:

super@super-System-Product-Name:~/apps/llama.cpp$ rocm-smi

========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
ERROR: GPU[2]   : sclk clock is unsupported
GPU[2]          : get_power_cap, Not supported on the given system
GPU  Temp (DieEdge)  AvgPwr   SCLK     MCLK     Fan  Perf  PwrCap       VRAM%  GPU%
0    33.0c           24.0W    295Mhz   96Mhz    0%   auto  339.0W         0%   10%
1    31.0c           81.0W    1514Mhz  96Mhz    0%   auto  339.0W         0%   56%
2    37.0c           55.141W  None     1800Mhz  0%   auto  Unsupported    9%   0%
=============================== End of ROCm SMI Log ================================
super@super-System-Product-Name:~/apps/llama.cpp$ rocminfo
ROCk module is loaded
HSA System Attributes
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

HSA Agents
Agent 1
  Name:                    AMD Ryzen 9 7950X3D 16-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 9 7950X3D 16-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4200
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    131082728(0x7d029e8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131082728(0x7d029e8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    131082728(0x7d029e8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
Agent 2
  Name:                    gfx1100
  Uuid:                    GPU-aee456bdb1c699e6
  Marketing Name:          Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2526
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 546
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS:
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
Agent 3
  Name:                    gfx1100
  Uuid:                    GPU-398a3f843a146602
  Marketing Name:          Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2526
  BDFID:                   2304
  Internal Node ID:        2
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 546
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS:
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
Agent 4
  Name:                    gfx1036
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    3
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      256(0x100) KB
  Chip ID:                 5710(0x164e)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2200
  BDFID:                   4352
  Internal Node ID:        3
  Compute Unit:            2
  SIMDs per CU:            2
  Shader Engines:          1
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 20
  SDMA engine uCode::      8
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS:
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1036
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
super@super-System-Product-Name:~/apps/llama.cpp$ HIP_VISIBLE_DEVICES=0 ./bin/main -m ../models/openbuddy-mistral-7b-v13.1.Q8_0.gguf -n 128 -ngl 50
Log start
main: build = 1499 (875fb42)
main: built with AMD clang version 17.0.0 ( roc-5.7.0 23352 d1e13c532a947d0cbfc94759c00dcf152294aa13) for x86_64-unknown-linux-gnu
main: seed  = 1699491182
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../models/openbuddy-mistral-7b-v13.1.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q8_0     [  4096, 36608,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   13:         blk.1.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   16:            blk.1.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   19:              blk.2.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   20:              blk.2.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   22:         blk.2.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   25:            blk.2.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   28:              blk.3.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   29:              blk.3.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   31:         blk.3.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   34:            blk.3.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   37:              blk.4.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   38:              blk.4.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   40:         blk.4.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   43:            blk.4.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   46:              blk.5.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   47:              blk.5.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   49:         blk.5.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   52:            blk.5.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   55:              blk.6.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   56:              blk.6.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   58:         blk.6.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   61:            blk.6.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   64:              blk.7.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   65:              blk.7.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   67:         blk.7.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   70:            blk.7.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   73:              blk.8.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   74:              blk.8.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   76:         blk.8.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   79:            blk.8.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   82:              blk.9.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   83:              blk.9.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   85:         blk.9.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   88:            blk.9.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   91:             blk.10.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   92:             blk.10.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   94:        blk.10.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   97:           blk.10.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  100:             blk.11.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  101:             blk.11.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  103:        blk.11.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  106:           blk.11.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  109:             blk.12.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  110:             blk.12.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  112:        blk.12.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  114:             blk.12.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  115:           blk.12.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  118:             blk.13.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  119:             blk.13.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  121:        blk.13.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  123:             blk.13.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  124:           blk.13.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  127:             blk.14.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  128:             blk.14.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  130:        blk.14.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  132:             blk.14.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  133:           blk.14.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  136:             blk.15.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  137:             blk.15.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  139:        blk.15.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  144:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  145:             blk.16.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  146:             blk.16.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  148:        blk.16.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  151:           blk.16.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  154:             blk.17.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  155:             blk.17.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  157:        blk.17.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  160:           blk.17.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  163:             blk.18.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  164:             blk.18.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  166:        blk.18.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  169:           blk.18.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  172:             blk.19.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  173:             blk.19.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  175:        blk.19.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  178:           blk.19.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  181:             blk.20.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  182:             blk.20.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  184:        blk.20.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  187:           blk.20.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  190:             blk.21.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  191:             blk.21.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  193:        blk.21.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  196:           blk.21.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  199:             blk.22.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  200:             blk.22.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  202:        blk.22.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  205:           blk.22.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  208:             blk.23.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  209:             blk.23.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  211:        blk.23.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  214:           blk.23.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  217:             blk.24.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  218:             blk.24.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  220:        blk.24.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  223:           blk.24.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  225:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  226:             blk.25.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  227:             blk.25.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  229:        blk.25.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  232:           blk.25.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  235:             blk.26.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  236:             blk.26.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  238:        blk.26.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  241:           blk.26.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  244:             blk.27.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  245:             blk.27.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  247:        blk.27.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  250:           blk.27.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  253:             blk.28.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  254:             blk.28.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  256:        blk.28.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  259:           blk.28.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  262:             blk.29.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  263:             blk.29.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  265:        blk.29.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  268:           blk.29.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  271:             blk.30.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  277:           blk.30.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  280:             blk.31.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  281:             blk.31.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_v.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  283:        blk.31.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  289:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  290:                    output.weight q8_0     [  4096, 36608,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                      str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                          general.file_type u32
llama_model_loader: - kv  12:                       tokenizer.ggml.model str
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv  19:               general.quantization_version u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: mismatch in special tokens definition ( 361/36608 vs 259/36608 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 36608
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q8_0
llm_load_print_meta: model params     = 7.28 B
llm_load_print_meta: model size       = 7.20 GiB (8.50 BPW)
llm_load_print_meta:   = openbuddy_openbuddy-mistral-7b-v13.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required  =  152.04 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 7224.95 MB
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 86.13 MB
llama_new_context_with_model: VRAM scratch buffer: 79.50 MB
llama_new_context_with_model: total VRAM used: 7368.45 MB (model: 7224.95 MB, context: 143.50 MB)

CUDA error 98 at /home/super/apps/llama.cpp/ invalid device function
current device: 0
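
As a sanity check on the memory figures in the log above, the reported "kv self size = 64.00 MB" can be reproduced from the printed hyperparameters. This is only a sketch: the variable names are mine, and the 2 bytes per element (f16 cache entries) is an assumption that the log does not state.

```shell
# Values copied from the llm_load_print_meta / llama_new_context_with_model lines.
n_ctx=512
n_layer=32
n_embd=4096
n_head=32
n_head_kv=8
bytes_per_elem=2   # assumption: the cache stores f16 values

head_dim=$(( n_embd / n_head ))        # per-head width, matches n_rot = 128 in the log
kv_width=$(( head_dim * n_head_kv ))   # width of one K (or V) row with grouped-query attention

# K and V each hold n_ctx rows of kv_width elements for every layer.
kv_bytes=$(( 2 * n_layer * n_ctx * kv_width * bytes_per_elem ))
kv_mib=$(( kv_bytes / 1024 / 1024 ))

echo "estimated KV cache: ${kv_mib} MiB"
```

With these inputs the estimate comes out at 64 MiB, in line with the log, so the numbers printed before the error look internally consistent.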
8XXD8 commented 10 months ago

I had the same kind of problems too. You have to build it with make; cmake caused the CUDA errors for me. As for the stuck loading, try launching it with --no-mmap; you will need enough RAM or swap for the full model. I had these problems with 2x MI25.
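
The "enough RAM or swap for the full model" condition can be checked before launching. The sketch below is not part of llama.cpp: the function name fits_in_memory is made up, and treating MemAvailable + SwapFree from /proc/meminfo as the usable budget is a rough approximation for Linux only. The demo runs against a made-up meminfo snapshot so the result is reproducible.

```shell
# fits_in_memory BYTES [MEMINFO_FILE]
# Succeeds when MemAvailable + SwapFree (both reported in kB) cover BYTES.
fits_in_memory() {
    needed_kib=$(( ($1 + 1023) / 1024 ))
    meminfo=${2:-/proc/meminfo}
    have_kib=$(awk '$1 == "MemAvailable:" || $1 == "SwapFree:" { sum += $2 }
                    END { printf "%d\n", sum }' "$meminfo")
    [ "${have_kib:-0}" -ge "$needed_kib" ]
}

# Hypothetical snapshot: about 48 GB of available RAM and 8 GB of free swap.
sample=$(mktemp)
printf 'MemAvailable:   48000000 kB\nSwapFree:        8000000 kB\n' > "$sample"

# Roughly the 39398.70 MB reported for the 70B model later in this thread.
model_bytes=$(( 39399 * 1024 * 1024 ))

if fits_in_memory "$model_bytes" "$sample"; then
    verdict="fits"
else
    verdict="does not fit"
fi
echo "model of $model_bytes bytes: $verdict"
rm -f "$sample"
```

On a real machine, drop the second argument so the function reads /proc/meminfo, and pass the size of the .gguf file in bytes.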

wizd commented 10 months ago

Thanks @8XXD8. Loading is OK now, but I get endless '#' as output:

llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 39564.34 MB
llm_load_tensors: using ROCm for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Radeon RX 7900 XTX) as main device
llm_load_tensors: mem required  =  165.64 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 39398.70 MB
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 160.00 MB
llama_new_context_with_model: kv self size  =  160.00 MB
llama_build_graph: non-view tensors processed: 1844/1844
llama_new_context_with_model: compute buffer total size = 151.63 MB
llama_new_context_with_model: VRAM scratch buffer: 145.00 MB
llama_new_context_with_model: total VRAM used: 39703.71 MB (model: 39398.70 MB, context: 305.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

tell me a very long story################################################################################################################################################################################################################################.......


./main -m ../text-generation-webui/models/openbuddy-llama2-70b-v10.1.Q4_K_M.gguf -ngl 100 -p "tell me a very long story" -n 1000 --no-mmap
8XXD8 commented 10 months ago

Have you tried other models? Openbuddy might be sensitive to prompt format. I downloaded the 7B version to test, and with your prompt it sometimes gave me only an end of text:

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

tell me a very long story
 [end of text]

I had some proper responses too.

bojak83318 commented 10 months ago

Must I build with

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=1 make -j

or

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j

for multi-GPU?

wizd commented 10 months ago

Wow, you are so cool! I got my first proper output with 2 GPUs:

./main -m ../text-generation-webui/models/samantha-1.1-llama-33b.Q6_K.gguf -ngl 100 -p "tell me a very long story" -n 1000 --no-mmap

got output:

llm_load_print_meta: model ftype      = mostly Q6_K
llm_load_print_meta: model params     = 32.53 B
llm_load_print_meta: model size       = 24.85 GiB (6.56 BPW)
llm_load_print_meta:   = samantha-1.1-llama-33b.ggmlv3.q6_K.bin
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 25450.39 MB
llm_load_tensors: mem required  = 25450.39 MB
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  780.00 MB
llama_build_graph: non-view tensors processed: 1384/1384
llama_new_context_with_model: compute buffer total size = 103.63 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

tell me a very long story, one that can keep us both entertained and engaged while we continue to chat. It's important for me to build our connection through stimulating conversations." [end of text]
bojak83318 commented 10 months ago

Congrats, so happy for you. I'm still stuck at

 llm_load_tensors: VRAM used: 36930.11 MB
bojak83318 commented 10 months ago

By the way, which make commands did you run, @wizd?

wizd commented 10 months ago

You are right: GPU selection must be done at compile time.

bojak83318 commented 10 months ago

So is it make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j or make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES="0,1" make -j?

bojak83318 commented 10 months ago

If I use

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES="0,1" make -j

I get

warning: see main for information on enabling GPU BLAS support
warning: llama.cpp was compiled without cuBLAS. It is not possible to set a main GPU.
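
The warning is consistent with how the shell treats that command line. In `NAME=value && command`, the assignment only creates a shell variable that is not exported, so the child process (here `make`) never sees it; the variable reaches the child only when the assignment is written as a prefix of the same command. The flag in that command is also spelled LLAMA_HIPLAS, not LLAMA_HIPBLAS. The sketch below shows the shell behaviour in isolation: DEMO_FLAG is a made-up name, and `sh -c` stands in for `make`.

```shell
# DEMO_FLAG is a hypothetical variable; `sh -c` plays the role of `make`.
unset DEMO_FLAG

# Form used in the command above: assignment, then &&, then the child process.
DEMO_FLAG=1 && seen_after_and=$(sh -c 'echo "${DEMO_FLAG:-unset}"')

# Prefix form: the assignment is placed in the child's environment.
unset DEMO_FLAG
seen_as_prefix=$(DEMO_FLAG=1 sh -c 'echo "${DEMO_FLAG:-unset}"')

echo "NAME=1 && child  -> child sees: $seen_after_and"
echo "NAME=1 child     -> child sees: $seen_as_prefix"
```

The first line reports "unset" and the second reports "1". As far as I know, HIP_VISIBLE_DEVICES is read by the ROCm runtime when a program starts, so it would belong on the ./main command line and not on the make command line.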
bojak83318 commented 10 months ago

It only works with this make command:

make clean && LLAMA_HIPBLAS=1 make -j

and the output is

main: build = 1501 (a75fa57)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1699588357
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
  Device 1: Radeon RX 7900 XT, compute capability 11.0

but I get gibberish:

Write a function in TypeScript that sums numbers Geography#####################################################
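
When comparing builds or GPU configurations, this failure mode is easy to flag automatically. The helper below is only a sketch: the name looks_degenerate and the threshold of 20 consecutive '#' characters are arbitrary choices of mine, and the two sample strings are shortened versions of outputs posted in this thread.

```shell
# looks_degenerate TEXT
# Succeeds when TEXT contains a run of 20 or more '#' characters.
looks_degenerate() {
    printf '%s\n' "$1" | grep -q '#\{20,\}'
}

ten='##########'
bad="Write a function in TypeScript that sums numbers Geography${ten}${ten}${ten}"
good='tell me a very long story, one that can keep us both entertained'

for sample in "$bad" "$good"; do
    if looks_degenerate "$sample"; then
        echo "degenerate: $sample"
    else
        echo "looks fine: $sample"
    fi
done
```

The same function can be fed the captured output of ./main to decide whether a given run produced usable text.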
wizd commented 10 months ago

Sorry, my post was wrong: this line compiles without GPU support, so it runs on the CPU:

make clean && LLAMA_HIPLAS=1 && HIP_VISIBLE_DEVICES=0,1 make -j

So the bug is still there. With two GPUs turned on:

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 780.00 MB
llama_new_context_with_model: kv self size  =  780.00 MB
llama_build_graph: non-view tensors processed: 1384/1384
llama_new_context_with_model: compute buffer total size = 103.63 MB
llama_new_context_with_model: VRAM scratch buffer: 97.00 MB
llama_new_context_with_model: total VRAM used: 26160.57 MB (model: 25283.57 MB, context: 877.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 1000, n_keep = 0

bojak83318 commented 10 months ago

Yes, the problem appears when I use 2 GPUs.

bojak83318 commented 10 months ago

I posted this upstream on the RCCL GitHub.

bojak83318 commented 10 months ago

So it's just this that works with the GPU:

make clean && LLAMA_HIPBLAS=1 make -j

right?

wizd commented 10 months ago

> So it's just this that works with the GPU:
>
> make clean && LLAMA_HIPBLAS=1 make -j
>
> right?

No, compiling it like this only produces a build that runs on the CPU.

8XXD8 commented 10 months ago

I don't use HIP_VISIBLE_DEVICES, just

make clean

and it works for me. I also run ./main without HIP_VISIBLE_DEVICES, and it uses both cards. I'm on ROCm 5.7 and Ubuntu Server 22.04.3.

bojak83318 commented 10 months ago

> I don't use HIP_VISIBLE_DEVICES, just
>
> make clean
>
> and it works for me. I also run ./main without HIP_VISIBLE_DEVICES, and it uses both cards. I'm on ROCm 5.7 and Ubuntu Server 22.04.3.

Thanks, it's compiled with GPU support now.

Log start
main: build = 1501 (a75fa57)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1699596835
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
  Device 1: Radeon RX 7900 XT, compute capability 11.0

but ...

Write in python object class star wars and its charactersDesc##################################################################

Still getting gibberish.

bojak83318 commented 10 months ago

> I don't use HIP_VISIBLE_DEVICES, just
>
> make clean
>
> and it works for me. I also run ./main without HIP_VISIBLE_DEVICES, and it uses both cards. I'm on ROCm 5.7 and Ubuntu Server 22.04.3.

Can you perhaps outline the steps, like this guide, from start to finish?

bojak83318 commented 10 months ago

Apparently NVIDIA also has this problem.

ccbadd commented 10 months ago

HIP_VISIBLE_DEVICES is an environment variable read by ROCm at run time and has no reason to be part of the command line for compiling. Simply running "make LLAMA_HIPBLAS=1" is all that is required on my Ubuntu 22.04 server.
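
For anyone comparing the command forms posted earlier in the thread, here is a small shell sketch of why `VAR=1 && cmd` and `VAR=1 cmd` behave differently. It is only an illustration of general shell behaviour, not of llama.cpp's Makefile: `DEMO_FLAG` is a made-up variable name, and `sh -c` merely stands in for `make` or `./main`. Note also that the line quoted earlier spells the variable `LLAMA_HIPLAS` rather than `LLAMA_HIPBLAS`.

```shell
#!/bin/sh
# DEMO_FLAG is a hypothetical name used only for this demonstration.

# Form 1: "VAR=1 && cmd". The assignment is a separate command, so VAR is a
# plain shell variable that is not exported; the child process cannot see it.
form_and() {
    ( unset DEMO_FLAG; DEMO_FLAG=1 && sh -c 'echo "${DEMO_FLAG:-unset}"' )
}

# Form 2: "VAR=1 cmd". The assignment is a prefix of the command, so VAR is
# placed in the environment of that one child process.
form_prefix() {
    ( unset DEMO_FLAG; DEMO_FLAG=1 sh -c 'echo "${DEMO_FLAG:-unset}"' )
}

echo "VAR=1 && cmd -> child sees: $(form_and)"     # child sees: unset
echo "VAR=1 cmd    -> child sees: $(form_prefix)"  # child sees: 1
```

By the same shell rule, a run-time setting such as HIP_VISIBLE_DEVICES would be written as a prefix to the program being run (for example `HIP_VISIBLE_DEVICES=0 ./main ...` to expose only device 0) or exported beforehand, not chained with `&&` in front of `make`.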

xangelix commented 10 months ago

Unclear if this is related, but I can't load any model at all on multi-GPU ROCm.

Segmentation fault after model load for ROCm multi-GPU, multi-gfx. As best I can remember, it worked a couple of months ago, but it has now been broken for at least 2 weeks.

Tested on: Arch Linux Kernel 6.5.9, ROCm 5.7.1, llama.cpp

rocminfo
```
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name: AMD Ryzen 9 7950X 16-Core Processor
  Uuid: CPU-XX
  Marketing Name: AMD Ryzen 9 7950X 16-Core Processor
  Vendor Name: CPU
  Feature: None specified
  Profile: FULL_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 0(0x0)
  Queue Min Size: 0(0x0)
  Queue Max Size: 0(0x0)
  Queue Type: MULTI
  Node: 0
  Device Type: CPU
  Cache Info:
    L1: 32768(0x8000) KB
  Chip ID: 0(0x0)
  ASIC Revision: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 6021
  BDFID: 0
  Internal Node ID: 0
  Compute Unit: 32
  SIMDs per CU: 0
  Shader Engines: 0
  Shader Arrs. per Eng.: 0
  WatchPts on Addr. Ranges:1
  Features: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: FINE GRAINED
      Size: 65539100(0x3e80c1c) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 2
      Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size: 65539100(0x3e80c1c) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 3
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 65539100(0x3e80c1c) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
  ISA Info:
*******
Agent 2
*******
  Name: gfx1100
  Uuid: GPU-28b5961221d81024
  Marketing Name: AMD Radeon RX 7900 XTX
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 64(0x40)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 1
  Device Type: GPU
  Cache Info:
    L1: 32(0x20) KB
    L2: 6144(0x1800) KB
    L3: 98304(0x18000) KB
  Chip ID: 29772(0x744c)
  ASIC Revision: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 2526
  BDFID: 768
  Internal Node ID: 1
  Compute Unit: 96
  SIMDs per CU: 2
  Shader Engines: 6
  Shader Arrs. per Eng.: 2
  WatchPts on Addr. Ranges:4
  Features: KERNEL_DISPATCH
  Fast F16 Operation: TRUE
  Wavefront Size: 32(0x20)
  Workgroup Max Size: 1024(0x400)
  Workgroup Max Size per Dimension:
    x 1024(0x400)
    y 1024(0x400)
    z 1024(0x400)
  Max Waves Per CU: 32(0x20)
  Max Work-item Per CU: 1024(0x400)
  Grid Max Size: 4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x 4294967295(0xffffffff)
    y 4294967295(0xffffffff)
    z 4294967295(0xffffffff)
  Max fbarriers/Workgrp: 32
  Packet Processor uCode:: 528
  SDMA engine uCode:: 19
  IOMMU Support:: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 25149440(0x17fc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 2
      Segment: GLOBAL; FLAGS:
      Size: 25149440(0x17fc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 3
      Segment: GROUP
      Size: 64(0x40) KB
      Allocatable: FALSE
      Alloc Granule: 0KB
      Alloc Alignment: 0KB
      Accessible by all: FALSE
  ISA Info:
    ISA 1
      Name: amdgcn-amd-amdhsa--gfx1100
      Machine Models: HSA_MACHINE_MODEL_LARGE
      Profiles: HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Default Rounding Mode: NEAR
      Fast f16: TRUE
      Workgroup Max Size: 1024(0x400)
      Workgroup Max Size per Dimension:
        x 1024(0x400)
        y 1024(0x400)
        z 1024(0x400)
      Grid Max Size: 4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x 4294967295(0xffffffff)
        y 4294967295(0xffffffff)
        z 4294967295(0xffffffff)
      FBarrier Max Size: 32
*******
Agent 3
*******
  Name: gfx1030
  Uuid: GPU-8de346d621abe448
  Marketing Name: AMD Radeon RX 6900 XT
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 64(0x40)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 2
  Device Type: GPU
  Cache Info:
    L1: 16(0x10) KB
    L2: 4096(0x1000) KB
    L3: 131072(0x20000) KB
  Chip ID: 29615(0x73af)
  ASIC Revision: 1(0x1)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 2720
  BDFID: 1792
  Internal Node ID: 2
  Compute Unit: 80
  SIMDs per CU: 2
  Shader Engines: 4
  Shader Arrs. per Eng.: 2
  WatchPts on Addr. Ranges:4
  Features: KERNEL_DISPATCH
  Fast F16 Operation: TRUE
  Wavefront Size: 32(0x20)
  Workgroup Max Size: 1024(0x400)
  Workgroup Max Size per Dimension:
    x 1024(0x400)
    y 1024(0x400)
    z 1024(0x400)
  Max Waves Per CU: 32(0x20)
  Max Work-item Per CU: 1024(0x400)
  Grid Max Size: 4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x 4294967295(0xffffffff)
    y 4294967295(0xffffffff)
    z 4294967295(0xffffffff)
  Max fbarriers/Workgrp: 32
  Packet Processor uCode:: 115
  SDMA engine uCode:: 83
  IOMMU Support:: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 16760832(0xffc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 2
      Segment: GLOBAL; FLAGS:
      Size: 16760832(0xffc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 3
      Segment: GROUP
      Size: 64(0x40) KB
      Allocatable: FALSE
      Alloc Granule: 0KB
      Alloc Alignment: 0KB
      Accessible by all: FALSE
  ISA Info:
    ISA 1
      Name: amdgcn-amd-amdhsa--gfx1030
      Machine Models: HSA_MACHINE_MODEL_LARGE
      Profiles: HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Default Rounding Mode: NEAR
      Fast f16: TRUE
      Workgroup Max Size: 1024(0x400)
      Workgroup Max Size per Dimension:
        x 1024(0x400)
        y 1024(0x400)
        z 1024(0x400)
      Grid Max Size: 4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x 4294967295(0xffffffff)
        y 4294967295(0xffffffff)
        z 4294967295(0xffffffff)
      FBarrier Max Size: 32
*** Done ***
```
make LLAMA_HIPBLAS=1
```
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: unknown
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native
I CXXFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native
I NVCCFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native "
I LDFLAGS: -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
I CC: cc (GCC) 13.2.1 20230801
I CXX: g++ (GCC) 13.2.1 20230801

cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c llama.cpp -o llama.o
g++ -I.
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/common.cpp -o common.o g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/sampling.cpp -o sampling.o g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/grammar-parser.cpp -o grammar-parser.o g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/build-info.cpp -o build-info.o g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/console.cpp -o console.o /opt/rocm/bin/hipcc -I. 
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native --offload-arch=gfx1100 --offload-arch=gfx1030 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml-cuda.o warning: unknown warning option '-Wno-format-truncation' [-Wunknown-warning-option] warning: cast from 'const signed char *' to 'unsigned short *' drops const qualifier [-Wcast-qual] const uint16_t * x16 = (uint16_t *) (x8 + sizeof(int) * i32); // assume at least 2 byte alignment ^ warning: cast from 'const unsigned char *' to 'unsigned short *' drops const qualifier [-Wcast-qual] const uint16_t * x16 = (uint16_t *) (x8 + sizeof(int) * i32); // assume at least 2 byte alignment ^ warning: cast from 'const signed char *' to 'int *' drops const qualifier [-Wcast-qual] return *((int *) (x8 + sizeof(int) * i32)); // assume at least 4 byte alignment ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] return *((int *) (x8 + sizeof(int) * i32)); // assume at least 4 byte alignment ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] static cudaStream_t g_cudaStreams[GGML_CUDA_MAX_DEVICES][MAX_STREAMS] = { nullptr }; ^~~~~~~ { } warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const 
void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: cast from 'const __half2 *' to 'float *' drops const qualifier [-Wcast-qual] const float * x_dmf = (float *) x_dm; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' 
[-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused 
parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q8_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q8_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q2_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * 
__restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ warning: cast from 'const int *' to 'signed char *' drops const qualifier [-Wcast-qual] const int8_t * scales = ((int8_t *) (x_sc + i * (WARP_SIZE/4) + i/4 + kbx*4)) + ky/4; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ 
x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q6_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused variable 'size_src0_ddq' [-Wunused-variable] const size_t size_src0_ddq = split ? (row_high[id]-row_low[id])*ne00 * src0_ts/src0_bs : ggml_nbytes(src0); ^ warning: unused variable 'nrows0' [-Wunused-variable] const int64_t nrows0 = ggml_nrows(src0); ^ warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn] static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) { ^ warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn] static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) { ^ warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_0<128, 8, false>' requested here load_tiles_q4_0, VDR_Q4_0_Q8_1_MMQ, vec_dot_q4_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float 
sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_0, 64, 128, 8, &allocate_tiles_q4_0, &load_tiles_q4_0, 4, &vec_dot_q4_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_0<128, 8, true>' requested here load_tiles_q4_0, VDR_Q4_0_Q8_1_MMQ, vec_dot_q4_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_0, 64, 128, 8, &allocate_tiles_q4_0, &load_tiles_q4_0, 4, &vec_dot_q4_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_1<128, 8, false>' requested here load_tiles_q4_1, VDR_Q4_1_Q8_1_MMQ, vec_dot_q4_1_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_1, 64, 128, 8, &allocate_tiles_q4_1, &load_tiles_q4_1, 4, &vec_dot_q4_1_q8_1_mul_mat>' requested here 
mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_1<128, 8, true>' requested here load_tiles_q4_1, VDR_Q4_1_Q8_1_MMQ, vec_dot_q4_1_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_1, 64, 128, 8, &allocate_tiles_q4_1, &load_tiles_q4_1, 4, &vec_dot_q4_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_0<128, 8, false>' requested here load_tiles_q5_0, VDR_Q5_0_Q8_1_MMQ, vec_dot_q5_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, false, block_q5_0, 64, 128, 8, &allocate_tiles_q5_0, &load_tiles_q5_0, 4, &vec_dot_q5_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ 
note: in instantiation of function template specialization 'load_tiles_q5_0<128, 8, true>' requested here load_tiles_q5_0, VDR_Q5_0_Q8_1_MMQ, vec_dot_q5_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, false, block_q5_0, 64, 128, 8, &allocate_tiles_q5_0, &load_tiles_q5_0, 4, &vec_dot_q5_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_1<128, 8, false>' requested here load_tiles_q5_1, VDR_Q5_1_Q8_1_MMQ, vec_dot_q5_1_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q5_1, 64, 128, 8, &allocate_tiles_q5_1, &load_tiles_q5_1, 4, &vec_dot_q5_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_1<128, 8, true>' requested here load_tiles_q5_1, VDR_Q5_1_Q8_1_MMQ, vec_dot_q5_1_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here 
mul_mat_q5_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q5_1, 64, 128, 8, &allocate_tiles_q5_1, &load_tiles_q5_1, 4, &vec_dot_q5_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q8_0<128, 8, false>' requested here load_tiles_q8_0, VDR_Q8_0_Q8_1_MMQ, vec_dot_q8_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 1, 8, false, block_q8_0, 64, 128, 8, &allocate_tiles_q8_0, &load_tiles_q8_0, 8, &vec_dot_q8_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q8_0<128, 8, true>' requested here load_tiles_q8_0, VDR_Q8_0_Q8_1_MMQ, vec_dot_q8_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 1, 8, false, block_q8_0, 64, 
128, 8, &allocate_tiles_q8_0, &load_tiles_q8_0, 8, &vec_dot_q8_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q2_K<128, 8, false>' requested here load_tiles_q2_K, VDR_Q2_K_Q8_1_MMQ, vec_dot_q2_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q2_K, 64, 128, 8, &allocate_tiles_q2_K, &load_tiles_q2_K, 2, &vec_dot_q2_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q2_K<128, 8, true>' requested here load_tiles_q2_K, VDR_Q2_K_Q8_1_MMQ, vec_dot_q2_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q2_K, 64, 128, 8, &allocate_tiles_q2_K, &load_tiles_q2_K, 2, &vec_dot_q2_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: cast from 'const void *' to 
'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q3_K<64, 8, false>' requested here load_tiles_q3_K, VDR_Q3_K_Q8_1_MMQ, vec_dot_q3_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q3_K, 128, 64, 8, &allocate_tiles_q3_K, &load_tiles_q3_K, 2, &vec_dot_q3_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q3_K<64, 8, true>' requested here load_tiles_q3_K, VDR_Q3_K_Q8_1_MMQ, vec_dot_q3_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q3_K, 128, 64, 8, &allocate_tiles_q3_K, &load_tiles_q3_K, 2, &vec_dot_q3_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_K<128, 8, false>' requested here load_tiles_q4_K, VDR_Q4_K_Q8_1_MMQ, 
vec_dot_q4_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q4_K, 64, 128, 8, &allocate_tiles_q4_K, &load_tiles_q4_K, 8, &vec_dot_q4_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_K<128, 8, true>' requested here load_tiles_q4_K, VDR_Q4_K_Q8_1_MMQ, vec_dot_q4_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q4_K, 64, 128, 8, &allocate_tiles_q4_K, &load_tiles_q4_K, 8, &vec_dot_q4_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_K<128, 8, false>' requested here load_tiles_q5_K, 
VDR_Q5_K_Q8_1_MMQ, vec_dot_q5_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q5_K, 64, 128, 8, &allocate_tiles_q5_K, &load_tiles_q5_K, 8, &vec_dot_q5_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_K<128, 8, true>' requested here load_tiles_q5_K, VDR_Q5_K_Q8_1_MMQ, vec_dot_q5_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q5_K, 64, 128, 8, &allocate_tiles_q5_K, &load_tiles_q5_K, 8, &vec_dot_q5_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q6_K<128, 8, false>' requested here 
load_tiles_q6_K, VDR_Q6_K_Q8_1_MMQ, vec_dot_q6_K_q8_1_mul_mat>
^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
mul_mat_q6_K<<>>
^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f};
^~~~
{ }
note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, false, block_q6_K, 64, 128, 8, &allocate_tiles_q6_K, &load_tiles_q6_K, 8, &vec_dot_q6_K_q8_1_mul_mat>' requested here
mul_mat_q,
^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
mul_mat_q6_K<<>>
^
warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual]
const block_q6_K * bx0 = (block_q6_K *) vx;
^
note: in instantiation of function template specialization 'load_tiles_q6_K<128, 8, true>' requested here
load_tiles_q6_K, VDR_Q6_K_Q8_1_MMQ, vec_dot_q6_K_q8_1_mul_mat>
^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
mul_mat_q6_K<<>>
^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f};
^~~~
{ }
note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, false, block_q6_K, 64, 128, 8, &allocate_tiles_q6_K, &load_tiles_q6_K, 8, &vec_dot_q6_K_q8_1_mul_mat>' requested here
mul_mat_q,
^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
mul_mat_q6_K<<>>
^
110 warnings generated when compiling for gfx1030.
warning: unknown warning option '-Wno-format-truncation' [-Wunknown-warning-option]
warning: cast from 'const signed char *' to 'unsigned short *' drops const qualifier [-Wcast-qual]
const uint16_t * x16 = (uint16_t *) (x8 + sizeof(int) * i32); // assume at least 2 byte alignment
^
warning: cast from 'const unsigned char *' to 'unsigned short *' drops const qualifier [-Wcast-qual]
const uint16_t * x16 = (uint16_t *) (x8 + sizeof(int) * i32); // assume at least 2 byte alignment
^
warning: cast from 'const signed char *' to 'int *' drops const qualifier [-Wcast-qual]
return *((int *) (x8 + sizeof(int) * i32)); // assume at least 4 byte alignment
^
warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual]
return *((int *) (x8 + sizeof(int) * i32)); // assume at least 4 byte alignment
^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
static cudaStream_t g_cudaStreams[GGML_CUDA_MAX_DEVICES][MAX_STREAMS] = { nullptr };
^~~~~~~
{ }
warning: unused parameter 'x_qh' [-Wunused-parameter]
template static __device__ __forceinline__ void allocate_tiles_q4_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) {
^
warning: unused parameter 'x_sc' [-Wunused-parameter]
template static __device__ __forceinline__ void allocate_tiles_q4_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) {
^
warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual]
const block_q4_0 * bx0 = (block_q4_0 *) vx;
^
warning: unused parameter 'x_qh' [-Wunused-parameter]
const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh,
^
warning: unused parameter 'x_sc' [-Wunused-parameter]
int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) {
^
warning: cast from 'const __half2 *' to 'float *' drops const qualifier [-Wcast-qual]
const float * x_dmf = (float *) x_dm;
^
warning: unused parameter 'x_qh'
[-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = 
(block_q5_0 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ 
void allocate_tiles_q8_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q8_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q2_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ warning: cast from 'const int *' to 'signed char *' drops const qualifier [-Wcast-qual] const int8_t * scales = 
((int8_t *) (x_sc + i * (WARP_SIZE/4) + i/4 + kbx*4)) + ky/4; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q6_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ warning: unused parameter 'x_qh' 
[-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused variable 'size_src0_ddq' [-Wunused-variable] const size_t size_src0_ddq = split ? (row_high[id]-row_low[id])*ne00 * src0_ts/src0_bs : ggml_nbytes(src0); ^ warning: unused variable 'nrows0' [-Wunused-variable] const int64_t nrows0 = ggml_nrows(src0); ^ warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn] static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) { ^ warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn] static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) { ^ warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_0<128, 8, false>' requested here load_tiles_q4_0, VDR_Q4_0_Q8_1_MMQ, vec_dot_q4_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_0, 64, 128, 8, &allocate_tiles_q4_0, &load_tiles_q4_0, 4, &vec_dot_q4_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const 
block_q4_0 * bx0 = (block_q4_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_0<128, 8, true>' requested here load_tiles_q4_0, VDR_Q4_0_Q8_1_MMQ, vec_dot_q4_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_0, 64, 128, 8, &allocate_tiles_q4_0, &load_tiles_q4_0, 4, &vec_dot_q4_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_1<128, 8, false>' requested here load_tiles_q4_1, VDR_Q4_1_Q8_1_MMQ, vec_dot_q4_1_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_1, 64, 128, 8, &allocate_tiles_q4_1, &load_tiles_q4_1, 4, &vec_dot_q4_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_1<128, 8, true>' requested here load_tiles_q4_1, VDR_Q4_1_Q8_1_MMQ, vec_dot_q4_1_q8_1_mul_mat> ^ note: in instantiation of function template 
specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_1, 64, 128, 8, &allocate_tiles_q4_1, &load_tiles_q4_1, 4, &vec_dot_q4_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_0<128, 8, false>' requested here load_tiles_q5_0, VDR_Q5_0_Q8_1_MMQ, vec_dot_q5_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, false, block_q5_0, 64, 128, 8, &allocate_tiles_q5_0, &load_tiles_q5_0, 4, &vec_dot_q5_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_0<128, 8, true>' requested here load_tiles_q5_0, VDR_Q5_0_Q8_1_MMQ, vec_dot_q5_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template 
specialization 'mul_mat_q<32, 2, 4, false, block_q5_0, 64, 128, 8, &allocate_tiles_q5_0, &load_tiles_q5_0, 4, &vec_dot_q5_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_1<128, 8, false>' requested here load_tiles_q5_1, VDR_Q5_1_Q8_1_MMQ, vec_dot_q5_1_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q5_1, 64, 128, 8, &allocate_tiles_q5_1, &load_tiles_q5_1, 4, &vec_dot_q5_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_1<128, 8, true>' requested here load_tiles_q5_1, VDR_Q5_1_Q8_1_MMQ, vec_dot_q5_1_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q5_1, 64, 128, 8, &allocate_tiles_q5_1, &load_tiles_q5_1, 4, &vec_dot_q5_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here 
mul_mat_q5_1<<>> ^ warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q8_0<128, 8, false>' requested here load_tiles_q8_0, VDR_Q8_0_Q8_1_MMQ, vec_dot_q8_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 1, 8, false, block_q8_0, 64, 128, 8, &allocate_tiles_q8_0, &load_tiles_q8_0, 8, &vec_dot_q8_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q8_0<128, 8, true>' requested here load_tiles_q8_0, VDR_Q8_0_Q8_1_MMQ, vec_dot_q8_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 1, 8, false, block_q8_0, 64, 128, 8, &allocate_tiles_q8_0, &load_tiles_q8_0, 8, &vec_dot_q8_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q2_K<128, 8, false>' requested 
here load_tiles_q2_K, VDR_Q2_K_Q8_1_MMQ, vec_dot_q2_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q2_K, 64, 128, 8, &allocate_tiles_q2_K, &load_tiles_q2_K, 2, &vec_dot_q2_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q2_K<128, 8, true>' requested here load_tiles_q2_K, VDR_Q2_K_Q8_1_MMQ, vec_dot_q2_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q2_K, 64, 128, 8, &allocate_tiles_q2_K, &load_tiles_q2_K, 2, &vec_dot_q2_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q3_K<64, 8, false>' requested here load_tiles_q3_K, VDR_Q3_K_Q8_1_MMQ, vec_dot_q3_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] 
float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q3_K, 128, 64, 8, &allocate_tiles_q3_K, &load_tiles_q3_K, 2, &vec_dot_q3_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q3_K<64, 8, true>' requested here load_tiles_q3_K, VDR_Q3_K_Q8_1_MMQ, vec_dot_q3_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q3_K, 128, 64, 8, &allocate_tiles_q3_K, &load_tiles_q3_K, 2, &vec_dot_q3_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_K<128, 8, false>' requested here load_tiles_q4_K, VDR_Q4_K_Q8_1_MMQ, vec_dot_q4_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 
'mul_mat_q<256, 2, 32, true, block_q4_K, 64, 128, 8, &allocate_tiles_q4_K, &load_tiles_q4_K, 8, &vec_dot_q4_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_K<128, 8, true>' requested here load_tiles_q4_K, VDR_Q4_K_Q8_1_MMQ, vec_dot_q4_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q4_K, 64, 128, 8, &allocate_tiles_q4_K, &load_tiles_q4_K, 8, &vec_dot_q4_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_K' requested here mul_mat_q4_K<<>> ^ warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_K<128, 8, false>' requested here load_tiles_q5_K, VDR_Q5_K_Q8_1_MMQ, vec_dot_q5_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template 
specialization 'mul_mat_q<256, 2, 32, true, block_q5_K, 64, 128, 8, &allocate_tiles_q5_K, &load_tiles_q5_K, 8, &vec_dot_q5_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_K<128, 8, true>' requested here load_tiles_q5_K, VDR_Q5_K_Q8_1_MMQ, vec_dot_q5_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q5_K, 64, 128, 8, &allocate_tiles_q5_K, &load_tiles_q5_K, 8, &vec_dot_q5_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_K' requested here mul_mat_q5_K<<>> ^ warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q6_K<128, 8, false>' requested here load_tiles_q6_K, VDR_Q6_K_Q8_1_MMQ, vec_dot_q6_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q6_K' requested here mul_mat_q6_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, false, block_q6_K, 64, 128, 8, &allocate_tiles_q6_K, &load_tiles_q6_K, 8, 
&vec_dot_q6_K_q8_1_mul_mat>' requested here
mul_mat_q, ^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
mul_mat_q6_K<<>> ^
warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual]
const block_q6_K * bx0 = (block_q6_K *) vx; ^
note: in instantiation of function template specialization 'load_tiles_q6_K<128, 8, true>' requested here
load_tiles_q6_K, VDR_Q6_K_Q8_1_MMQ, vec_dot_q6_K_q8_1_mul_mat> ^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
mul_mat_q6_K<<>> ^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { }
note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, false, block_q6_K, 64, 128, 8, &allocate_tiles_q6_K, &load_tiles_q6_K, 8, &vec_dot_q6_K_q8_1_mul_mat>' requested here
mul_mat_q, ^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
mul_mat_q6_K<<>> ^
110 warnings generated when compiling for gfx1100.
warning: unknown warning option '-Wno-format-truncation' [-Wunknown-warning-option] warning: cast from 'const signed char *' to 'unsigned short *' drops const qualifier [-Wcast-qual] const uint16_t * x16 = (uint16_t *) (x8 + sizeof(int) * i32); // assume at least 2 byte alignment ^ warning: cast from 'const unsigned char *' to 'unsigned short *' drops const qualifier [-Wcast-qual] const uint16_t * x16 = (uint16_t *) (x8 + sizeof(int) * i32); // assume at least 2 byte alignment ^ warning: cast from 'const signed char *' to 'int *' drops const qualifier [-Wcast-qual] return *((int *) (x8 + sizeof(int) * i32)); // assume at least 4 byte alignment ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] return *((int *) (x8 + sizeof(int) * i32)); // assume at least 4 byte alignment ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] static cudaStream_t g_cudaStreams[GGML_CUDA_MAX_DEVICES][MAX_STREAMS] = { nullptr }; ^~~~~~~ { } warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: cast from 'const __half2 *' to 'float *' drops const qualifier [-Wcast-qual] const float * x_dmf = (float *) x_dm; ^ warning: unused parameter 'x_qh' 
[-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = 
(block_q5_0 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_1(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ 
void allocate_tiles_q8_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: unused parameter 'x_sc' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q8_0(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] int * __restrict__ x_sc, const int & i_offset, const int & i_max, const int & k, const int & blocks_per_row) { ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_sc' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q2_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ warning: cast from 'const int *' to 'signed char *' drops const qualifier [-Wcast-qual] const int8_t * scales = 
((int8_t *) (x_sc + i * (WARP_SIZE/4) + i/4 + kbx*4)) + ky/4; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q4_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual] const block_q4_K * bx0 = (block_q4_K *) vx; ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q5_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual] const block_q5_K * bx0 = (block_q5_K *) vx; ^ warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual] const int * scales = (int *) bxi->scales; ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] template static __device__ __forceinline__ void allocate_tiles_q6_K(int ** x_ql, half2 ** x_dm, int ** x_qh, int ** x_sc) { ^ warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual] const block_q6_K * bx0 = (block_q6_K *) vx; ^ warning: unused parameter 'x_qh' 
[-Wunused-parameter] const void * __restrict__ vx, int * __restrict__ x_ql, half2 * __restrict__ x_dm, int * __restrict__ x_qh, ^ warning: unused parameter 'x_qh' [-Wunused-parameter] const int * __restrict__ x_ql, const half2 * __restrict__ x_dm, const int * __restrict__ x_qh, const int * __restrict__ x_sc, ^ warning: unused variable 'size_src0_ddq' [-Wunused-variable] const size_t size_src0_ddq = split ? (row_high[id]-row_low[id])*ne00 * src0_ts/src0_bs : ggml_nbytes(src0); ^ warning: unused variable 'nrows0' [-Wunused-variable] const int64_t nrows0 = ggml_nrows(src0); ^ warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn] static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) { ^ warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn] static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) { ^ warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const block_q4_0 * bx0 = (block_q4_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_0<64, 8, false>' requested here load_tiles_q4_0, VDR_Q4_0_Q8_1_MMQ, vec_dot_q4_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_0, 64, 64, 8, &allocate_tiles_q4_0, &load_tiles_q4_0, 4, &vec_dot_q4_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: cast from 'const void *' to 'block_q4_0 *' drops const qualifier [-Wcast-qual] const 
block_q4_0 * bx0 = (block_q4_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_0<64, 8, true>' requested here load_tiles_q4_0, VDR_Q4_0_Q8_1_MMQ, vec_dot_q4_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_0, 64, 64, 8, &allocate_tiles_q4_0, &load_tiles_q4_0, 4, &vec_dot_q4_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_0' requested here mul_mat_q4_0<<>> ^ warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_1<64, 8, false>' requested here load_tiles_q4_1, VDR_Q4_1_Q8_1_MMQ, vec_dot_q4_1_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_1, 64, 64, 8, &allocate_tiles_q4_1, &load_tiles_q4_1, 4, &vec_dot_q4_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: cast from 'const void *' to 'block_q4_1 *' drops const qualifier [-Wcast-qual] const block_q4_1 * bx0 = (block_q4_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q4_1<64, 8, true>' requested here load_tiles_q4_1, VDR_Q4_1_Q8_1_MMQ, vec_dot_q4_1_q8_1_mul_mat> ^ note: in instantiation of function template 
specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q4_1, 64, 64, 8, &allocate_tiles_q4_1, &load_tiles_q4_1, 4, &vec_dot_q4_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q4_1' requested here mul_mat_q4_1<<>> ^ warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_0<64, 8, false>' requested here load_tiles_q5_0, VDR_Q5_0_Q8_1_MMQ, vec_dot_q5_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, false, block_q5_0, 64, 64, 8, &allocate_tiles_q5_0, &load_tiles_q5_0, 4, &vec_dot_q5_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: cast from 'const void *' to 'block_q5_0 *' drops const qualifier [-Wcast-qual] const block_q5_0 * bx0 = (block_q5_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_0<64, 8, true>' requested here load_tiles_q5_0, VDR_Q5_0_Q8_1_MMQ, vec_dot_q5_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 
'mul_mat_q<32, 2, 4, false, block_q5_0, 64, 64, 8, &allocate_tiles_q5_0, &load_tiles_q5_0, 4, &vec_dot_q5_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_0' requested here mul_mat_q5_0<<>> ^ warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_1<64, 8, false>' requested here load_tiles_q5_1, VDR_Q5_1_Q8_1_MMQ, vec_dot_q5_1_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q5_1, 64, 64, 8, &allocate_tiles_q5_1, &load_tiles_q5_1, 4, &vec_dot_q5_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ warning: cast from 'const void *' to 'block_q5_1 *' drops const qualifier [-Wcast-qual] const block_q5_1 * bx0 = (block_q5_1 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q5_1<64, 8, true>' requested here load_tiles_q5_1, VDR_Q5_1_Q8_1_MMQ, vec_dot_q5_1_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 2, 4, true, block_q5_1, 64, 64, 8, &allocate_tiles_q5_1, &load_tiles_q5_1, 4, &vec_dot_q5_1_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q5_1' requested here mul_mat_q5_1<<>> ^ 
warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q8_0<64, 8, false>' requested here load_tiles_q8_0, VDR_Q8_0_Q8_1_MMQ, vec_dot_q8_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 1, 8, false, block_q8_0, 64, 64, 8, &allocate_tiles_q8_0, &load_tiles_q8_0, 8, &vec_dot_q8_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: cast from 'const void *' to 'block_q8_0 *' drops const qualifier [-Wcast-qual] const block_q8_0 * bx0 = (block_q8_0 *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q8_0<64, 8, true>' requested here load_tiles_q8_0, VDR_Q8_0_Q8_1_MMQ, vec_dot_q8_0_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<32, 1, 8, false, block_q8_0, 64, 64, 8, &allocate_tiles_q8_0, &load_tiles_q8_0, 8, &vec_dot_q8_0_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q8_0' requested here mul_mat_q8_0<<>> ^ warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q2_K<32, 8, false>' requested here load_tiles_q2_K, 
VDR_Q2_K_Q8_1_MMQ, vec_dot_q2_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q2_K, 128, 32, 8, &allocate_tiles_q2_K, &load_tiles_q2_K, 2, &vec_dot_q2_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: cast from 'const void *' to 'block_q2_K *' drops const qualifier [-Wcast-qual] const block_q2_K * bx0 = (block_q2_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q2_K<32, 8, true>' requested here load_tiles_q2_K, VDR_Q2_K_Q8_1_MMQ, vec_dot_q2_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { } note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q2_K, 128, 32, 8, &allocate_tiles_q2_K, &load_tiles_q2_K, 2, &vec_dot_q2_K_q8_1_mul_mat>' requested here mul_mat_q, ^ note: in instantiation of function template specialization 'mul_mat_q2_K' requested here mul_mat_q2_K<<>> ^ warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual] const block_q3_K * bx0 = (block_q3_K *) vx; ^ note: in instantiation of function template specialization 'load_tiles_q3_K<128, 8, false>' requested here load_tiles_q3_K, VDR_Q3_K_Q8_1_MMQ, vec_dot_q3_K_q8_1_mul_mat> ^ note: in instantiation of function template specialization 'mul_mat_q3_K' requested here mul_mat_q3_K<<>> ^ warning: suggest braces around initialization of subobject [-Wmissing-braces] float 
sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { }
note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q3_K, 32, 128, 8, &allocate_tiles_q3_K, &load_tiles_q3_K, 2, &vec_dot_q3_K_q8_1_mul_mat>' requested here
    mul_mat_q, ^
note: in instantiation of function template specialization 'mul_mat_q3_K' requested here
    mul_mat_q3_K<<>> ^
warning: cast from 'const void *' to 'block_q3_K *' drops const qualifier [-Wcast-qual]
    const block_q3_K * bx0 = (block_q3_K *) vx; ^
note: in instantiation of function template specialization 'load_tiles_q3_K<128, 8, true>' requested here
    load_tiles_q3_K, VDR_Q3_K_Q8_1_MMQ, vec_dot_q3_K_q8_1_mul_mat> ^
note: in instantiation of function template specialization 'mul_mat_q3_K' requested here
    mul_mat_q3_K<<>> ^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
    float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { }
note: in instantiation of function template specialization 'mul_mat_q<256, 4, 16, false, block_q3_K, 32, 128, 8, &allocate_tiles_q3_K, &load_tiles_q3_K, 2, &vec_dot_q3_K_q8_1_mul_mat>' requested here
    mul_mat_q, ^
note: in instantiation of function template specialization 'mul_mat_q3_K' requested here
    mul_mat_q3_K<<>> ^
warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual]
    const block_q4_K * bx0 = (block_q4_K *) vx; ^
note: in instantiation of function template specialization 'load_tiles_q4_K<64, 8, false>' requested here
    load_tiles_q4_K, VDR_Q4_K_Q8_1_MMQ, vec_dot_q4_K_q8_1_mul_mat> ^
note: in instantiation of function template specialization 'mul_mat_q4_K' requested here
    mul_mat_q4_K<<>> ^
warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual]
    const int * scales = (int *) bxi->scales; ^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
    float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { }
note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q4_K, 32, 64, 8, &allocate_tiles_q4_K, &load_tiles_q4_K, 8, &vec_dot_q4_K_q8_1_mul_mat>' requested here
    mul_mat_q, ^
note: in instantiation of function template specialization 'mul_mat_q4_K' requested here
    mul_mat_q4_K<<>> ^
warning: cast from 'const void *' to 'block_q4_K *' drops const qualifier [-Wcast-qual]
    const block_q4_K * bx0 = (block_q4_K *) vx; ^
note: in instantiation of function template specialization 'load_tiles_q4_K<64, 8, true>' requested here
    load_tiles_q4_K, VDR_Q4_K_Q8_1_MMQ, vec_dot_q4_K_q8_1_mul_mat> ^
note: in instantiation of function template specialization 'mul_mat_q4_K' requested here
    mul_mat_q4_K<<>> ^
warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual]
    const int * scales = (int *) bxi->scales; ^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
    float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { }
note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q4_K, 32, 64, 8, &allocate_tiles_q4_K, &load_tiles_q4_K, 8, &vec_dot_q4_K_q8_1_mul_mat>' requested here
    mul_mat_q, ^
note: in instantiation of function template specialization 'mul_mat_q4_K' requested here
    mul_mat_q4_K<<>> ^
warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual]
    const block_q5_K * bx0 = (block_q5_K *) vx; ^
note: in instantiation of function template specialization 'load_tiles_q5_K<64, 8, false>' requested here
    load_tiles_q5_K, VDR_Q5_K_Q8_1_MMQ, vec_dot_q5_K_q8_1_mul_mat> ^
note: in instantiation of function template specialization 'mul_mat_q5_K' requested here
    mul_mat_q5_K<<>> ^
warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual]
    const int * scales = (int *) bxi->scales; ^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
    float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { }
note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q5_K, 32, 64, 8, &allocate_tiles_q5_K, &load_tiles_q5_K, 8, &vec_dot_q5_K_q8_1_mul_mat>' requested here
    mul_mat_q, ^
note: in instantiation of function template specialization 'mul_mat_q5_K' requested here
    mul_mat_q5_K<<>> ^
warning: cast from 'const void *' to 'block_q5_K *' drops const qualifier [-Wcast-qual]
    const block_q5_K * bx0 = (block_q5_K *) vx; ^
note: in instantiation of function template specialization 'load_tiles_q5_K<64, 8, true>' requested here
    load_tiles_q5_K, VDR_Q5_K_Q8_1_MMQ, vec_dot_q5_K_q8_1_mul_mat> ^
note: in instantiation of function template specialization 'mul_mat_q5_K' requested here
    mul_mat_q5_K<<>> ^
warning: cast from 'const unsigned char *' to 'int *' drops const qualifier [-Wcast-qual]
    const int * scales = (int *) bxi->scales; ^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
    float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { }
note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, true, block_q5_K, 32, 64, 8, &allocate_tiles_q5_K, &load_tiles_q5_K, 8, &vec_dot_q5_K_q8_1_mul_mat>' requested here
    mul_mat_q, ^
note: in instantiation of function template specialization 'mul_mat_q5_K' requested here
    mul_mat_q5_K<<>> ^
warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual]
    const block_q6_K * bx0 = (block_q6_K *) vx; ^
note: in instantiation of function template specialization 'load_tiles_q6_K<64, 8, false>' requested here
    load_tiles_q6_K, VDR_Q6_K_Q8_1_MMQ, vec_dot_q6_K_q8_1_mul_mat> ^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
    mul_mat_q6_K<<>> ^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
    float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { }
note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, false, block_q6_K, 32, 64, 8, &allocate_tiles_q6_K, &load_tiles_q6_K, 8, &vec_dot_q6_K_q8_1_mul_mat>' requested here
    mul_mat_q, ^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
    mul_mat_q6_K<<>> ^
warning: cast from 'const void *' to 'block_q6_K *' drops const qualifier [-Wcast-qual]
    const block_q6_K * bx0 = (block_q6_K *) vx; ^
note: in instantiation of function template specialization 'load_tiles_q6_K<64, 8, true>' requested here
    load_tiles_q6_K, VDR_Q6_K_Q8_1_MMQ, vec_dot_q6_K_q8_1_mul_mat> ^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
    mul_mat_q6_K<<>> ^
warning: suggest braces around initialization of subobject [-Wmissing-braces]
    float sum[mmq_y/WARP_SIZE][mmq_x/nwarps] = {0.0f}; ^~~~ { }
note: in instantiation of function template specialization 'mul_mat_q<256, 2, 32, false, block_q6_K, 32, 64, 8, &allocate_tiles_q6_K, &load_tiles_q6_K, 8, &vec_dot_q6_K_q8_1_mul_mat>' requested here
    mul_mat_q, ^
note: in instantiation of function template specialization 'mul_mat_q6_K' requested here
    mul_mat_q6_K<<>> ^
110 warnings generated when compiling for host.
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c ggml-alloc.c -o ggml-alloc.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c ggml-backend.c -o ggml-backend.o
cc -I.
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c ggml-quants.c -o ggml-quants.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o main -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
==== Run ./main -h for help. ====
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/quantize/quantize.cpp build-info.o ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o quantize -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp build-info.o ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o quantize-stats -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o perplexity -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o embedding -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o vdot -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native pocs/vdot/q8dot.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o q8dot -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/train.cpp -o train.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o train-text-from-scratch -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o convert-llama2c-to-ggml -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/simple/simple.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o simple -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/batched/batched.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o batched -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/batched-bench/batched-bench.cpp build-info.o ggml.o llama.o common.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o batched-bench -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I.
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/save-load-state/save-load-state.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o save-load-state -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -Iexamples/server examples/server/server.cpp examples/llava/clip.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o server -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas -Wno-cast-qual
In copy constructor ‘task_result::task_result(const task_result&)’,
    inlined from ‘void std::__new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_result; _Args = {const task_result&}; _Tp = task_result]’ at /usr/include/c++/13.2.1/bits/new_allocator.h:187:4,
    inlined from ‘static void std::allocator_traits >::construct(allocator_type&, _Up*, _Args&& ...) [with _Up = task_result; _Args = {const task_result&}; _Tp = task_result]’ at /usr/include/c++/13.2.1/bits/alloc_traits.h:537:17,
    inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_result; _Alloc = std::allocator]’ at /usr/include/c++/13.2.1/bits/stl_vector.h:1283:30,
    inlined from ‘void llama_server_context::send_error(int, std::string)’ at examples/server/server.cpp:1097:32:
examples/server/server.cpp:154:8: warning: ‘res.task_result::stop’ may be used uninitialized [-Wmaybe-uninitialized]
  154 | struct task_result {
      | ^~~~~~~~~~~
examples/server/server.cpp: In member function ‘void llama_server_context::send_error(int, std::string)’:
examples/server/server.cpp:1093:21: note: ‘res’ declared here
 1093 | task_result res;
      | ^~~
In copy constructor ‘task_server::task_server(const task_server&)’,
    inlined from ‘void std::__new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/new_allocator.h:187:4,
    inlined from ‘static void std::allocator_traits >::construct(allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/alloc_traits.h:537:17,
    inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator]’ at /usr/include/c++/13.2.1/bits/stl_vector.h:1283:30,
    inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at examples/server/server.cpp:1259:30,
    inlined from ‘main(int, char**)::’ at examples/server/server.cpp:2333:61:
examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized]
  145 | struct task_server {
      | ^~~~~~~~~~~
examples/server/server.cpp: In lambda function:
examples/server/server.cpp:1253:21: note: ‘task’ declared here
 1253 | task_server task;
      | ^~~~
In copy constructor ‘task_server::task_server(const task_server&)’,
    inlined from ‘void std::__new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/new_allocator.h:187:4,
    inlined from ‘static void std::allocator_traits >::construct(allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/alloc_traits.h:537:17,
    inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator]’ at /usr/include/c++/13.2.1/bits/stl_vector.h:1283:30,
    inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at examples/server/server.cpp:1259:30,
    inlined from ‘main(int, char**)::’ at examples/server/server.cpp:2388:61:
examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized]
  145 | struct task_server {
      | ^~~~~~~~~~~
examples/server/server.cpp: In lambda function:
examples/server/server.cpp:1253:21: note: ‘task’ declared here
 1253 | task_server task;
      | ^~~~
In copy constructor ‘task_server::task_server(const task_server&)’,
    inlined from ‘void std::__new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/new_allocator.h:187:4,
    inlined from ‘static void std::allocator_traits >::construct(allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/13.2.1/bits/alloc_traits.h:537:17,
    inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator]’ at /usr/include/c++/13.2.1/bits/stl_vector.h:1283:30,
    inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at examples/server/server.cpp:1259:30,
    inlined from ‘main(int, char**)::’ at examples/server/server.cpp:2492:61:
examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized]
  145 | struct task_server {
      | ^~~~~~~~~~~
examples/server/server.cpp: In lambda function:
examples/server/server.cpp:1253:21: note: ‘task’ declared here
 1253 | task_server task;
      | ^~~~
g++ -I.
-Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/gguf/gguf.cpp ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o gguf -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/llama-bench/llama-bench.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o llama-bench -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/llava/llava-cli.cpp examples/llava/clip.cpp examples/llava/llava.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o llava-cli -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas -Wno-cast-qual
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/baby-llama/baby-llama.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o baby-llama -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/beam-search/beam-search.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o beam-search -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/speculative/speculative.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o speculative -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/infill/infill.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o infill -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/benchmark/benchmark-matmult.cpp build-info.o ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o benchmark-matmult -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/parallel/parallel.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o parallel -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/finetune/finetune.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o finetune -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native examples/export-lora/export-lora.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o export-lora -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lhipblas -lamdhip64 -lrocblas
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c tests/test-c.c -o tests/test-c.o
```
./main -ngl 99 -m ../koboldcpp/models/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q5_K_M.gguf -mg 0 -p "Write a function in TypeScript that sums numbers"
```
Log start
main: build = 1503 (4a4fd3e)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed = 1699662201
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0
  Device 1: AMD Radeon RX 6900 XT, compute capability 10.3
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../koboldcpp/models/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor 0: token_embd.weight q5_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight q6_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.attn_v.weight q6_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.2.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 20: blk.2.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 21: blk.2.attn_v.weight q6_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 22: blk.2.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 24: blk.2.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 25: blk.2.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 28: blk.3.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 29: blk.3.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 30: blk.3.attn_v.weight q6_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 31: blk.3.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 33: blk.3.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 34: blk.3.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.4.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 38: blk.4.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 39: blk.4.attn_v.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 40: blk.4.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 42: blk.4.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 43: blk.4.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 46: blk.5.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 47: blk.5.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 48: blk.5.attn_v.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 49: blk.5.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 51: blk.5.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 52: blk.5.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 55: blk.6.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 56: blk.6.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 57: blk.6.attn_v.weight q6_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 58: blk.6.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 59: blk.6.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 60: blk.6.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 61: blk.6.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 64: blk.7.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 65: blk.7.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 66: blk.7.attn_v.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 67: blk.7.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 68: blk.7.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 69: blk.7.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 70: blk.7.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 73: blk.8.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 74: blk.8.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 75: blk.8.attn_v.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 76: blk.8.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 77: blk.8.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 78: blk.8.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 79: blk.8.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 82: blk.9.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 83: blk.9.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 84: blk.9.attn_v.weight q6_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 85: blk.9.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 86: blk.9.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 87: blk.9.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 88: blk.9.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 91: blk.10.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 92: blk.10.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 93: blk.10.attn_v.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 94: blk.10.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 95: blk.10.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 96: blk.10.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 97: blk.10.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 100: blk.11.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 101: blk.11.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 102: blk.11.attn_v.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 103: blk.11.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 104: blk.11.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 105: blk.11.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 106: blk.11.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 109: blk.12.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 110: blk.12.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 111: blk.12.attn_v.weight q6_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 112: blk.12.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 113: blk.12.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 114: blk.12.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 115: blk.12.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 118: blk.13.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 119: blk.13.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 120: blk.13.attn_v.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 121: blk.13.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 122: blk.13.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 123: blk.13.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 124: blk.13.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 127: blk.14.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 128: blk.14.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 129: blk.14.attn_v.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 130: blk.14.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 131: blk.14.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 132: blk.14.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 133: blk.14.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 136: blk.15.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 137: blk.15.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 138: blk.15.attn_v.weight q6_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 139: blk.15.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 141: blk.15.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 142: blk.15.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 145: blk.16.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 146: blk.16.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 147: blk.16.attn_v.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 148: blk.16.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 149: blk.16.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 150: blk.16.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 151: blk.16.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 154: blk.17.attn_q.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 155: blk.17.attn_k.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 156: blk.17.attn_v.weight q5_K [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 157: blk.17.attn_output.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 158: blk.17.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 159: blk.17.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 160: blk.17.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 163: blk.18.attn_q.weight q5_K [
4096, 4096, 1, 1 ] llama_model_loader: - tensor 164: blk.18.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 165: blk.18.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 166: blk.18.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 167: blk.18.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 168: blk.18.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 169: blk.18.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 172: blk.19.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 173: blk.19.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 174: blk.19.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 175: blk.19.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 176: blk.19.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 177: blk.19.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 178: blk.19.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 181: blk.20.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 182: blk.20.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 183: blk.20.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 184: blk.20.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 185: blk.20.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 186: blk.20.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 187: blk.20.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] 
llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 190: blk.21.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 191: blk.21.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 192: blk.21.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 193: blk.21.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 194: blk.21.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 195: blk.21.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 196: blk.21.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 199: blk.22.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 200: blk.22.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 201: blk.22.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 202: blk.22.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 203: blk.22.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 204: blk.22.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 205: blk.22.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 206: blk.22.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 207: blk.22.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 208: blk.23.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 209: blk.23.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 210: blk.23.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 211: blk.23.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 212: 
blk.23.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 213: blk.23.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 214: blk.23.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 215: blk.23.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 216: blk.23.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 217: blk.24.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 218: blk.24.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 219: blk.24.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 220: blk.24.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 221: blk.24.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 222: blk.24.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 223: blk.24.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 224: blk.24.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 225: blk.24.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 226: blk.25.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 227: blk.25.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 228: blk.25.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 229: blk.25.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 230: blk.25.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 231: blk.25.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 232: blk.25.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 233: blk.25.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 234: blk.25.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 235: blk.26.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 236: blk.26.attn_k.weight q5_K [ 4096, 
1024, 1, 1 ] llama_model_loader: - tensor 237: blk.26.attn_v.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 238: blk.26.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 239: blk.26.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 240: blk.26.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 241: blk.26.ffn_down.weight q5_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 242: blk.26.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 243: blk.26.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 244: blk.27.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 245: blk.27.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 246: blk.27.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 247: blk.27.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 248: blk.27.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 249: blk.27.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 250: blk.27.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 251: blk.27.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 252: blk.27.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 253: blk.28.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 254: blk.28.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 255: blk.28.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 256: blk.28.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 257: blk.28.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 258: blk.28.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 259: blk.28.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 260: blk.28.attn_norm.weight f32 [ 4096, 1, 1, 1 ] 
llama_model_loader: - tensor 261: blk.28.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 262: blk.29.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 263: blk.29.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 264: blk.29.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 265: blk.29.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 266: blk.29.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 267: blk.29.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 268: blk.29.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 271: blk.30.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 272: blk.30.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 273: blk.30.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 274: blk.30.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 275: blk.30.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 276: blk.30.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 277: blk.30.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ] llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 279: blk.30.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 280: blk.31.attn_q.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 281: blk.31.attn_k.weight q5_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 282: blk.31.attn_v.weight q6_K [ 4096, 1024, 1, 1 ] llama_model_loader: - tensor 283: blk.31.attn_output.weight q5_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 284: blk.31.ffn_gate.weight q5_K [ 4096, 14336, 1, 1 ] llama_model_loader: - tensor 
285: blk.31.ffn_up.weight q5_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 286: blk.31.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 287: blk.31.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 288: blk.31.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 289: output_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 19: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.78 GiB (5.67 BPW)
llm_load_print_meta: = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using ROCm for GPU acceleration
ggml_cuda_set_main_device: using device 0 (AMD Radeon RX 7900 XTX) as main device
llm_load_tensors: mem required = 86.04 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4807.05 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size = 64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 4944.06 MB (model: 4807.05 MB, context: 137.00 MB)
fish: Job 1, './main -ngl 99 -m ../koboldcpp/…' terminated by signal SIGSEGV (Address boundary error)
```
8XXD8 commented 10 months ago

This could be specific to RDNA3/gfx1100. I cross-compiled for the gfx900 target from a Debian NVIDIA build and it worked fine.

8XXD8 commented 10 months ago

Well, I managed to run into the same problem. After updating llama-cpp-python, the only response I get is an endless stream of "K" characters. I tried compiling separately, but got the same result.

It's odd that main runs when built with this:

make main LLAMA_HIPBLAS=on

but produces garbage output.

bojak83318 commented 9 months ago

Anyway, my PCIe lanes are not equal. This could be a problem, since multi-GPU support in ROCm 5.7 is only a preview:

''' Recommended system configuration for multi-GPU

PCIe® slots connected to the GPU must have identical PCIe lane width or bifurcation settings, and support PCIe 3.0 Atomics.

Refer to How ROCm uses PCIe Atomics for more information.


✓ - GPU0 PCIe x16 connection + GPU1 PCIe x16 connection

✓ - GPU0 PCIe x8 connection + GPU1 PCIe x8 connection

X  - GPU0 PCIe x16 connection + GPU1 PCIe x8 connection

''' Source :
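One way to check the quoted lane-width requirement on a given machine is to compare the negotiated link widths that `lspci` reports. The following is only a rough sketch: it assumes `lspci` (from pciutils) is installed and that the `LnkSta:` lines it prints have the usual `Width xN` form. The helper names `link_widths` and `widths_match` are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare negotiated PCIe link widths taken from `lspci -vv` output.
# Helper names are hypothetical; full LnkSta details usually require root.

# Print the width (e.g. "x16") from every "LnkSta:" line read on stdin.
# The trailing colon in the pattern skips the unrelated "LnkSta2:" lines.
link_widths() {
    grep -E '^[[:space:]]*LnkSta:' | grep -oE 'Width x[0-9]+' | awk '{print $2}'
}

# Succeed only if stdin holds at least one width and all widths are identical.
widths_match() {
    [ "$(sort -u | grep -c .)" -eq 1 ]
}
```

A possible invocation, where the bus addresses `03:00.0` and `0a:00.0` are placeholders for whatever `lspci` shows for the two GPUs:

```shell
{ sudo lspci -vv -s 03:00.0; sudo lspci -vv -s 0a:00.0; } | link_widths | widths_match \
    && echo "lane widths match" || echo "lane widths differ"
```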

8XXD8 commented 9 months ago

> Anyways my lanes are not equal , this could be a problem as in rocm 5.7 multi gpu is just a preview ''' Recommended system configuration for multi-GPU
>
> PCIe® slots connected to the GPU must have identical PCIe lane width or bifurcation settings, and support PCIe 3.0 Atomics.
>
> Refer to How ROCm uses PCIe Atomics for more information.
>
> ✓ - GPU0 PCIe x16 connection + GPU1 PCIe x16 connection
>
> ✓ - GPU0 PCIe x8 connection + GPU1 PCIe x8 connection
>
> X - GPU0 PCIe x16 connection + GPU1 PCIe x8 connection
>
> ''' Source :

Have you tried removing one of the GPUs? I mean physically, not just disabling it with HIP_VISIBLE_DEVICES. That way you can see whether it is a multi-GPU / PCIe lane issue or something else.
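Before pulling a card, the software-only variant mentioned above can at least be scripted as a quick first pass. This is a minimal sketch and does not replace the physical test; `run_per_gpu` is a hypothetical helper and `$MODEL` is a placeholder for a model small enough to fit on one card.

```shell
#!/bin/sh
# Sketch: run the same command once per GPU index with only that GPU visible.
# run_per_gpu is a hypothetical helper. The function body is a subshell, so
# its loop variables do not leak into the caller.
run_per_gpu() (
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        # Expose a single device to the HIP runtime for this run only.
        HIP_VISIBLE_DEVICES=$i "$@"
        status=$?
        echo "device $i: exit status $status"
        i=$((i + 1))
    done
)
```

For the two-GPU setup in this issue it might be called as follows. A shell normally reports a SIGSEGV as exit status 139.

```shell
run_per_gpu 2 ./main -ngl 99 -m "$MODEL" -p "Write a function in TypeScript that sums numbers"
```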

YellowRoseCx commented 9 months ago

I believe the problem lies in a bug in ROCm's initialization process. It has been fixed, but I believe the fix will not be released until ROCm 6.0.0, unless you build ROCm yourself with a self-compiled rocBLAS and Tensile that include these commits:

rocBLAS Commit ID: bc4d8f5
Tensile Commit ID: ROCmSoftwarePlatform/Tensile@24d54d7
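To see whether an installed ROCm is already at or past the 6.0.0 release mentioned above, a version comparison along these lines may help. It is a sketch under assumptions: the `/opt/rocm/.info/version` path is an assumption about a default install and may differ on your system, `version_ge` is a hypothetical helper, and `sort -V` requires GNU coreutils.

```shell
#!/bin/sh
# Sketch: succeed if version $1 is greater than or equal to version $2.
# version_ge is a hypothetical helper; it relies on GNU `sort -V`.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

Possible usage, with the version file path being an assumption as noted:

```shell
installed=$(cat /opt/rocm/.info/version)
version_ge "$installed" 6.0.0 && echo "fix should be included" || echo "older than 6.0.0"
```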

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.