ggerganov / llama.cpp

LLM inference in C/C++
MIT License

regression: output is nonsense with latest commit and CUDA support enabled #7451

Closed: enolan closed this issue 3 months ago

enolan commented 3 months ago

On commit 201cc11a, I get gibberish output when sampling from Llama-3-8B quantized with Q5_K_M (same behavior with Q8_0, F16, F32, and Q4_K_M). This happens only when llama.cpp is built with CUDA support, not without it. I'm building with Nix. Here's an example output:

```
enolan@chonk ~/j/llama.cpp (master)> ./result-cuda-201cc11a/bin/llama -m /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64

Log start                                                                                                                                                                                                                                     
main: build = 0 (unknown)                                                                                                                                                                                                                     
main: built with gcc (GCC) 12.3.0 for x86_64-unknown-linux-gnu                                                         
main: seed  = 1                                                                                                        
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf (version GGUF V3 (latest))                                                                               
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.                      
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 17
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: f_logit_scale    = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: n_ff             = 14336                                                                          
llm_load_print_meta: n_expert         = 0                                                                              
llm_load_print_meta: n_expert_used    = 0                                                                                                                                                                                                     
llm_load_print_meta: causal attn      = 1                                                                              
llm_load_print_meta: pooling type     = 0                                                                              
llm_load_print_meta: rope type        = 0                                                                              
llm_load_print_meta: rope scaling     = linear                                                                         
llm_load_print_meta: freq_base_train  = 500000.0                                                                       
llm_load_print_meta: freq_scale_train = 1                                                                              
llm_load_print_meta: n_yarn_orig_ctx  = 8192                                                                           
llm_load_print_meta: rope_finetuned   = unknown                                                                        
llm_load_print_meta: ssm_d_conv       = 0                                                                              
llm_load_print_meta: ssm_d_inner      = 0                                                                              
llm_load_print_meta: ssm_d_state      = 0                                                                              
llm_load_print_meta: ssm_dt_rank      = 0                                                                              
llm_load_print_meta: model type       = 8B                                                                             
llm_load_print_meta: model ftype      = Q5_K - Medium                                                                  
llm_load_print_meta: model params     = 8.03 B                                                                         
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW)                                          
llm_load_print_meta: general.name     = Meta-Llama-3-8B                                                                                                                                                                                       
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'                                                                                                                                                                            
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'                                                                                                                                                                              
llm_load_print_meta: LF token         = 128 'Ä'                                                                        
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'                                       
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no                                                                              
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes   
ggml_cuda_init: found 1 CUDA devices:        
  Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB                                                                          
llm_load_tensors: offloading 0 repeating layers to GPU  
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  5459.93 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   669.48 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 356

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 0

<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather. Annapolis, a town in Maryland, has the highest concentration of naval officers in the US. It was once home to the US Naval Academy, the most prominent naval academy in the country. In 2011, the academy was moved to Washington, DC, and the Naval Academy has since been renamed as the Naval
llama_print_timings:        load time =     453.35 ms
llama_print_timings:      sample time =       5.42 ms /    64 runs   (    0.08 ms per token, 11816.84 tokens per second)
llama_print_timings: prompt eval time =     562.33 ms /    48 tokens (   11.72 ms per token,    85.36 tokens per second)
llama_print_timings:        eval time =    9444.97 ms /    63 runs   (  149.92 ms per token,     6.67 tokens per second)
llama_print_timings:       total time =   10052.76 ms /   111 tokens
Log end
```

It starts talking about Annapolis, Maryland for some reason, instead of fabric. Other seeds also produce nonsense, either gibberish or a nonsensical change of topic. In contrast, the CPU-only build is fine:

```
enolan@chonk ~/j/llama.cpp (master)> ./result-cpuonly-201cc11a/bin/llama -m /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64
Log start                                                                                                                                                                                                                                     
main: build = 0 (unknown)                                                                                                                                                                                                                     
main: built with gcc (GCC) 13.2.0 for x86_64-unknown-linux-gnu                                                                                                                                                                                
main: seed  = 1                                                                                                        
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf (version GGUF V3 (latest))                                                                               
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.                      
llama_model_loader: - kv   0:                       general.architecture str              = llama                      
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B            
llama_model_loader: - kv   2:                          llama.block_count u32              = 32                         
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192                       
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096                       
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336                      
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32                         
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8                          
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000              
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010                   
llama_model_loader: - kv  10:                          general.file_type u32              = 17                         
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256                     
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128                        
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2                       
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe                  
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...                                                                                                          
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...                                                                                                          
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...                                                                                                                    
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000                     
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00                                                                        
llm_load_print_meta: f_max_alibi_bias = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: f_logit_scale    = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: n_ff             = 14336                                                                                                                                                                                                 
llm_load_print_meta: n_expert         = 0                                                                              
llm_load_print_meta: n_expert_used    = 0                                                                                                                                                                                                     
llm_load_print_meta: causal attn      = 1                                                                              
llm_load_print_meta: pooling type     = 0                                                                              
llm_load_print_meta: rope type        = 0                                                                              
llm_load_print_meta: rope scaling     = linear                                                                         
llm_load_print_meta: freq_base_train  = 500000.0                                                                       
llm_load_print_meta: freq_scale_train = 1                                                                              
llm_load_print_meta: n_yarn_orig_ctx  = 8192                                                                           
llm_load_print_meta: rope_finetuned   = unknown                                                                        
llm_load_print_meta: ssm_d_conv       = 0                                                                              
llm_load_print_meta: ssm_d_inner      = 0                                                                              
llm_load_print_meta: ssm_d_state      = 0                                                                                                                                                                                                     
llm_load_print_meta: ssm_dt_rank      = 0                                                                                                                                                                                                     
llm_load_print_meta: model type       = 8B                                                                                                                                                                                                    
llm_load_print_meta: model ftype      = Q5_K - Medium                                                                  
llm_load_print_meta: model params     = 8.03 B                                                                                                                                                                                                
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW)                                                                                                                                                                                   
llm_load_print_meta: general.name     = Meta-Llama-3-8B                                                                                                                                                                                       
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'                                                                                                                                                                            
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'                                                                                                                                                                              
llm_load_print_meta: LF token         = 128 'Ä'                                                                                                                                                                                               
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'                                                            
llm_load_tensors: ggml ctx size =    0.15 MiB                                                                          
llm_load_tensors:        CPU buffer size =  5459.93 MiB                                                                                                                                                                                       
.........................................................................................                                                                                                                                                     
llama_new_context_with_model: n_ctx      = 512       
llama_new_context_with_model: n_batch    = 512                                                                         
llama_new_context_with_model: n_ubatch   = 512                                                                         
llama_new_context_with_model: flash_attn = 0                                                                           
llama_new_context_with_model: freq_base  = 500000.0                                                                                                                                                                                           
llama_new_context_with_model: freq_scale = 1                                                                           
llama_kv_cache_init:        CPU KV buffer size =    64.00 MiB                                                          
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB                  
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB                                            
llama_new_context_with_model:        CPU compute buffer size =   258.50 MiB                                            
llama_new_context_with_model: graph nodes  = 1030                                                                                                                                                                                             
llama_new_context_with_model: graph splits = 1                                                                                                                                                                                                

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |                                                                                                                                                                                          
sampling:                                                                                                                                                                                                                                     
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000                                                                                                                                       
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800                       
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000                                                        
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature                                       
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 0                                                                                                                                                                             

<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.  '''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.                                                                           
+ Seersucker fabrics are woven with extra threads of yarn, which are left                                              
llama_print_timings:        load time =     380.38 ms                                                                  
llama_print_timings:      sample time =       4.86 ms /    64 runs   (    0.08 ms per token, 13179.57 tokens per second)
llama_print_timings: prompt eval time =    1803.44 ms /    48 tokens (   37.57 ms per token,    26.62 tokens per second)
llama_print_timings:        eval time =    9420.80 ms /    63 runs   (  149.54 ms per token,     6.69 tokens per second)
llama_print_timings:       total time =   11269.28 ms /   111 tokens                                                   
Log end
```

It's repeating itself, but at least it makes sense. 6369bf04 (the previous commit) is fine with CUDA (and with CPU):

```
enolan@chonk ~/j/llama.cpp (master)> ./result-cuda-6369bf04/bin/llama -m /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64
Log start
main: build = 0 (unknown)                      
main: built with gcc (GCC) 12.3.0 for x86_64-unknown-linux-gnu                                                         
main: seed  = 1                                                                                                        
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama                      
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B            
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000                                                                                                                                     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010                                                                                                                                          
llama_model_loader: - kv  10:                          general.file_type u32              = 17                                                                                                                                                
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256                     
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2                                                                                                                                              
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336                                                                          
llm_load_print_meta: n_expert         = 0                                                                              
llm_load_print_meta: n_expert_used    = 0                                                                              
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0                                                                              
llm_load_print_meta: rope type        = 0                                                                              
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown           
llm_load_print_meta: ssm_d_conv       = 0            
llm_load_print_meta: ssm_d_inner      = 0                                                                                                                                                                                                     
llm_load_print_meta: ssm_d_state      = 0                                                                                                                                                                                                     
llm_load_print_meta: ssm_dt_rank      = 0                                                                                                                                                                                                     
llm_load_print_meta: model type       = 8B                                                                             
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 8.03 B                                                                                                                                                                                                
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW)  
llm_load_print_meta: general.name     = Meta-Llama-3-8B
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>' 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  5459.93 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   669.48 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 356

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 0

<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.  '''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.
+ Seersucker fabrics are woven with extra "bunching" yarns
llama_print_timings:        load time =     453.15 ms
llama_print_timings:      sample time =       5.00 ms /    64 runs   (    0.08 ms per token, 12812.81 tokens per second)
llama_print_timings: prompt eval time =     560.83 ms /    48 tokens (   11.68 ms per token,    85.59 tokens per second)
llama_print_timings:        eval time =    9432.22 ms /    63 runs   (  149.72 ms per token,     6.68 tokens per second)
llama_print_timings:       total time =   10037.80 ms /   111 tokens
Log end
```
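
For anyone reproducing this outside Nix: the two builds I'm comparing correspond roughly to the following (a sketch using the CMake flags from around this commit, not my actual Nix derivation):

```
# CPU-only build
cmake -B build-cpu
cmake --build build-cpu -j

# CUDA build (LLAMA_CUDA replaced the older LLAMA_CUBLAS flag)
cmake -B build-cuda -DLLAMA_CUDA=ON
cmake --build build-cuda -j
```
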
justinsteven commented 3 months ago

I can reproduce on 201cc11afa0a1950e1f632390b2ac6c937a0d8f0

Startup:

```
Log start
main: build = 2961 (201cc11a)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed = 1716356810
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/bartowski/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: quantize.imatrix.file str = /models/Meta-Llama-3-8B-Instruct-GGUF...
llama_model_loader: - kv 23: quantize.imatrix.dataset str = /training_data/groups_merged.txt
llama_model_loader: - kv 24: quantize.imatrix.entries_count i32 = 224
llama_model_loader: - kv 25: quantize.imatrix.chunks_count i32 = 88
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
validate_override: Using metadata override ( str) 'tokenizer.ggml.pre' = llama3
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 532.31 MiB
llm_load_tensors: CUDA0 buffer size = 7605.33 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 296.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
Reverse prompt: '<|eot_id|>'
Reverse prompt: '### Instruction:
'
Input prefix: '
<|start_header_id|>user<|end_header_id|>

'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
```

```
<|begin_of_text|>
>
<|start_header_id|>user<|end_header_id|>

hello llama
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hello, I am i am looking for a little help me<|eot_id|>

>
<|start_header_id|>user<|end_header_id|>

I know llama. I know.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hello, what can you want to help me<|eot_id|>

>
<|start_header_id|>user<|end_header_id|>
<|begin_of_text|>
>
<|start_header_id|>user<|end_header_id|>

hello llama
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I'm not found this is a friendly assistant

I can help me
What are you'request

Hello there.<|eot_id|>

>
<|start_header_id|>user<|end_header_id|>

what can you help with
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I'do<|eot_id|>

>
<|start_header_id|>user<|end_header_id|>
```
YannFollet commented 3 months ago

Same for me; see this thread: https://github.com/ggerganov/llama.cpp/issues/7450

duynt575 commented 3 months ago

Yes, I also get gibberish output; older versions like b2953 work normally with the same GGUF file. The latest CUDA build somehow generates gibberish, while the Vulkan build works fine.

wcde commented 3 months ago

I can confirm: after #7225, generation is completely broken. I checked it on CPU, a 4090, and a P40, with different models. I tried b2961, tried FORCE_MMQ, with and without FA. Nothing works. It's sad that we don't have proper automated regression tests.
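
Even a crude smoke test comparing backends would have caught this. A sketch of what I mean (paths, build dirs, and the binary name are illustrative; --temp 0 makes the sampled text deterministic per backend):

```
# compare greedy output of a CPU build and a CUDA build on the same prompt
PROMPT="'''Seersucker''' or '''railroad stripe''' is a thin, puckered"
./build-cpu/bin/main  -m model.gguf -p "$PROMPT" -n 32 --temp 0 -s 1 2>/dev/null > cpu.txt
./build-cuda/bin/main -m model.gguf -p "$PROMPT" -n 32 --temp 0 -s 1 -ngl 99 2>/dev/null > cuda.txt
# small numeric drift between backends is normal, but greedy text rarely
# diverges after only 32 tokens; a diff here is a strong regression signal
diff cpu.txt cuda.txt && echo "backends agree" || echo "possible regression"
```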

ggerganov commented 3 months ago

Check if https://github.com/ggerganov/llama.cpp/pull/7452 fixes the issue

ChryGigio commented 3 months ago

Looks good on my end after cherry-picking #7452 into master. Roughly what I did (a sketch; assumes the PR is a single commit):
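
```
# fetch the PR head from GitHub and apply it on top of master
git fetch origin pull/7452/head
git cherry-pick FETCH_HEAD
```

Then rebuilt and re-ran the original prompt: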

```
./llama -m /mnt/ssd/ai/txtgen/models/gguf/neopolita_meta-llama-3-8b-instruct_q6_k.gguf -ngl 100 -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64

----

<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather. Seersucker was originally an Indian cotton fabric, brought to the United States by [[colonial American]]s and [[Native Americans]]. It is typically white or light-colored with a contrasting stripe or check pattern.
Seersucker fabric is known for its unique texture, which is created by interlocking loops of yarn, which
llama_print_timings:        load time =    1162.67 ms
llama_print_timings:      sample time =       6.00 ms /    64 runs   (    0.09 ms per token, 10663.11 tokens per second)
llama_print_timings: prompt eval time =     125.71 ms /    48 tokens (    2.62 ms per token,   381.82 tokens per second)
llama_print_timings:        eval time =    1552.75 ms /    63 runs   (   24.65 ms per token,    40.57 tokens per second)
llama_print_timings:       total time =    1728.66 ms /   111 tokens
```
eamonnmag commented 3 months ago

I believe this issue is still present in the latest releases. It is more prevalent when multiple people connect to the same server instance.

I've gone back for now to ecab1c75de68de7c41c254e2ae170d3b07bee6d4 and it works as before; I really just need the new /health endpoint.
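
For reference, the health probe in question (a sketch; 8080 is just the server's default port, adjust to your --host/--port):

```
# poll the llama.cpp server health endpoint
curl -s http://localhost:8080/health
```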