ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Llama3 8B Instruct Model outputting nonsensical text on AMD GPUs. #7984

Closed: aymane-eljerari closed this issue 2 months ago

aymane-eljerari commented 2 months ago

What happened?

I am running Llama3 8B Instruct, but the model output doesn't make sense. I followed the general guidelines of the main (CLI) example and also used the prompt format for Llama3 specified by Meta.

I tried two different models:

  1. I used convert-hf-to-gguf.py to generate the GGUF file from the original Llama3 8B Instruct weights (see the sketch after this list).
  2. I downloaded the Q8 quantization of Llama3 8B from bartowski.
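
For reference, a minimal sketch of the conversion in step 1, assuming the usual convert-hf-to-gguf.py invocation; the --outtype/--outfile arguments and paths are illustrative and may differ between llama.cpp versions:

# Convert the original HF checkpoint to an f16 GGUF file (paths illustrative).
python llama.cpp/convert-hf-to-gguf.py \
    llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct \
    --outtype f16 \
    --outfile llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf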

Below is the bash script I call to run the model with the prompt format suggested by Meta:

system_prompt="You are a helpful AI assistant, answer the user's questions most appropriately.. Be witty and precise but do not answer incorrectly."
user_prompt="What were Ada Lovelace's contribution to modern computing?"

./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 50 -n 256 -c 8192 -i --prompt "<|begin_of_text|><|start_header_id|>system<|end_header_id|>

${system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

${user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>" --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
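
One thing visible in the log further down: llama-cli prepends a BOS token on its own, so starting the prompt with an explicit <|begin_of_text|> results in two BOS tokens (the tokenizer prints a warning about this). A minimal sketch of the same invocation without the explicit BOS, everything else unchanged:

# Same invocation, minus the leading <|begin_of_text|>; llama-cli adds the BOS token itself.
./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 50 -n 256 -c 8192 -i --prompt "<|start_header_id|>system<|end_header_id|>

${system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

${user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>" --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"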

Additional Considerations:

Questions that I have:

  1. Is this issue caused by an incorrect prompt format?
  2. Are the --in-prefix and --in-suffix flags necessary?
  3. What steps should I take to solve this?

Name and Version

llama.cpp/llama-cli --version
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

root@05529a955a10:~/git/rocm-llm# ./run_llama3_8B.sh 
Log start
main: build = 0 (unknown)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1718665517
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 ROCm devices:
  Device 0: , compute capability 9.0, VMM: no
  Device 1: , compute capability 9.0, VMM: no
llm_load_tensors: ggml ctx size =    0.44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  7072.53 MiB
llm_load_tensors:      ROCm1 buffer size =  7242.48 MiB
llm_load_tensors:        CPU buffer size =  1002.00 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   544.00 MiB
llama_kv_cache_init:      ROCm1 KV buffer size =   480.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      ROCm0 compute buffer size =   640.01 MiB
llama_new_context_with_model:      ROCm1 compute buffer size =   640.02 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    72.02 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 3

system_info: n_threads = 128 / 256 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
main: interactive mode on.
Input prefix: '<|eot_id|><|start_header_id|>user<|end_header_id|>'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>'
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 8192, n_batch = 2048, n_predict = 256, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system

You are a helpful AI assistant, answer the user's questions most appropriately.. Be witty and precise but do not answer incorrectly.user

What were Ada Lovelace's contribution to modern computing?assistant polals.factoryornoals.googleangoalsorge Flake问 Bilg burstals burstornoingo Juniclesalsidesant Bilgorgetemplorge Northern Anticlesidesidesango burst Northernicles Northernansals burst Bilgorno compassingo attendantango.factoryorge attendantidesantussisinides.googleingo Bilgornoalsallenalsingo bursttempl Bilgides Abs NorthernAntorno Northernodos Northernallen Bilg Northernalsinoingoightsidesisinalsalsalsornoides Northern Bilg compass Bilgodos.googleingoalsingoidesantalsalsans polornoicles问idesorge Absisinidesandsands问ussandsorgeorge polorge Hindalstemplingo问 attendantidesorgeorgeingoightsingo Junidesorgeantorge Northernidesingoalsodosorgeingo appartingoingoalsingo Bilgidesides Flakeorgeornoinoorge.googleidesingo Northernuss burst� attendantidesingoornotemplornoino pol问ingo Bilg compassingoant polalsides attendantingoidesisinussingoingoingoodosingo Anttemplinoidesornoidestemplussussides Jinodosisinidesornoorge Cranornoidesorge.googleals Hind Cran NorthernidesalsorgeAnttemplalsorgeornoingoisinidesAntides Bilg Northernides问idesisin问ingoingoorgeidesornoalsingo�ornoornoidesides Northernodos attendantalstempl Northern Absandsingo Northern<|eot_id|><|start_header_id|>user<|end_header_id|>hello
<|eot_id|><|start_header_id|>assistant<|end_header_id|>ansorgeightsalsuss polodos.google burstorge�ingoidesorgeiclesights Craningoodos Flake.factorytempl AbsAnt Northernornoingoingoisinidesingoalsingoorno burst Absornoisin Hind Northern�angoans Absorno Bilgornoino Northerningoornoidesinoisinals�ussinoisinorgeodos.googleornoides Northernorgealsorno Bilgingo Craninoingoorgeodosorge Bilgango问 Absodosingoalsingoorno Ant Hindorgeidesingo polides Cranalsorno polinoingoussicles Absides compassorge compassidesidesuss Hindalstempl Northernals问 Northernandsornoidesals.googleidesidesans Bilg Bilgingoidesiclesorge问orgeussangoorgeinoorgeidesussalsornoingoallen Northernangoingoansalsorge Northern Northern Cran pol Flakeingoussino.googleidesidesodosalsicles Absalsides Hindisin appartorno Cranantornoorgeorgeorge.googlealsingoornoorgealsornoinoidesalsornouss.googleicles Absides Antingoidesidesornoinoidesornoangoides Bilgalstemplussingoides Northern�ingo Northernals Flakeornoorgeidesingoicles问 Northern Antinoorno Northern Bilg Cranals Flakealsingo polingo pol�als�ides burstallentempl Bilg attendantansantantingo bursticlesisinornoodosisinorgeides<|eot_id|><|start_header_id|>user<|end_header_id|>

llama_print_timings:        load time =   14053.54 ms
llama_print_timings:      sample time =     105.58 ms /   510 runs   (    0.21 ms per token,  4830.51 tokens per second)
llama_print_timings: prompt eval time =    6772.69 ms /    63 tokens (  107.50 ms per token,     9.30 tokens per second)
llama_print_timings:        eval time =   17245.63 ms /   508 runs   (   33.95 ms per token,    29.46 tokens per second)
llama_print_timings:       total time =   25450.76 ms /   571 tokens
aymane-eljerari commented 2 months ago

It looks like the issue was completely unrelated to the prompt format. The model outputs make sense only when -ngl is set to 0.
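
For reference, a sketch of that workaround: the same invocation as the script above, only with GPU offloading disabled ("$full_prompt" is just a placeholder for the unchanged Llama 3 prompt string):

# Workaround: -ngl 0 keeps every layer on the CPU, and the output is coherent again.
./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 0 -n 256 -c 8192 -i --prompt "$full_prompt"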

aymane-eljerari commented 2 months ago

#6208 addresses this issue. The problem was fixed by adding LLAMA_CUDA_NO_PEER_COPY=1 to the compilation step.
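
For anyone hitting the same problem, a sketch of the rebuild, assuming the Makefile-based ROCm build used for this version (LLAMA_HIPBLAS and LLAMA_CUDA_NO_PEER_COPY are the option names from that build path; newer CMake-based builds may name them differently):

# Rebuild with hipBLAS/ROCm support and device-to-device peer copies disabled.
make clean
make -j LLAMA_HIPBLAS=1 LLAMA_CUDA_NO_PEER_COPY=1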