deepseek-ai / DeepSeek-Coder-V2

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
MIT License
1.19k stars 62 forks

Model always responds in Chinese, ignores system prompts stating to only reply in English #12

Closed · sammcj closed this issue 1 week ago

sammcj commented 1 week ago

DeepSeek Coder V2 seems to only respond in Chinese. This occurs even when the system prompt explicitly states to respond only in English:

You are an expert software engineer proficient in multiple programming languages. Your task is to generate, complete, and refactor code snippets based on the given instructions. Provide clean, efficient, and well-commented code.

IMPORTANT: Always respond in English.

Still results in Chinese rather than English:

[screenshot: the model's reply is in Chinese despite the English-only system prompt]
guoday commented 1 week ago

We find that this issue is due to quantization, as we did not find this problem in our bf16 checkpoint.

sammcj commented 1 week ago

This is interesting! Is there anything we can do when quantizing the model to prevent this from occurring?

I know that for models that are heavily used for embeddings, keeping the token-embedding tensor type at f16 while quantizing the rest of the model as normal can help.
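
For example, whether a given GGUF quant kept the token embeddings at f16 can be checked with the gguf Python package that ships with llama.cpp (a rough sketch; "some-quant.gguf" is just a placeholder path):

# Sketch: check the stored type of the token-embedding (and output) tensors in a GGUF file
from gguf import GGUFReader

reader = GGUFReader("some-quant.gguf")  # placeholder path
for tensor in reader.tensors:
    if tensor.name in ("token_embd.weight", "output.weight"):
        # tensor_type is a GGMLQuantizationType enum value, e.g. F16 or Q6_K
        print(tensor.name, tensor.tensor_type.name, list(tensor.shape))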

guoday commented 1 week ago

We recommend this quantization method: https://github.com/spcl/QuaRot. Alternatively, do not quantize the attention layers or the shared experts.

guoday commented 1 week ago

Another approach is to change the quantization calibration data. The calibration data should be formatted with our chat template; by default, it is likely formatted for the base model.

sammcj commented 1 week ago

According to your README, the best chat template to use is:

<|begin▁of▁sentence|>{system_message}

User: {user_message_1}

Assistant: {assistant_message_1}<|end▁of▁sentence|>User: {user_message_2}

Assistant:

I believe the template I'm using is the same; here it is in Ollama's Modelfile template format (note that the BOS/EOS tokens are added automatically):

{{ if .System }}{{ .System }}

{{ end }}{{ if .Prompt }}User: {{ .Prompt }}

{{ end }}Assistant: {{ .Response }}
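
As a sanity check on the formatting itself, the reference template can also be rendered with the HF tokenizer and diffed against what Ollama actually sends to the model; a rough sketch using transformers:

# Sketch: render the reference chat template with the HF tokenizer so it can be
# compared against the prompt string Ollama/llama.cpp builds
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", trust_remote_code=True
)

messages = [
    {"role": "system", "content": "IMPORTANT: Always respond in English."},
    {"role": "user", "content": "Write a hello-world HTTP server."},
]

# tokenize=False returns the raw prompt string, including the BOS token
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(repr(prompt))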
sammcj commented 1 week ago

Here's an example of a popular quantized version of the model: https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF?show_file_info=DeepSeek-Coder-V2-Lite-Instruct-Q6_K.gguf

Here we see:

tokenizer.ggml.bos_token_id     100000
tokenizer.ggml.eos_token_id     100001
tokenizer.ggml.padding_token_id     100001
tokenizer.ggml.add_bos_token    true
tokenizer.ggml.add_eos_token    false

and a template of:

tokenizer.chat_template     {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}{{ 'User: ' + message['content'] + ' ' }}{% elif message['role'] == 'assistant' %}{{ 'Assistant: ' + message['content'] + eos_token }}{% elif message['role'] == 'system' %}{{ message['content'] + ' ' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}

However, if the template is provided in the Ollama Modelfile (as in my previous comment), it will be used instead, as long as those bos/eos/pad token IDs are correct.
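
If it's useful, the same fields can also be checked locally with the gguf Python package that ships with llama.cpp (a rough sketch; the exact field-access details vary a little between gguf-py versions):

# Sketch: dump tokenizer-related metadata from a local GGUF file (pip install gguf)
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-Coder-V2-Lite-Instruct-Q6_K.gguf")

for name, field in reader.fields.items():
    if not name.startswith("tokenizer."):
        continue
    # String values are stored as uint8 arrays; the ReaderField layout differs
    # slightly across gguf-py versions, hence the defensive decoding here
    try:
        value = bytes(field.parts[field.data[0]]).decode("utf-8")
    except Exception:
        value = field.parts[field.data[0]] if field.data else None
    print(f"{name}: {str(value)[:80]}")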

guoday commented 1 week ago

The Ollama chat template is correct. What I mean is that, during the quantization process, it might be better to organize the calibration data in the chat-template format.
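
For example, the calibration file could be built roughly like this (the sample text and file name below are only placeholders):

# Sketch: wrap raw calibration samples in the DeepSeek-Coder-V2 chat format
# before using them as quantization/imatrix calibration data
raw_samples = [
    ("You are an expert software engineer. Always respond in English.",
     "Refactor this function to be iterative.",
     "Here is the iterative version: ..."),
]  # placeholder; substitute the real calibration corpus

BOS = "<|begin▁of▁sentence|>"
EOS = "<|end▁of▁sentence|>"

with open("calibration_chat_format.txt", "w", encoding="utf-8") as f:
    for system, user, assistant in raw_samples:
        # Mirrors the README template: system, then "User: ...", then "Assistant: ..." + EOS
        f.write(f"{BOS}{system}\n\nUser: {user}\n\nAssistant: {assistant}{EOS}\n")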

sammcj commented 1 week ago

I don't think it's quantization that's causing it.

I just converted the HF safetensors straight to f16 GGUF with no quantization and it still outputs in Chinese:

[screenshot: the f16 GGUF conversion still replying in Chinese]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.name str              = deepseek-ai_DeepSeek-Coder-V2-Lite-In...
llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type  f16:  269 tensors
llm_load_vocab: special tokens cache size = 2400
llm_load_vocab: token to piece cache size = 0.6661 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 102400
llm_load_print_meta: n_merges         = 99757
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 27
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 10944
llm_load_print_meta: n_expert         = 64
llm_load_print_meta: n_expert_used    = 6
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 16B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 15.71 B
llm_load_print_meta: model size       = 29.26 GiB (16.00 BPW)
llm_load_print_meta: general.name     = deepseek-ai_DeepSeek-Coder-V2-Lite-Instruct
llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 126 'Ä'
llm_load_print_meta: n_layer_dense_lead   = 1
llm_load_print_meta: n_lora_q             = 0
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 1408
llm_load_print_meta: n_expert_shared      = 2
llm_load_print_meta: expert_weights_scale = 1.0
llm_load_print_meta: rope_yarn_log_mul    = 0.0707
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.53 MiB
time=2024-06-20T11:52:38.664Z level=INFO source=server.go:585 msg="waiting for server to become available" status="llm server not responding"
llm_load_tensors: offloading 27 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 28/28 layers to GPU
llm_load_tensors:        CPU buffer size =   400.00 MiB
llm_load_tensors:      CUDA0 buffer size = 19122.57 MiB
llm_load_tensors:      CUDA1 buffer size = 10441.92 MiB
time=2024-06-20T11:52:40.270Z level=INFO source=server.go:585 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
llama_new_context_with_model: n_ctx      = 20480
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =  3600.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =  1800.00 MiB
llama_new_context_with_model: KV self size  = 5400.00 MiB, K (f16): 3240.00 MiB, V (f16): 2160.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.40 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   840.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   840.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   164.02 MiB
llama_new_context_with_model: graph nodes  = 1924
llama_new_context_with_model: graph splits = 3
INFO [main] model loaded | tid="140446201499648" timestamp=1718884364
guoday commented 1 week ago

We did not encounter this issue during our internal testing of FP16. We will re-test this prompt for you.

sammcj commented 1 week ago

Thank you, that would be very interesting to see.

Here are my conversion steps if it helps:

#!/usr/bin/env bash

# 0. Clone llama.cpp and install requirements for convert-hf-to-gguf.py
# 1. Download the model from huggingface
# 2. Update the paths below
# 3. Run this script

/path/to/llama.cpp/convert-hf-to-gguf.py \
  /models/DeepSeek-Coder-V2-Lite-Instruct \
  --outtype f16 \
  --outfile DeepSeek-Coder-V2-Lite-Instruct.f16.gguf
guoday commented 1 week ago
[screenshots 2024-06-20: test code and output]

I used this code to test your prompt. It generated almost the same output, except that the English text was translated into Chinese. I need some time to confirm what caused this issue. It might be the GGUF conversion process.
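
Roughly, the test follows the standard transformers chat usage from the model card (the prompt and generation settings here are simplified placeholders):

# Sketch: run the English-only system prompt against the bf16 HF checkpoint
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are an expert software engineer. "
                                  "IMPORTANT: Always respond in English."},
    {"role": "user", "content": "Write a function that reverses a linked list."},
]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))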

guoday commented 1 week ago

Thank you, that would be very interesting to see.

Here are my conversion steps if it helps:

#!/usr/bin/env bash

# 0. Clone llama.cpp and install requirements for convert-hf-to-gguf.py
# 1. Download the model from huggingface
# 2. Update the paths below
# 3. Run this script

/path/to/llama.cpp/convert-hf-to-gguf.py \
  /models/DeepSeek-Coder-V2-Lite-Instruct \
  --outtype f16 \
  --outfile DeepSeek-Coder-V2-Lite-Instruct.f16.gguf

Can you try using --outtype bf16? This will help us analyze the issue. Thank you.

sammcj commented 1 week ago

It seems Ollama doesn't support BF16. I can try with llama.cpp, but it's getting late here, so I'll give it a go tomorrow.

--

Also FYI - the HF->GGUF conversion logs, if they're useful:

INFO:hf-to-gguf:Loading model: deepseek-ai_DeepSeek-Coder-V2-Lite-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 163840
INFO:hf-to-gguf:gguf: embedding length = 2048
INFO:hf-to-gguf:gguf: feed forward length = 10944
INFO:hf-to-gguf:gguf: head count = 16
INFO:hf-to-gguf:gguf: key-value head count = 16
INFO:hf-to-gguf:gguf: rope theta = 10000
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: experts used count = 6
INFO:hf-to-gguf:gguf: file type = 32
INFO:hf-to-gguf:Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:gguf.vocab:Adding 99757 merge(s).
INFO:gguf.vocab:Setting special token type bos to 100000
INFO:gguf.vocab:Setting special token type eos to 100001
INFO:gguf.vocab:Setting special token type pad to 100001
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting add_eos_token to False
INFO:gguf.vocab:Setting chat_template to {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}{{ 'User: ' + message['content'] + '

' }}{% elif message['role'] == 'assistant' %}{{ 'Assistant: ' + message['content'] + eos_token }}{% elif message['role'] == 'system' %}{{ message['content'] + '

' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
INFO:hf-to-gguf:Exporting model to '/mnt/llm/models/DeepSeek-Coder-V2-Lite-Instruct.bf16.bin'
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-000004.safetensors'
INFO:hf-to-gguf:output.weight,                torch.bfloat16 --> BF16, shape = {2048, 102400}
INFO:hf-to-gguf:token_embd.weight,            torch.bfloat16 --> BF16, shape = {2048, 102400}
INFO:hf-to-gguf:blk.0.attn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.ffn_down.weight,        torch.bfloat16 --> BF16, shape = {10944, 2048}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,        torch.bfloat16 --> BF16, shape = {2048, 10944}
INFO:hf-to-gguf:blk.0.ffn_up.weight,          torch.bfloat16 --> BF16, shape = {2048, 10944}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.0.attn_kv_b.weight,       torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.0.attn_output.weight,     torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.0.attn_q.weight,          torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.1.attn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.1.ffn_down_exps.weight,   torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.1.ffn_gate_exps.weight,   torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.1.ffn_up_exps.weight,     torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.1.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.1.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.1.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.1.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.1.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.1.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.1.attn_kv_b.weight,       torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.1.attn_output.weight,     torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.1.attn_q.weight,          torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.2.attn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.2.ffn_down_exps.weight,   torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.2.ffn_gate_exps.weight,   torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.2.ffn_up_exps.weight,     torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.2.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.2.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.2.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.2.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.2.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.2.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.2.attn_kv_b.weight,       torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.2.attn_output.weight,     torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.2.attn_q.weight,          torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.3.attn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.3.ffn_down_exps.weight,   torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.3.ffn_gate_exps.weight,   torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.3.ffn_up_exps.weight,     torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.3.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.3.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.3.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.3.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.3.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.3.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.3.attn_kv_b.weight,       torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.3.attn_output.weight,     torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.3.attn_q.weight,          torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.4.attn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.4.ffn_down_exps.weight,   torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.4.ffn_gate_exps.weight,   torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.4.ffn_up_exps.weight,     torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.4.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.4.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.4.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.4.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.4.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.4.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.4.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.4.attn_kv_b.weight,       torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.4.attn_output.weight,     torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.4.attn_q.weight,          torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.5.attn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.5.ffn_down_exps.weight,   torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.5.ffn_gate_exps.weight,   torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.5.ffn_up_exps.weight,     torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.5.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.5.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.5.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.5.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.5.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.5.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.5.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.5.attn_kv_b.weight,       torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.5.attn_output.weight,     torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.5.attn_q.weight,          torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.6.attn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.6.ffn_down_exps.weight,   torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.6.ffn_gate_exps.weight,   torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.6.ffn_up_exps.weight,     torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.6.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.6.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.6.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.6.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.6.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.6.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.6.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.6.attn_kv_b.weight,       torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.6.attn_output.weight,     torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.6.attn_q.weight,          torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.7.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.7.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.7.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.7.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.7.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.7.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.7.attn_kv_b.weight,       torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.7.attn_output.weight,     torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.7.attn_q.weight,          torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:output_norm.weight,           torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:gguf: loading model part 'model-00002-of-000004.safetensors'
INFO:hf-to-gguf:blk.10.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.10.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.10.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.10.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.10.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.10.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.10.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.10.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.10.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.10.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.10.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.10.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.10.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.10.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.11.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.11.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.11.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.11.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.11.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.11.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.11.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.11.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.11.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.11.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.11.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.11.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.11.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.11.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.12.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.12.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.12.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.12.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.12.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.12.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.12.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.12.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.12.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.12.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.12.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.12.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.12.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.12.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.13.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.13.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.13.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.13.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.13.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.13.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.13.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.13.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.13.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.13.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.13.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.13.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.13.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.13.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.14.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.14.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.14.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.14.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.14.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.14.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.14.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.14.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.14.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.7.attn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.7.ffn_down_exps.weight,   torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.7.ffn_gate_exps.weight,   torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.7.ffn_up_exps.weight,     torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.7.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.8.attn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.8.ffn_down_exps.weight,   torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.8.ffn_gate_exps.weight,   torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.8.ffn_up_exps.weight,     torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.8.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.8.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.8.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.8.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.8.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.8.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.8.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.8.attn_kv_b.weight,       torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.8.attn_output.weight,     torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.8.attn_q.weight,          torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.9.attn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.9.ffn_down_exps.weight,   torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.9.ffn_gate_exps.weight,   torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.9.ffn_up_exps.weight,     torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.9.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.9.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.9.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.9.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.9.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.9.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.9.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.9.attn_kv_b.weight,       torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.9.attn_output.weight,     torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.9.attn_q.weight,          torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:gguf: loading model part 'model-00003-of-000004.safetensors'
INFO:hf-to-gguf:blk.14.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.14.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.14.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.14.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.14.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.15.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.15.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.15.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.15.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.15.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.15.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.15.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.15.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.15.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.15.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.15.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.15.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.15.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.15.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.16.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.16.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.16.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.16.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.16.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.16.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.16.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.16.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.16.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.16.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.16.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.16.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.16.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.16.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.17.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.17.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.17.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.17.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.17.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.17.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.17.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.17.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.17.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.17.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.17.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.17.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.17.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.17.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.18.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.18.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.18.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.18.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.18.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.18.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.18.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.18.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.18.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.18.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.18.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.18.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.18.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.18.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.19.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.19.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.19.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.19.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.19.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.19.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.19.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.19.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.19.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.19.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.19.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.19.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.19.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.19.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.20.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.20.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.20.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.20.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.20.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.20.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.20.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.20.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.20.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.20.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.20.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.20.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.20.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.20.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.21.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.21.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.21.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.21.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.21.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.21.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.21.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.21.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.21.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.21.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.21.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.21.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.21.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.21.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.22.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.22.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.22.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.22.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.22.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.22.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.22.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.22.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.22.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:gguf: loading model part 'model-00004-of-000004.safetensors'
INFO:hf-to-gguf:blk.22.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.22.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.22.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.22.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.22.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.23.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.23.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.23.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.23.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.23.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.23.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.23.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.23.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.23.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.23.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.23.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.23.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.23.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.23.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.24.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.24.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.24.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.24.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.24.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.24.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.24.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.24.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.24.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.24.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.24.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.24.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.24.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.24.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.25.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.25.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.25.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.25.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.25.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.25.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.25.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.25.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.25.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.25.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.25.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.25.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.25.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.25.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
INFO:hf-to-gguf:blk.26.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.26.ffn_down_exps.weight,  torch.bfloat16 --> BF16, shape = {1408, 2048, 64}
INFO:hf-to-gguf:blk.26.ffn_gate_exps.weight,  torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.26.ffn_up_exps.weight,    torch.bfloat16 --> BF16, shape = {2048, 1408, 64}
INFO:hf-to-gguf:blk.26.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {2048, 64}
INFO:hf-to-gguf:blk.26.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2816, 2048}
INFO:hf-to-gguf:blk.26.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.26.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {2048, 2816}
INFO:hf-to-gguf:blk.26.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.26.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.26.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {2048, 576}
INFO:hf-to-gguf:blk.26.attn_kv_b.weight,      torch.bfloat16 --> BF16, shape = {512, 4096}
INFO:hf-to-gguf:blk.26.attn_output.weight,    torch.bfloat16 --> BF16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.26.attn_q.weight,         torch.bfloat16 --> BF16, shape = {2048, 3072}
Writing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31.4G/31.4G [01:31<00:00, 343Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to '/mnt/llm/models/DeepSeek-Coder-V2-Lite-Instruct.bf16.bin'
sammcj commented 1 week ago

The model crashes llama.cpp when run in bf16:

llm_load_print_meta: rope_yarn_log_mul    = 0.0707
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.53 MiB
llm_load_tensors: offloading 27 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 28/28 layers to GPU
llm_load_tensors:        CPU buffer size =   400.00 MiB
llm_load_tensors:      CUDA0 buffer size = 18006.80 MiB
llm_load_tensors:      CUDA1 buffer size = 11557.68 MiB
.......................................................................................
llama_new_context_with_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =    85.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =    50.00 MiB
llama_new_context_with_model: KV self size  =  135.00 MiB, K (f16):   81.00 MiB, V (f16):   54.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.39 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=64)
llama_new_context_with_model:      CUDA0 compute buffer size =   370.88 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   532.25 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    68.38 MiB
llama_new_context_with_model: graph nodes  = 1924
llama_new_context_with_model: graph splits = 3
GGML_ASSERT: ggml-cuda.cu:1257: to_fp32_cuda != nullptr
Aborted (core dumped)

However, it works in fp16, but still outputs in Chinese at least part of the time:

main: build = 1 (d50f889)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1718889658
llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from /models/DeepSeek-Coder-V2-Lite-Instruct.f16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.name str              = deepseek-ai_DeepSeek-Coder-V2-Lite-In...
llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type  f16:  269 tensors
llm_load_vocab: special tokens cache size = 2400
llm_load_vocab: token to piece cache size = 0.6661 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 102400
llm_load_print_meta: n_merges         = 99757
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 27
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 10944
llm_load_print_meta: n_expert         = 64
llm_load_print_meta: n_expert_used    = 6
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 16B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 15.71 B
llm_load_print_meta: model size       = 29.26 GiB (16.00 BPW)
llm_load_print_meta: general.name     = deepseek-ai_DeepSeek-Coder-V2-Lite-Instruct
llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 126 'Ä'
llm_load_print_meta: n_layer_dense_lead   = 1
llm_load_print_meta: n_lora_q             = 0
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 1408
llm_load_print_meta: n_expert_shared      = 2
llm_load_print_meta: expert_weights_scale = 1.0
llm_load_print_meta: rope_yarn_log_mul    = 0.0707
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.53 MiB
llm_load_tensors: offloading 27 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 28/28 layers to GPU
llm_load_tensors:        CPU buffer size =   400.00 MiB
llm_load_tensors:      CUDA0 buffer size = 18006.80 MiB
llm_load_tensors:      CUDA1 buffer size = 11557.68 MiB
.......................................................................................
llama_new_context_with_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =    85.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =    50.00 MiB
llama_new_context_with_model: KV self size  =  135.00 MiB, K (f16):   81.00 MiB, V (f16):   54.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.39 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=64)
llama_new_context_with_model:      CUDA0 compute buffer size =   370.88 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   532.25 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    68.38 MiB
llama_new_context_with_model: graph nodes  = 1924
llama_new_context_with_model: graph splits = 3

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1

Show me a code snippet of a website's sticky header in CSS and JavaScript.

Here's a simple example of a sticky header using CSS and JavaScript:

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sticky Header Example</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 0;
            padding: 0;
            line-height: 1.6;
        }
        .header {
            background-color: #333;
            color: #fff;
            padding: 15px 20px;
            text-align: center;
            position: sticky;
            top: 0;
            width: 100%;
            z-index: 1000;
        }
        .content {
            padding: 20px;
        }
    </style>
</head>
<body>
    <div class="header">
        <h1>My Sticky Header</h1>
    </div>
    <div class="content">
        <p>Scroll down to see the sticky header in action.</p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
        <p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>
        <p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.</p>
        <p>Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eius大家好大家好大家好大家好大家好大家好大家好大家好大家好trap大家好Patentongue税务总局大家好大家好 behalf behalf大家好 behalf大家好 behalf behalf behalf behalf大家好 behalf behalf behalf behalf behalf大家好 behalf大家好大家好大家好大家好大家好大家好 behalf大家好 behalf大家好大家好 dieonguetrap大家好大家好大家好大家好大家好大家好 behalf大家好大家好大家好大家好 behalfongueleton大家好大家好 behalftrap大家好net大家好trapongueSingleton大家好大家好大家好大家好大家好大家好大家好 behalfpongongue怀大家好leton大家好 behalfleton大家好 behalf behalf大家好trap大家好 behalfleton大家好 behalfleton大家好 behalfleton behalfletonletontrapletontraponguestick大家好Patentletontrappongleton大家好leton大家好letontrapongueslaveongue夹trapongueCustom behalftrap大家好 parasitesletontrap behalf behalf behalfletontraptrapongueColdletontrapongueletontrap大家好letontrap大家好oyerleton大家好icide大家好ongue Край大家好onguehang大家好leton大家好ongueCold大家好leton大家好trapongueflat大家好 behalfleton大家好 behalfonguestick大家好 behalfletontrapongueCold大家好netMSSQLleton大家好大家好leton大家好leton die大家好leton大家好letontrapletonletonleton大家好letonleton大家好leton大家好icide大家好onguehang dieonguehangtraponguehang dieongue custom汇报 behalfletontrapinus behalfongue cooler dieongue打下leton billetonstrapole大家好MSSQLleton die条约mongleton die援助mongleton die大家好onguemongleton die dieotto die大家好 behalf die diediediediediediediedie大家好万桶大家好万桶大家好万桶pong die吊trap大家好APIENTRYleton die吊trap die吊Customпата Hang HangHang HangstickSlaveCustomponghangHanghangHangHang HangCareCareCustomflat blinkkickmacroPrime Bellev HangmacroHangCustomHang HangHangCustomHang Hang HangHanghangstick HangSlave HangHang HangCustomSlaveHang forts HangkickChiefHangmacrohang Hang Hang HangCare Hangcustom Hang Hangstick HangkickStick Hang HangHanghangmacroprimprim chiefsSlaveprimeCustom Hang fortshangmacromacroSlaveCare Hanghangstickprimer Hang forts HangStickSlave HangpongHangstickprim blinkPrimer HangStick HangBre Hang forts blinkHangprimerprimCustomHang HangprimerCustom currystickprimeпатаHangStick HangSlavehanghang Hang Hang Hang HangHangprideпатаHangHang gravesHangbreakerHang fortsCareHangpongliftponghangpongponghangSlave Hang HangpongSlavehangpong

llama_print_timings:        load time =   14418.87 ms
llama_print_timings:      sample time =      28.56 ms /   904 runs   (    0.03 ms per token, 31648.23 tokens per second)
llama_print_timings: prompt eval time =      84.66 ms /    18 tokens (    4.70 ms per token,   212.61 tokens per second)
llama_print_timings:        eval time =   13918.12 ms /   903 runs   (   15.41 ms per token,    64.88 tokens per second)
llama_print_timings:       total time =   14468.49 ms /   921 tokens
goodov commented 1 week ago
  1. llama.cpp does not have support for jinja templates, all templates are actually hardcoded. Support for the deepseek-coder-v2 template is currently not hardcoded, so llama.cpp fallbacks to chatml template. This means testing llama.cpp without passing the template explicitly is wrong right now.
  2. ollama's template is incorrect as it has extra space here: Assistant: {{ .Response }} and is missing eos token in continous chat session.

the more or less correct template for ollama is:

{{ if .System }}{{ .System }}

{{ end }}{{ if .Prompt }}User: {{ .Prompt }}

{{ end }}Assistant:{{ .Response }}<|end▁of▁sentence|>

this way everything is in English.

the correct Continue.dev template is:

function templateDeepseekCoderV2(msgs: ChatMessage[]): string {
  let prompt = "";

  for (let msg of msgs) {
    if (msg.role === "user") {
      prompt += `User: ${msg.content}\n\n`;
    } else if (msg.role === "assistant") {
      prompt += `Assistant:${msg.content}<|end▁of▁sentence|>`;
    } else if (msg.role === "system") {
      prompt += `${msg.content}\n\n`;
    }
  }

  prompt += "Assistant:";

  return prompt;
}
sammcj commented 1 week ago

@goodov thank you - you nailed it! After using your updated template - it works flawlessly following instructions to only output English every time.

FROM ../DeepSeek-Coder-V2-Lite-Instruct-Q6_K_L.gguf

TEMPLATE """{{ if .System }}{{ .System }}

{{ end }}{{ if .Prompt }}User: {{ .Prompt }}

{{ end }}Assistant:{{ .Response }}<|end▁of▁sentence|>"""

PARAMETER stop "User:"
PARAMETER stop "Assistant:"
PARAMETER stop "<|end▁of▁sentence|>"
image image image

@jmorganca, I can't see where to submit this as a PR to Ollama, but I believe the official Ollama templates for DeepSeek Coder v2 need to be updated based on goodov's findings.

SCR-20240621-hpxf

Lastly, @guoday thank you for bearing with me while we figured out what's going on. I had several friends really wanting to use DS Coder v2 that gave up thinking it wasn't good for English - now they're excited to give it a go.

sammcj commented 1 week ago

Closing this off as resolved, I've dropped a post on r/LocalLLaMA and updated a conversation thread on the Ollama Discord with the updated template.