ggerganov / llama.cpp

LLM inference in C/C++

ggml_validate_row_data finding nan value for IQ4_NL #7311

Closed (bartowski1182 closed this issue 4 months ago)

bartowski1182 commented 4 months ago

Using b2854

Converted Hermes-2-Theta-Llama-3-8B to F32, then computed the imatrix with https://gist.github.com/bartowski1182/b6ac44691e994344625687afe3263b3a

Upon quantizing, all sizes work fine except for IQ4_NL, which produces this output:

load_imatrix: imatrix dataset='/training_data/calibration_data.txt'
load_imatrix: loaded 224 importance matrix entries from /models/Hermes-2-Theta-Llama-3-8B-GGUF/Hermes-2-Theta-Llama-3-8B.imatrix computed on 189 chunks
prepare_imatrix: have 224 importance matrix entries
main: build = 2854 (72c177c1)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/models/Hermes-2-Theta-Llama-3-8B-GGUF/Hermes-2-Theta-Llama-3-8B-f32.gguf' to '/models/Hermes-2-Theta-Llama-3-8B-GGUF/Hermes-2-Theta-Llama-3-8B-IQ4_NL.gguf' as IQ4_NL
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /models/Hermes-2-Theta-Llama-3-8B-GGUF/Hermes-2-Theta-Llama-3-8B-f32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Hermes-2-Theta-Llama-3-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 0
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128003
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{bos_token}}{% for message in messag...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  291 tensors
================================ Have weights data with 224 entries
[   1/ 291]                    token_embd.weight - [ 4096, 128256,     1,     1], type =    f32,
====== llama_model_quantize_internal: did not find weights for token_embd.weight
converting to iq4_nl .. ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 128
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 384
ggml_validate_row_data: found nan value at block 128
ggml_validate_row_data: found nan value at block 128
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 128
ggml_validate_row_data: found nan value at block 384
ggml_validate_row_data: found nan value at block 256
ggml_validate_row_data: found nan value at block 256
ggml_validate_row_data: found nan value at block 384
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 256
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 384
ggml_validate_row_data: found nan value at block 128
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 0
ggml_validate_row_data: found nan value at block 128
ggml_validate_row_data: found nan value at block 0
[the "found nan value at block 0" line repeats ~33 more times, truncated here]
llama_model_quantize: failed to quantize: quantized data validation failed
main: failed to quantize model from '/models/Hermes-2-Theta-Llama-3-8B-GGUF/Hermes-2-Theta-Llama-3-8B-f32.gguf'

When I say "all sizes work fine", I mean all of these work:

IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, Q2_K, IQ3_XXS, IQ3_XS, IQ3_S, IQ3_M, Q3_K_S, Q3_K_M, Q3_K_L, IQ4_XS, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, Q8_0
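
For context, the validator emitting those lines checks each quantized block for non-finite values before the file is written. Below is a minimal, self-contained sketch of that kind of check for IQ4_NL data; the block layout and field names are assumptions based on ggml-quants.h, and this is an illustration, not the actual ggml_validate_row_data implementation. The only floating-point field in an IQ4_NL block is the fp16 scale, so a NaN there is the direct fingerprint of a NaN produced during (or fed into) quantization.

#include <math.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define QK4_NL 32  /* weights per IQ4_NL block */

/* Block layout assumed from ggml-quants.h: one fp16 scale plus
   16 bytes of packed 4-bit indices into the non-linear lookup table. */
typedef struct {
    uint16_t d;              /* per-block scale, IEEE fp16 bit pattern */
    uint8_t  qs[QK4_NL / 2]; /* packed 4-bit quants */
} block_iq4_nl;

/* Widen an IEEE-754 half to float; we only need NaN/Inf to survive. */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);  /* Inf or NaN */
    } else if (exp == 0) {
        bits = sign;                               /* flush subnormals to 0 */
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Scan every block's scale; a non-finite value here is what the
   log above is reporting. */
static bool validate_iq4_nl_rows(const block_iq4_nl *blocks, size_t nblocks) {
    bool ok = true;
    for (size_t i = 0; i < nblocks; ++i) {
        const float d = half_to_float(blocks[i].d);
        if (isnan(d) || isinf(d)) {
            fprintf(stderr, "found nan value at block %zu\n", i);
            ok = false;
        }
    }
    return ok;
}

int main(void) {
    block_iq4_nl blocks[2];
    memset(blocks, 0, sizeof blocks);
    blocks[1].d = 0x7E00;  /* fp16 quiet-NaN bit pattern */
    return validate_iq4_nl_rows(blocks, 2) ? 0 : 1;
}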

slaren commented 4 months ago

If it is not too much trouble, can you upload the f32 model that you used? I don't think the imatrix matters here since the token embeddings don't use it.
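
For context on why the imatrix is unlikely to be the culprit for this tensor: during quantization the importance weights are looked up per tensor name, and token_embd.weight has no entry, which is exactly what the earlier log line "did not find weights for token_embd.weight" reports. A rough sketch of that lookup, with hypothetical types and names rather than llama.cpp's actual internals:

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical shape of a loaded imatrix entry (the log above loads 224). */
typedef struct {
    const char  *name;    /* tensor name, e.g. "blk.0.ffn_down.weight" */
    const float *weights; /* per-column importance values */
} imatrix_entry;

static const float *find_imatrix(const imatrix_entry *entries, size_t n,
                                 const char *tensor_name) {
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(entries[i].name, tensor_name) == 0) {
            return entries[i].weights;
        }
    }
    return NULL;  /* token_embd.weight lands here */
}

static void quantize_one_tensor(const imatrix_entry *entries, size_t n,
                                const char *tensor_name) {
    const float *importance = find_imatrix(entries, n, tensor_name);
    if (importance == NULL) {
        fprintf(stderr, "did not find weights for %s\n", tensor_name);
    }
    /* ... quantize the tensor, with or without `importance` ... */
}

int main(void) {
    /* No entries loaded: the embedding tensor falls back to no imatrix. */
    quantize_one_tensor(NULL, 0, "token_embd.weight");
    return 0;
}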

bartowski1182 commented 4 months ago

@slaren no problem at all, uploaded here:

https://huggingface.co/bartowski/Hermes-2-Theta-Llama-3-8B-GGUF/blob/main/Hermes-2-Theta-Llama-3-8B-f32.gguf

bartowski1182 commented 4 months ago

@slaren it happened again with the Granite 34B model and Q2_K, with your changes included (b2928)

I'm out, so I don't have good access to my logs. The f32 will go up in a couple of hours and I'll link you to it; I just figured I'd let you know in advance.

bartowski1182 commented 4 months ago

@slaren f32 going up here:

https://huggingface.co/bartowski/granite-34b-code-instruct-GGUF

It failed on other types too (Q3_K_S); I'm not sure how many would fail, but they fail in the same way.

I can grab the log if that would help; I've moved on to other things in the meantime.

Any chance that bf16 or f16 wouldn't face this issue?

slaren commented 4 months ago

> Any chance that bf16 or f16 wouldn't face this issue?

I don't think so; the tensors are converted to f32 before being quantized regardless.
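
A small illustration of this point, assuming the usual widening step (the function names here are illustrative, not llama.cpp's): bf16 and f16 NaN bit patterns survive conversion to f32, so a NaN weight reaches the quantizer from any source precision.

#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* bf16 is the top 16 bits of an f32, so widening is just a shift;
   a bf16 NaN therefore widens to an f32 NaN. (The same holds for f16.) */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Illustrative conversion step: whatever the on-disk type, the row is
   expanded into an f32 scratch buffer and that buffer is quantized. */
static void expand_row_bf16(const uint16_t *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = bf16_to_f32(src[i]);  /* NaNs pass straight through */
    }
    /* ... the IQ4_NL / Q2_K / ... quantizer then runs on dst ... */
}

int main(void) {
    const uint16_t src[1] = { 0x7FC0 };  /* bf16 quiet-NaN bit pattern */
    float dst[1];
    expand_row_bf16(src, dst, 1);
    assert(isnan(dst[0]));  /* the NaN survives the widening */
    return 0;
}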

slaren commented 4 months ago

I didn't get any errors when quantizing to Q3_K_S. It may depend on the imatrix being used; can you upload that too?

bartowski1182 commented 4 months ago

Uploaded