ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: (CUDA) Corrupted output when offloading to multiple GPUs #8685

Closed matteoserva closed 2 months ago

matteoserva commented 2 months ago

What happened?

Problem

Some models produce a corrupted output when offloading to multiple CUDA GPUs. The problem disappears when offloading to a single GPU or using CPU only.

I was able to reproduce the problem in:

while I was unable to reproduce it in:

Bug 1

When offloading to multiple GPUs, the model gives the wrong answer. It seems unable to parse the prompt correctly.

Bug 2

When a second prompt is sent to the model, the model reuses information from the first prompt.

Steps to reproduce Bug 1

Steps to reproduce Bug 2

The second prompt shares a common prefix with the first prompt.

Full log for the correct answer:

Here is the output:

{
  "answers": {
    "alice": 7,
    "bob": 8,
    "charlie": 7
  },
  "reasoning": "The users provided different answers based on their step-by-step breakdowns of the problem, but they all reached the conclusion that Matteo discards a quarter of his remaining fruits equally between apples and oranges.",
  "who is right": "It's a tie between Alice and Charlie, as they both answered 7 apples remaining.",
  "answer": 7
}

Note that since Alice and Charlie have the same answer, 7, they are considered the "right" answer based on majority voting.

Full log for the wrong answer:

Here is the output:

{
  "answers": {
    "alice": 7,
    "bob": 7,
    "charlie": null
  },
  "reasoning": "Both Alice and Bob provided the same answer, which is 7 apples. Since Charlie did not provide an answer, we use majority voting to determine the correct answer.",
  "who is right": "Alice and Bob",
  "answer": 7
}

In this case, both Alice and Bob provided the same answer, which is 7 apples. Since Charlie did not provide an answer, we use majority voting to determine the correct answer, which is 7 apples.

My setup:

Linux with:

Name and Version

version: 3463 (dc820804) built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

The reference model is Llama 3 8B Instruct by bartowski, SHA256: d6f1dcba991f8e629531a5f4cf19e4dbc2a89f80fd20737e256b92aac11572f1

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

JohannesGaessler commented 2 months ago

This is not conclusive evidence of an actual bug. Floating point rounding error differs between GPUs, so the results from an RTX 4060, an RTX 3060, and a combination of the two will not be bit-for-bit identical. It is therefore expected that some inputs will yield better or worse results simply by random chance. I would only consider statistically significant differences measured via llama-perplexity to be conclusive evidence.
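
A minimal Python sketch of the underlying effect (purely illustrative; it does not show how llama.cpp actually partitions work across devices): floating point addition is not associative, so evaluating the same reduction in a different grouping or order can change the low bits of the result.

import random

# Floating point addition is not associative: the grouping changes the result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False

# The same thing happens when one long reduction is split into partial sums,
# e.g. across two devices, and the partial results are combined afterwards.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
sequential = sum(xs)
split = sum(xs[:50_000]) + sum(xs[50_000:])
print(sequential == split)          # typically False; values differ in the last bits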

I get bit-for-bit identical results between 1x RTX 4090 and 2x RTX 4090 so I don't think there is a bug in the multi GPU code.

matteoserva commented 2 months ago

@JohannesGaessler

Bug 2 should be conclusive enough evidence.

When the second prompt is sent, the model answers as if it had been given the first one.

Here is a rewording of what happened:

  1. I send a problem, specifying that the answer must be in a specific JSON format.
  2. The model answers in JSON.
  3. I send the same problem without specifying the answer format (apples_simple.txt).
  4. The model answers with the same JSON response, using the exact schema that was specified in the first prompt, even though none of this was specified in the second prompt.

EDIT:

llama-perplexity gave approximately the same values in both cases, with one and with two GPUs.

JohannesGaessler commented 2 months ago

You didn't post any outputs for the second prompt, and even then I don't see any reason why this would be an issue with multi GPU setups in particular. The multi GPU code can only be the problem if it produces numerical outputs that are incorrect beyond differences in floating point rounding error. Information being leaked between prompts should only happen if there is something wrong with KV cache management, but I don't see why this would happen here, or more specifically why it would only happen with multiple GPUs.

matteoserva commented 2 months ago

For this test I'm only changing the tensor split, to keep the differences to a minimum. I tokenized the prompt in a separate step and I'm sending the token list to the server. The first GPU is the 4060; the second is the 3060.
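
For reference, a rough Python sketch of this flow. The endpoint paths (/tokenize, /completion) and the field names used here are my assumptions about the llama-server HTTP API, shown only to make the test procedure concrete:

import json
import urllib.request

BASE = "http://localhost:8080"  # assumed llama-server address/port

def post(path: str, payload: dict) -> dict:
    # Small helper that POSTs a JSON body and decodes the JSON response.
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Tokenize once in a separate step, then send the token ids directly,
# so that only the tensor split changes between runs.
with open("apples_majority.txt") as f:
    text = f.read()
tokens = post("/tokenize", {"content": text})["tokens"]
reply = post("/completion", {"prompt": tokens, "n_predict": 512})
print(reply["content"])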

With -ts 100,0, correct replies

With ./llama-server -ngl 175 -t 6 -c 8192 --host 0.0.0.0 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -fa --override-kv tokenizer.ggml.add_bos_token=bool:false -ts 100,0

result from apples_majority.txt: Here is the output:

{
  "answers": {
    "alice": 7,
    "bob": 8,
    "charlie": 7
  },
  "reasoning": "The users provided different answers based on their individual breakdowns of the problem.",
  "who is right": "It's a tie between Alice and Charlie, both of whom answered 7 apples remaining.",
  "answer": 7
}

after apples_simple.txt:

The answers from the individual users are:

* Alice: 7 apples
* Bob: 8 apples
* Charlie: 7 apples

To determine the answer by majority voting, we can count the number of users who answered each option:

* 7 apples: 2 users (Alice and Charlie)
* 8 apples: 1 user (Bob)

Since 2 users answered 7 apples and only 1 user answered 8 apples, the answer chosen by majority voting is:

* 7 apples

Both answers are consistent with the provided prompt.

With -ts 50,50, unwanted replies

./llama-server -ngl 175 -t 6 -c 8192 --host 0.0.0.0 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -fa --override-kv tokenizer.ggml.add_bos_token=bool:false -ts 50,50

after apples_majority.txt

{
  "answers": {
    "alice": 7,
    "bob": 7,
    "charlie": null
  },
  "reasoning": "Both Alice and Bob provided the same answer, 7 apples, which is the majority vote.",
  "who is right": "Alice and Bob",
  "answer": 7
}

after apples_simple.txt

{
  "answers": {
    "alice": 7,
    "bob": 7,
    "charlie": null
  },
  "reasoning": "Both Alice and Bob arrived at the same answer, 7 apples, despite having slightly different steps. Charlie did not provide an answer.",
  "who is right": "Both Alice and Bob",
  "answer": 7
}

The first answer is wrong, but it respects the requested format. The second reply is the answer to the first prompt, even though the second prompt was different.

With -ts 75,25

With the split at 75,25 I'm getting all wrong answers, as in the 50,50 case.

With -ts 25,75

With the split at 25,75 I'm getting all correct answers, the same as in the 100,0 case.

With -ts 0,100

With the split at 0,100 I'm getting all correct answers, the same as in the 100,0 case.

JohannesGaessler commented 2 months ago

I am assuming that you are not using prompt caching since it is disabled by default. In that case, do you get the "wrong" JSON reply to apples_simple.txt if you don't run apples_majority.txt?

matteoserva commented 2 months ago

I am assuming that you are not using prompt caching since it is disabled by default. In that case, do you get the "wrong" JSON reply to apples_simple.txt if you don't run apples_majority.txt?

Prompt cache was enabled. I didn't mention it. I'm sorry.

With the prompt cache disabled I'm no longer getting the JSON output from apples_simple. In retrospect it should have been the first thing to check (I should open a new issue for this). It is still strange that the JSON issue is linked to the choice of tensor split.
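
A minimal sketch of the per-request toggle, under the assumption that the server's /completion endpoint accepts a cache_prompt flag; the URL and field names are my best understanding, not taken from the logs above:

import json
import urllib.request

payload = {
    "prompt": open("apples_simple.txt").read(),
    "n_predict": 512,
    "cache_prompt": False,  # do not reuse KV-cache entries from a previous prompt
}
req = urllib.request.Request(
    "http://localhost:8080/completion",  # assumed llama-server address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])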

Even with the cache disabled, the model gives me wrong answers when running on multiple GPUs, for any choice of seed. It seems like a different problem from rounding errors. Is there a way to rule out rounding errors? For example, running everything in FP32?

Here are the custom parameters I'm now using:

With -ts 100,0

With ./llama-server -ngl 175 -t 6 -c 8192 --host 0.0.0.0 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -fa --override-kv tokenizer.ggml.add_bos_token=bool:false -ts 100,0

result from apples_majority.txt:

Here is the output:

{
  "answers": {
    "alice": 7,
    "bob": 8,
    "charlie": 7
  },
  "reasoning": "The users provided different answers based on their individual breakdowns of the problem.",
  "who is right": "It's a tie between Alice and Charlie, both of whom answered 7 apples remaining.",
  "answer": 7
}

after apples_simple.txt:

The answers from the individual users are:

* Alice: 7 apples
* Bob: 8 apples
* Charlie: 7 apples

To determine the answer by majority voting, we can count the number of users who answered each option:

* 7 apples: 2 users (Alice and Charlie)
* 8 apples: 1 user (Bob)

Since 2 users answered 7 apples and only 1 user answered 8 apples, the answer chosen by majority voting is:

* 7 apples

With -ts 50,50

apples_majority.txt:

Here is the output:

{
  "answers": {
    "alice": 7,
    "bob": 7,
    "charlie": null
  },
  "reasoning": "Both Alice and Bob provided the same answer, 7 apples, which is the majority vote.",
  "who is right": "Alice and Bob",
  "answer": 7
}

apples_simple.txt

The answers from the individual users are:

* Alice: 7 apples
* Bob: 7 apples
* Charlie: 7 apples

The majority voting result is:

* 7 apples

The output of the problem chosen by majority voting is:

Matteo has 7 apples remaining.

JohannesGaessler commented 2 months ago

Is there a way to rule out rounding errors?

Yes, you can rule out differences in rounding error (which should on average not affect the percentage of correct/wrong answers) by collecting a large amount of data and conducting a statistical analysis of it. For a simple problem where you're only checking whether the answer is correct or not, the uncertainty on the probability $p$ of receiving a correct answer using a sample size of $n$ can be estimated as

$$ \Delta p = \sqrt{\frac{p (1 - p)}{n}} . $$

As you can see, you would need a very large sample size of 1000 or more to get good precision (also because this formula is only valid in the large-sample limit). That is why I told you to use llama-perplexity, where the model can be evaluated on hundreds of thousands of tokens in a way that also estimates the statistical significance.
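
To make the sample-size argument concrete, a small Python sketch of this estimate (the probabilities below are illustrative, not measured values):

import math

def uncertainty(p: float, n: int) -> float:
    # Large-sample estimate of the statistical uncertainty on a measured
    # success probability p obtained from n independent trials.
    return math.sqrt(p * (1.0 - p) / n)

# e.g. if roughly 70% of answers are correct:
for n in (10, 100, 1000, 10000):
    print(f"n={n:>5}  delta_p={uncertainty(0.7, n):.4f}")
# n=   10  delta_p=0.1449   -> far too coarse to resolve small differences
# n= 1000  delta_p=0.0145   -> roughly the sample size referred to above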

For example, running everything in FP32?

There will be less rounding error if you use FP32, but due to the structure of neural networks it is fundamentally impossible to calculate an upper bound on how much the output will change given a small perturbation of the inputs. It is not a reliable way to rule out differences in rounding error; only tests with a large sample size and a statistical analysis are.

matteoserva commented 2 months ago

Closing this for now as I'm convinced. Thanks again for your help.

matteoserva commented 2 months ago

Hi @JohannesGaessler, I'm reopening this issue because I'm getting consistently worse results when running with multiple GPUs. I reran the perplexity test: I'm getting a PPL of 7.1012 with a single GPU and 26727.0947 with two GPUs (full logs below).

The setup is the same as described in the previous messages. The model is Llama 3 8B Instruct.

Single GPU:

CUDA_VISIBLE_DEVICES=0 ./llama-perplexity -ngl 99 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw -fa -c 8192 -s 0
main: build = 3497 (ebb346a2)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/matteo/tmp/models_cache/Meta-Llama-3-8B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 18
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/Meta-Llama-3-8B-Instruct-GGUF...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/groups_merged.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 88
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 6.14 GiB (6.56 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   410.98 MiB
llm_load_tensors:      CUDA0 buffer size =  5871.99 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 242.98 ms
perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=2048, n_seq=1
perplexity: 4.45 seconds per pass - ETA 2.58 minutes
[1]9.1824,[2]6.7775,[3]7.1044,[4]7.3373,[5]7.1329,[6]7.2487,[7]7.5710,[8]7.1893,[9]6.8932,[10]6.5769,[11]6.9091,[12]7.0030,[13]6.8735,[14]6.6412,[15]6.7508,[16]6.6200,[17]6.5796,[18]6.6744,[19]6.6567,[20]6.6483,[21]6.6188,[22]6.6871,[23]6.8173,[24]6.9087,[25]6.9748,[26]7.0204,[27]6.9758,[28]7.0248,[29]7.0052,[30]6.9629,[31]6.9656,[32]6.9595,[33]6.9854,[34]7.0747,[35]7.1012,
Final estimate: PPL = 7.1012 +/- 0.05100

llama_print_timings:        load time =    1597.90 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  136799.15 ms / 286720 tokens (    0.48 ms per token,  2095.92 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  153523.82 ms / 286721 tokens

Dual GPU (asymmetric):

CUDA_VISIBLE_DEVICES=0,1 ./llama-perplexity -ngl 99 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw -fa -c 8192 -s 0
main: build = 3497 (ebb346a2)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/matteo/tmp/models_cache/Meta-Llama-3-8B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 18
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/Meta-Llama-3-8B-Instruct-GGUF...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/groups_merged.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 88
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 6.14 GiB (6.56 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   410.98 MiB
llm_load_tensors:      CUDA0 buffer size =  3413.12 MiB
llm_load_tensors:      CUDA1 buffer size =  2458.87 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   640.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   184.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   322.52 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    72.02 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 3

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 238.158 ms
perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=2048, n_seq=1
perplexity: 3.62 seconds per pass - ETA 2.10 minutes
[1]3442.9673,[2]10336.3683,[3]15443.5686,[4]15998.9188,[5]18452.0495,[6]18443.6264,[7]19458.2455,[8]24798.3870,[9]26686.5635,[10]27926.4615,[11]25999.4890,[12]25119.5888,[13]24874.4490,[14]25837.0276,[15]26135.7007,[16]26395.1369,[17]26796.1002,[18]26673.7400,[19]26749.3320,[20]27077.7530,[21]27593.9652,[22]27544.3975,[23]27031.6756,[24]26395.0489,[25]26066.3652,[26]26163.4192,[27]26356.6977,[28]26136.6444,[29]26369.4260,[30]26857.9713,[31]27047.4020,[32]27054.4395,[33]27226.6587,[34]26779.8673,[35]26727.0947,
Final estimate: PPL = 26727.0947 +/- 628.43824

llama_print_timings:        load time =    1735.99 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  105162.91 ms / 286720 tokens (    0.37 ms per token,  2726.44 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  121788.97 ms / 286721 tokens

JohannesGaessler commented 2 months ago

This is definitely indicative of a problem. Do you consistently get the same wrong value? Do you get wrong values when you don't increase the context size?

slaren commented 2 months ago

main: build = 3497 (ebb346a2)

I am not able to find this commit.

matteoserva commented 2 months ago

@slaren I was on my branch, but there were no changes to the relevant code. The parent commit is 7e72aa74fd676a093eb9970e761085ec22734c71. From now on I'm using master for the tests.

@JohannesGaessler Here are my findings so far:

It feels (subjectively) as if that part of the prompt is deleted and replaced with another piece of the prompt. The part that gets corrupted is different for different seeds.

llama-server launch command: CUDA_VISIBLE_DEVICES=0,1 ./llama-server -ngl 99 -t 6 -c 8192 --host 0.0.0.0 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -fa --override-kv tokenizer.ggml.add_bos_token=bool:false -ts 50,50

perplexity at default context, multi device

CUDA_VISIBLE_DEVICES=0,1 ./llama-perplexity -ngl 99 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw  -s 0
main: build = 3493 (7e72aa74)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from Meta-Llama-3-8B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 18
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/Meta-Llama-3-8B-Instruct-GGUF...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/groups_merged.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 88
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 6.14 GiB (6.56 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   410.98 MiB
llm_load_tensors:      CUDA0 buffer size =  3413.12 MiB
llm_load_tensors:      CUDA1 buffer size =  2458.87 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.96 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   208.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   306.52 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.02 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 3

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 227.272 ms
perplexity: calculating perplexity over 564 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.92 seconds per pass - ETA 2.17 minutes
[1]5.1489,[2]6.3096,[3]7.0102,[4]7.5845,[5]7.9727,[6]8.2444,[7]8.6831,[8]9.4177,[9]10.1248,[10]10.4992,[11]10.6242,[12]10.6847,[13]11.1240,[14]10.6108,[15]10.4896,[16]10.1652,[17]10.0895,[18]10.2005,[19]9.9029,[20]9.7389,[21]9.7575,[22]9.3420,[23]8.9818,[24]8.7760,[25]8.4781,[26]8.3677,[27]8.2412,[28]8.1061,[29]8.2162,[30]8.1910,[31]8.1701,[32]8.1273,[33]8.1779,[34]8.2145,[35]8.2579,[36]8.3699,[37]8.3277,[38]8.3574,[39]8.3259,[40]8.3448,[41]8.3367,[42]8.2508,[43]8.2860,[44]8.2190,[45]8.3323,[46]8.3616,[47]8.3622,[48]8.3437,[49]8.2920,[50]8.3610,[51]8.4321,[52]8.3978,[53]8.5268,[54]8.5405,[55]8.5387,[56]8.6077,[57]8.6335,[58]8.6515,[59]8.5777,[60]8.6286,[61]8.7183,[62]8.7939,[63]8.8735,[64]8.9666,[65]8.9453,[66]8.9254,[67]8.8932,[68]8.9318,[69]8.9937,[70]9.0093,[71]8.9810,[72]8.9236,[73]8.8851,[74]8.8897,[75]8.7876,[76]8.7517,[77]8.6861,[78]8.6990,[79]8.7169,[80]8.7297,[81]8.7135,[82]8.7301,[83]8.7446,[84]8.7238,[85]8.7191,[86]8.7023,[87]8.7989,[88]8.7878,[89]8.8115,[90]8.8170,[91]8.8072,[92]8.8036,[93]8.7859,[94]8.7854,[95]8.7715,[96]8.8100,[97]8.8211,[98]8.8193,[99]8.8159,[100]8.8088,[101]8.8067,[102]8.8491,[103]8.8772,[104]8.9470,[105]8.9361,[106]8.9909,[107]9.0116,[108]9.0186,[109]9.0779,[110]9.1335,[111]9.1481,[112]9.1065,[113]9.0967,[114]9.0893,[115]9.0695,[116]9.0733,[117]9.0606,[118]9.0301,[119]9.0024,[120]8.9757,[121]8.9434,[122]8.9199,[123]8.8909,[124]8.8344,[125]8.7739,[126]8.7430,[127]8.7077,[128]8.7049,[129]8.7085,[130]8.7240,[131]8.7266,[132]8.7049,[133]8.6750,[134]8.6923,[135]8.6862,[136]8.6869,[137]8.6958,[138]8.7254,[139]8.7567,[140]8.7304,[141]8.6823,[142]8.6414,[143]8.5777,[144]8.5320,[145]8.4731,[146]8.4305,[147]8.3961,[148]8.3675,[149]8.3388,[150]8.3149,[151]8.2766,[152]8.2359,[153]8.2022,[154]8.1570,[155]8.1296,[156]8.1092,[157]8.0739,[158]8.0676,[159]8.0381,[160]8.0229,[161]8.0427,[162]8.0412,[163]8.0652,[164]8.0727,[165]8.1074,[166]8.1463,[167]8.1718,[168]8.2201,[169]8.2425,[170]8.2804,[171]8.3249,[172]8.3374,[173]8.3469,[174]8.3400,[175]8.3665,[176]8.3744,[177]8.3829,[178]8.3988,[179]8.3976,[180]8.4073,[181]8.4095,[182]8.4201,[183]8.4460,[184]8.4632,[185]8.4746,[186]8.4784,[187]8.5073,[188]8.5243,[189]8.5435,[190]8.5570,[191]8.5437,[192]8.5310,[193]8.5170,[194]8.5148,[195]8.5521,[196]8.5443,[197]8.5435,[198]8.5308,[199]8.5205,[200]8.5020,[201]8.4651,[202]8.4606,[203]8.4211,[204]8.4129,[205]8.4018,[206]8.3866,[207]8.3756,[208]8.3837,[209]8.3911,[210]8.3936,[211]8.3739,[212]8.3421,[213]8.3314,[214]8.3388,[215]8.3233,[216]8.3246,[217]8.3007,[218]8.2810,[219]8.2735,[220]8.2692,[221]8.2436,[222]8.2237,[223]8.2077,[224]8.1985,[225]8.2029,[226]8.1928,[227]8.1686,[228]8.1620,[229]8.1514,[230]8.1340,[231]8.1323,[232]8.1373,[233]8.1476,[234]8.1495,[235]8.1684,[236]8.1717,[237]8.1912,[238]8.2034,[239]8.2112,[240]8.2145,[241]8.2187,[242]8.2366,[243]8.2394,[244]8.2636,[245]8.2910,[246]8.2976,[247]8.2949,[248]8.3073,[249]8.2948,[250]8.2632,[251]8.2481,[252]8.2243,[253]8.2131,[254]8.2102,[255]8.2176,[256]8.2142,[257]8.2129,[258]8.2074,[259]8.2025,[260]8.1919,[261]8.1706,[262]8.1562,[263]8.1512,[264]8.1324,[265]8.1324,[266]8.1142,[267]8.1061,[268]8.0955,[269]8.0882,[270]8.0780,[271]8.0714,[272]8.0751,[273]8.0435,[274]8.0225,[275]8.0285,[276]8.0281,[277]8.0126,[278]8.0018,[279]8.0055,[280]8.0184,[281]8.0305,[282]8.0478,[283]8.0534,[284]8.0560,[285]8.0740,[286]8.0776,[287]8.0854,[288]8.0777,[289]8.0739,[290]8.0736,[291]8.0794,[292]8.0712,[293]8.0732,[294]8.0788,[295]8.0775,[296]8.0808,[297]8.0792,[298]8.0735,[299]8.0784,[300]8.0819,[301]8.0758,[302]8.0692,[303]8.0707,[304]8.0588,[305]
8.0600,[306]8.0729,[307]8.0793,[308]8.0806,[309]8.0923,[310]8.0829,[311]8.0826,[312]8.0953,[313]8.1088,[314]8.1298,[315]8.1359,[316]8.1446,[317]8.1386,[318]8.1444,[319]8.1352,[320]8.1273,[321]8.1277,[322]8.1255,[323]8.1152,[324]8.1223,[325]8.1112,[326]8.1154,[327]8.1181,[328]8.1125,[329]8.1040,[330]8.0872,[331]8.0938,[332]8.0899,[333]8.0851,[334]8.0826,[335]8.0654,[336]8.0616,[337]8.0526,[338]8.0453,[339]8.0403,[340]8.0447,[341]8.0460,[342]8.0511,[343]8.0613,[344]8.0753,[345]8.0779,[346]8.0816,[347]8.0844,[348]8.0933,[349]8.0991,[350]8.1027,[351]8.1025,[352]8.1073,[353]8.1327,[354]8.1534,[355]8.1743,[356]8.1895,[357]8.2091,[358]8.2241,[359]8.2438,[360]8.2562,[361]8.2608,[362]8.2754,[363]8.2817,[364]8.2801,[365]8.2892,[366]8.3047,[367]8.3166,[368]8.3255,[369]8.3327,[370]8.3442,[371]8.3592,[372]8.3759,[373]8.3760,[374]8.3716,[375]8.3623,[376]8.3677,[377]8.3857,[378]8.4002,[379]8.4006,[380]8.3961,[381]8.3873,[382]8.3899,[383]8.3963,[384]8.3986,[385]8.4039,[386]8.4071,[387]8.4131,[388]8.4196,[389]8.4227,[390]8.4116,[391]8.3999,[392]8.3900,[393]8.3963,[394]8.3966,[395]8.3929,[396]8.3939,[397]8.4092,[398]8.4062,[399]8.4010,[400]8.4127,[401]8.4116,[402]8.4020,[403]8.4049,[404]8.4014,[405]8.4061,[406]8.4123,[407]8.4133,[408]8.4063,[409]8.4145,[410]8.4075,[411]8.4058,[412]8.3945,[413]8.3960,[414]8.4053,[415]8.4113,[416]8.4139,[417]8.4094,[418]8.4128,[419]8.4084,[420]8.4088,[421]8.4124,[422]8.4077,[423]8.4135,[424]8.4093,[425]8.3922,[426]8.3955,[427]8.3945,[428]8.3887,[429]8.3784,[430]8.3788,[431]8.3695,[432]8.3627,[433]8.3599,[434]8.3598,[435]8.3470,[436]8.3518,[437]8.3466,[438]8.3416,[439]8.3416,[440]8.3405,[441]8.3437,[442]8.3449,[443]8.3630,[444]8.3679,[445]8.3661,[446]8.3619,[447]8.3617,[448]8.3684,[449]8.3670,[450]8.3631,[451]8.3646,[452]8.3720,[453]8.3761,[454]8.3771,[455]8.3829,[456]8.3746,[457]8.3777,[458]8.3641,[459]8.3710,[460]8.3813,[461]8.3790,[462]8.3767,[463]8.3697,[464]8.3741,[465]8.3905,[466]8.3998,[467]8.3974,[468]8.3988,[469]8.3963,[470]8.3954,[471]8.3913,[472]8.3854,[473]8.3779,[474]8.3768,[475]8.3757,[476]8.3742,[477]8.3644,[478]8.3627,[479]8.3572,[480]8.3601,[481]8.3626,[482]8.3672,[483]8.3596,[484]8.3611,[485]8.3553,[486]8.3607,[487]8.3678,[488]8.3723,[489]8.3747,[490]8.3796,[491]8.3780,[492]8.3809,[493]8.3891,[494]8.3900,[495]8.3859,[496]8.3837,[497]8.3835,[498]8.3804,[499]8.3810,[500]8.3771,[501]8.3695,[502]8.3705,[503]8.3727,[504]8.3710,[505]8.3650,[506]8.3669,[507]8.3695,[508]8.3768,[509]8.3738,[510]8.3741,[511]8.3690,[512]8.3716,[513]8.3725,[514]8.3747,[515]8.3724,[516]8.3759,[517]8.3786,[518]8.3738,[519]8.3749,[520]8.3805,[521]8.3839,[522]8.3960,[523]8.3931,[524]8.3862,[525]8.3889,[526]8.3900,[527]8.3944,[528]8.3905,[529]8.3791,[530]8.3682,[531]8.3765,[532]8.3676,[533]8.3612,[534]8.3414,[535]8.3314,[536]8.3299,[537]8.3328,[538]8.3375,[539]8.3350,[540]8.3422,[541]8.3447,[542]8.3518,[543]8.3611,[544]8.3689,[545]8.3679,[546]8.3791,[547]8.3847,[548]8.3745,[549]8.3710,[550]8.3609,[551]8.3627,[552]8.3663,[553]8.3738,[554]8.3740,[555]8.3732,[556]8.3704,[557]8.3625,[558]8.3661,[559]8.3666,[560]8.3720,[561]8.3788,[562]8.3947,[563]8.3869,[564]8.3895,
Final estimate: PPL = 8.3895 +/- 0.06230

llama_print_timings:        load time =    2607.93 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  111317.76 ms / 288768 tokens (    0.39 ms per token,  2594.09 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  123032.41 ms / 288769 tokens

perplexity with -c 8192, single device

CUDA_VISIBLE_DEVICES=0 ./llama-perplexity -ngl 99 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw -fa -c 8192 -s 0
main: build = 3493 (7e72aa74)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from Meta-Llama-3-8B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 18
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/Meta-Llama-3-8B-Instruct-GGUF...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/groups_merged.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 88
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 6.14 GiB (6.56 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   410.98 MiB
llm_load_tensors:      CUDA0 buffer size =  5871.99 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 227.898 ms
perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=2048, n_seq=1
perplexity: 4.32 seconds per pass - ETA 2.52 minutes
[1]9.1824,[2]6.7775,[3]7.1044,[4]7.3373,[5]7.1329,[6]7.2487,[7]7.5710,[8]7.1893,[9]6.8932,[10]6.5769,[11]6.9091,[12]7.0030,[13]6.8735,[14]6.6412,[15]6.7508,[16]6.6200,[17]6.5796,[18]6.6744,[19]6.6567,[20]6.6483,[21]6.6188,[22]6.6871,[23]6.8173,[24]6.9087,[25]6.9748,[26]7.0204,[27]6.9758,[28]7.0248,[29]7.0052,[30]6.9629,[31]6.9656,[32]6.9595,[33]6.9854,[34]7.0747,[35]7.1012,
Final estimate: PPL = 7.1012 +/- 0.05100

llama_print_timings:        load time =    4389.31 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  135274.64 ms / 286720 tokens (    0.47 ms per token,  2119.54 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  150919.21 ms / 286721 tokens

perplexity with -c 8192, multi device

CUDA_VISIBLE_DEVICES=0,1 ./llama-perplexity -ngl 99 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw -fa -c 8192 -s 0
main: build = 3493 (7e72aa74)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from Meta-Llama-3-8B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 18
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/Meta-Llama-3-8B-Instruct-GGUF...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/groups_merged.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 88
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 6.14 GiB (6.56 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   410.98 MiB
llm_load_tensors:      CUDA0 buffer size =  3413.12 MiB
llm_load_tensors:      CUDA1 buffer size =  2458.87 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   640.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   184.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   322.52 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    72.02 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 3

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 229.205 ms
perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=2048, n_seq=1
perplexity: 3.58 seconds per pass - ETA 2.08 minutes
[1]3442.9673,[2]10336.3683,[3]15443.5686,[4]15998.9188,[5]18452.0495,[6]18443.6264,[7]19458.2455,[8]24798.3870,[9]26686.5635,[10]27926.4615,[11]25999.4890,[12]25119.5888,[13]24874.4490,[14]25837.0276,[15]26135.7007,[16]26395.1369,[17]26796.1002,[18]26673.7400,[19]26749.3320,[20]27077.7530,[21]27593.9652,[22]27544.3975,[23]27031.6756,[24]26395.0489,[25]26066.3652,[26]26163.4192,[27]26356.6977,[28]26136.6444,[29]26369.4260,[30]26857.9713,[31]27047.4020,[32]27054.4395,[33]27226.6587,[34]26779.8673,[35]26727.0947,
Final estimate: PPL = 26727.0947 +/- 628.43824

llama_print_timings:        load time =    1671.49 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  104190.55 ms / 286720 tokens (    0.36 ms per token,  2751.88 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  119689.88 ms / 286721 tokens

perplexity with -c 8192 -ts 50,50, multi device

The two cards have different amounts of VRAM, 16 GB and 12 GB respectively, so to split the tensors equally I have to set -ts explicitly.

CUDA_VISIBLE_DEVICES=0,1 ./llama-perplexity -ngl 99 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw -fa -c 8192 -s 0 -ts 50,50
main: build = 3493 (7e72aa74)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from Meta-Llama-3-8B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[...]
perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=2048, n_seq=1
perplexity: 3.95 seconds per pass - ETA 2.30 minutes
[1]9.1668,[2]6.7743,[3]7.0986,[4]7.3433,[5]7.1367,[6]7.2513,[7]7.5730,[8]7.1911,[9]6.8931,[10]6.5806,[11]6.9130,[12]7.0063,[13]6.8811,[14]6.6466,[15]6.7547,[16]6.6227,[17]6.5837,[18]6.6777,[19]6.6592,[20]6.6499,[21]6.6197,[22]6.6871,[23]6.8175,[24]6.9093,[25]6.9753,[26]7.0204,[27]6.9754,[28]7.0235,[29]7.0035,[30]6.9600,[31]6.9621,[32]6.9555,[33]6.9814,[34]7.0702,[35]7.0964,
Final estimate: PPL = 7.0964 +/- 0.05088

perplexity with -c 8192 -ts 16,12, multi device

I think this is equivalent to not setting the tensor split.

CUDA_VISIBLE_DEVICES=0,1 ./llama-perplexity -ngl 99 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw -fa -c 8192 -s 0 -ts 16,12
main: build = 3493 (7e72aa74)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 0
[...]
perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=2048, n_seq=1
perplexity: 3.63 seconds per pass - ETA 2.12 minutes
[1]3447.2149,[2]10341.8739,[3]15449.9300,[4]16004.6169,[5]18456.0563,[6]18449.2137,[7]19461.3013,[8]24806.9826,[9]26699.2306,[10]27934.9471,[11]26012.4039,[12]25130.7018,[13]24889.0574,[14]25854.9384,[15]26152.9036,[16]26410.5202,[17]26811.3885,[18]26689.5861,[19]26763.6813,[20]27092.3216,[21]27608.9161,[22]27560.7937,[23]27045.6467,[24]26408.3619,[25]26079.7973,[26]26176.2798,[27]26369.2740,[28]26149.1093,[29]26380.3555,[30]26869.3107,[31]27058.9436,[32]27066.4507,[33]27237.8933,[34]26790.7003,[35]26737.4969,
Final estimate: PPL = 26737.4969 +/- 628.70137

glm-4-9b, default context, (both single and multi device)

CUDA_VISIBLE_DEVICES=0,1 ./llama-perplexity -ngl 99 -m glm-4-9b-chat-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw  -s 0
main: build = 3493 (7e72aa74)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 24 key-value pairs and 283 tensors from /home/matteo/tmp/models_cache/glm-4-9b-chat-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = chatglm
llama_model_loader: - kv   1:                               general.name str              = glm-4-9b-chat
llama_model_loader: - kv   2:                     chatglm.context_length u32              = 131072
llama_model_loader: - kv   3:                   chatglm.embedding_length u32              = 4096
llama_model_loader: - kv   4:                chatglm.feed_forward_length u32              = 13696
llama_model_loader: - kv   5:                        chatglm.block_count u32              = 40
llama_model_loader: - kv   6:               chatglm.attention.head_count u32              = 32
llama_model_loader: - kv   7:            chatglm.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:   chatglm.attention.layer_norm_rms_epsilon f32              = 0.000000
llama_model_loader: - kv   9:                          general.file_type u32              = 18
llama_model_loader: - kv  10:               chatglm.rope.dimension_count u32              = 64
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                     chatglm.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = chatglm-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151073]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  20:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = [gMASK]<sop>{% for item in messages %...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q8_0:   40 tensors
llama_model_loader: - type q6_K:  122 tensors
llm_load_vocab: special tokens cache size = 223
llm_load_vocab: token to piece cache size = 0.9732 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = chatglm
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151552
llm_load_print_meta: n_merges         = 151073
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.6e-07
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 9.40 B
llm_load_print_meta: model size       = 7.69 GiB (7.03 BPW)
llm_load_print_meta: general.name     = glm-4-9b-chat
llm_load_print_meta: EOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151329 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 151336 '<|user|>'
llm_load_print_meta: max token length = 1024
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.43 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =   485.62 MiB
llm_load_tensors:      CUDA0 buffer size =  4141.37 MiB
llm_load_tensors:      CUDA1 buffer size =  3246.55 MiB
......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    48.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =    32.00 MiB
llama_new_context_with_model: KV self size  =   80.00 MiB, K (f16):   40.00 MiB, V (f16):   40.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.31 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   209.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   352.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.02 MiB
llama_new_context_with_model: graph nodes  = 1606
llama_new_context_with_model: graph splits = 3

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 825.084 ms
perplexity: calculating perplexity over 565 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 1.17 seconds per pass - ETA 2.75 minutes
[1]5.6779,[2]7.1003,[3]7.3976,[4]7.7992,[5]8.1160,[6]8.6335,[7]8.8864,[8]9.4016,[9]10.0815,[10]10.7221,[11]nan,[12]nan, [...] ,[564]nan,[565]nan,
Unexpected negative standard deviation of log(prob)

llama_print_timings:        load time =   21852.66 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  143138.30 ms / 289280 tokens (    0.49 ms per token,  2020.98 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  155753.23 ms / 289281 tokens
JohannesGaessler commented 2 months ago

As I asked before: are you consistently getting the exact same bad perplexity value every time or do you get different bad values? This is the most important piece of information that I need because it helps narrow down what parts of the code could be the problem.
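
For example, something along these lines would show it quickly (just a sketch reusing the flags from your logs; --chunks 1 is only there to keep each run short):

# Run the same single-chunk perplexity twice and compare the final estimates.
# If the bad value is reproduced bit-for-bit, the error is deterministic.
run_ppl() {
    ./llama-perplexity -ngl 99 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
        -f wikitext-2-raw/wiki.test.raw -fa -c 8192 -s 0 --chunks 1 2>&1 |
        grep 'Final estimate'
}
run_ppl > run1.txt
run_ppl > run2.txt
diff run1.txt run2.txt && echo "identical results" || echo "results differ between runs"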

matteoserva commented 2 months ago

Yes. Repeated runs of llama-perplexity with the same arguments consistently give the same perplexity values.

JohannesGaessler commented 2 months ago

Is it fixed by compiling with GGML_CUDA_FORCE_CUBLAS?

matteoserva commented 2 months ago

With GGML_CUDA_FORCE_CUBLAS I'm getting correct results in both llama-perplexity and llama-server.

Compile string:

make GGML_CUDA=1 GGML_CUDA_FORCE_CUBLAS=1 -j 6

Full log:

CUDA_VISIBLE_DEVICES=0,1 ./llama-perplexity -ngl 99 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw -fa -c 8192 -s 0
main: build = 3493 (7e72aa74)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from Meta-Llama-3-8B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 18
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/Meta-Llama-3-8B-Instruct-GGUF...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/groups_merged.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 88
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 6.14 GiB (6.56 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   410.98 MiB
llm_load_tensors:      CUDA0 buffer size =  3413.12 MiB
llm_load_tensors:      CUDA1 buffer size =  2458.87 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   640.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   184.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   322.52 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    72.02 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 3

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 232.696 ms
perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=2048, n_seq=1
perplexity: 4.96 seconds per pass - ETA 2.88 minutes
[1]9.1803,[2]6.7762,[3]7.1020,[4]7.3363,[5]7.1315,[6]7.2479,[7]7.5709,[8]7.1886,[9]6.8923,[10]6.5764,[11]6.9082,[12]7.0019,[13]6.8719,[14]6.6394,[15]6.7491,[16]6.6180,[17]6.5778,[18]6.6727,[19]6.6550,[20]6.6465,[21]6.6169,[22]6.6849,[23]6.8152,[24]6.9066,[25]6.9728,[26]7.0187,[27]6.9740,[28]7.0228,[29]7.0032,[30]6.9610,[31]6.9637,[32]6.9576,[33]6.9836,[34]7.0727,[35]7.0992,
Final estimate: PPL = 7.0992 +/- 0.05098

llama_print_timings:        load time =    1650.35 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  118066.75 ms / 286720 tokens (    0.41 ms per token,  2428.46 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  133573.47 ms / 286721 tokens
JohannesGaessler commented 2 months ago

I can reproduce getting consistently wrong results on 3x RTX 4090 with:

export CUDA_VISIBLE_DEVICES=0,1,2
export model_name=llama_3-8b && export quantization=q6_k
./llama-perplexity --file wikitext-2-raw/wiki.test.raw --n-gpu-layers 99 -fa --model models/opt/${model_name}-${quantization}.gguf --chunks 1 -c 8192

Notably in my case GGML_CUDA_FORCE_CUBLAS does not fix the issue. The results are correct if any of the following is done:

This makes me think that the problem has to do with pipeline parallelism and that for some inputs the wrong data is being copied.

slaren commented 2 months ago

I cannot reproduce this with 3090Ti + 3080. If you can point at the commit that broke it, I can try to find the issue.
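
If it helps, the bisect can be automated along these lines (a rough sketch; the good tag and repro.sh are placeholders, and the rebuild step inside repro.sh may need per-commit tweaks since the CUDA make flag was renamed within the suspect range):

# repro.sh would rebuild llama-perplexity at the checked-out commit, run the
# failing single-chunk case, and exit non-zero when the perplexity blows up
# (~26000 instead of ~7 in the logs above).
git bisect start
git bisect bad HEAD
git bisect good b2400        # placeholder: any tag/commit known to give ~7 PPL
git bisect run ./repro.sh
git bisect reset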

JohannesGaessler commented 2 months ago

According to git bisect:

f30ea47a87ed4446ad55adb265755dc9102956a2 is the first bad commit
commit f30ea47a87ed4446ad55adb265755dc9102956a2 (HEAD, tag: b2413)
Author: slaren <slarengh@gmail.com>
Date:   Wed Mar 13 18:54:21 2024 +0100

    llama : add pipeline parallelism support (#6017)

    * llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

    ggml-ci

    * server : add -ub, --ubatch-size parameter

    * fix server embedding test

    * llama : fix Mamba inference for pipeline parallelism

    Tested to work correctly with both `main` and `parallel` examples.

    * llama : limit max batch size to n_batch

    * add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
    default increase to 4 (from 2)

    changing this value may improve performance for some systems, but increases memory usage

    * fix hip build

    * fix sycl build (disable cpy_tensor_async)

    * fix hip build

    * llama : limit n_batch and n_ubatch to n_ctx during context creation

    * llama : fix norm backend

    * batched-bench : sync after decode

    * swiftui : sync after decode

    * ggml : allow ggml_get_rows to use multiple threads if they are available

    * check n_ubatch >= n_tokens with non-casual attention

    * llama : do not limit n_batch to n_ctx with non-casual attn

    * server : construct batch with size of llama_n_batch

    * ggml_backend_cpu_graph_compute : fix return value when alloc fails

    * llama : better n_batch and n_ubatch comment

    * fix merge

    * small fix

    * reduce default n_batch to 2048

    ---------

    Co-authored-by: Francis Couture-Harpin <git@compilade.net>
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

 CMakeLists.txt                                        |    3 +
 Makefile                                              |    4 +
 common/common.cpp                                     |   14 +-
 common/common.h                                       |    3 +-
 examples/batched-bench/batched-bench.cpp              |    2 +
 examples/embedding/embedding.cpp                      |    2 +-
 examples/llama-bench/llama-bench.cpp                  |   53 +++++-
 examples/llama.swiftui/llama.cpp.swift/LibLlama.swift |    2 +
 examples/perplexity/perplexity.cpp                    |    3 +-
 examples/server/server.cpp                            |   32 +++-
 examples/server/tests/features/embeddings.feature     |    1 +
 examples/server/tests/features/steps/steps.py         |    8 +
 ggml-alloc.c                                          |  109 +++++------
 ggml-alloc.h                                          |   18 +-
 ggml-backend-impl.h                                   |   17 +-
 ggml-backend.c                                        |  493 +++++++++++++++++++++++++++++++++++--------------
 ggml-backend.h                                        |   58 ++++--
 ggml-cuda.cu                                          |  175 +++++++++++++++---
 ggml-kompute.cpp                                      |    5 +
 ggml-metal.m                                          |    5 +
 ggml-sycl.cpp                                         |    7 +-
 ggml-vulkan.cpp                                       |    5 +
 ggml.c                                                |  113 +++++++-----
 llama.cpp                                             | 1131 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------
 llama.h                                               |    9 +-
 25 files changed, 1426 insertions(+), 846 deletions(-)

So it does seem to be caused by pipeline parallelism. This bug is very finicky to reproduce; it seems to occur only for some specific combinations of input parameters, but when it does occur the results are wrong in a consistent way. So far I have only been able to reproduce it with 3 or more GPUs; if you're not able to reproduce the issue on your own machine, I can maybe give you SSH access to my machine with 6x RTX 4090.
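
One way to test that hypothesis while keeping everything else identical might be to rebuild with a single scheduler copy (a sketch; LLAMA_SCHED_MAX_COPIES is the build option named in the commit message above, and I'm assuming it is still accepted under that name or with a GGML_ prefix):

# With only one input copy the scheduler cannot overlap ubatches across GPUs,
# which should take pipeline parallelism out of the picture.
make clean
make GGML_CUDA=1 LLAMA_SCHED_MAX_COPIES=1 -j 6

# Re-run the failing case; if the log now reports n_copies=1 and the
# perplexity drops back to ~7, that points at the multi-copy path.
./llama-perplexity --file wikitext-2-raw/wiki.test.raw --n-gpu-layers 99 -fa \
    --model models/opt/llama_3-8b-q6_k.gguf --chunks 1 -c 8192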

slaren commented 2 months ago

I tried using the exact same model, the exact same split, etc., as @matteoserva reported with 2 GPUs, but I still cannot reproduce this. From the perspective of the pipeline parallelism implementation both cases should be identical, and CUBLAS vs. MMQ shouldn't matter either. So it must be either a synchronization issue, or a CUDA or hardware issue. Can you do some basic tests to exclude the possibility of a hardware issue? For example, set a very low power limit, or try with a different combination of GPUs. If that still doesn't work, I will take your offer to use your machine.

JohannesGaessler commented 2 months ago

> Can you do some basic tests to exclude the possibility of a hardware issue? For example, set a very low power limit

If there are stability issues related to power, they usually cannot be fixed by setting a lower power limit. The way I understand it, random bit flips occur when power spikes from multiple GPUs happen to align, causing a voltage drop. A lower power limit set in software does not reduce these power spikes; it only lowers the average power consumption over a comparatively long timescale. However, the stability issues from power spikes can be fixed by instead capping the maximum boost clocks, which reduces the maximum power consumption. When I did the testing, I did so with the clocks limited to 1000 MHz. I would assume that stability issues would cause the results to be wrong in an inconsistent way, though.
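
For reference, both kinds of limit can be set with nvidia-smi roughly like this (a sketch; the values are examples, both commands need root, and the -i index has to be repeated for each card):

# Power limit: lowers the average draw but does little against short spikes.
sudo nvidia-smi -i 0 -pl 200         # example: limit GPU 0 to 200 W

# Clock cap: limits the maximum boost clock, which also bounds the spikes.
sudo nvidia-smi -i 0 -lgc 0,1000     # lock GPU 0 clocks to at most 1000 MHz
# ... run the failing workload ...
sudo nvidia-smi -i 0 -rgc            # restore the default clock behaviour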

> or try with a different combination of GPUs

I have one machine with 6x RTX 4090 and one with 3x P40. Using the latest master commit compiled with GGML_CUDA_FORCE_CUBLAS, both machines produce incorrect results with the following parameters when using 3 equal GPUs:

export model_name=llama_3-8b && export quantization=q6_k
./llama-perplexity --file wikitext-2-raw/wiki.test.raw --n-gpu-layers 99 --model models/opt/${model_name}-${quantization}.gguf --chunks 1 -c 8192 -fa
slaren commented 2 months ago

@JohannesGaessler I still cannot reproduce the issue. I can try to debug it on your machine; if that's OK, please send me a message at slarengh@gmail.com with the details.