ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: flash attention produces corrupt output with batched generation #7903

nytopop closed this issue 3 months ago

nytopop commented 3 months ago

What happened?

When flash attention is enabled, generating parallel sequences in a batch produces corrupt output on some GPUs. It seems to happen only at certain batch sizes. I've tested with llama3 8b, mistral 7b, and qwen2 0.5b; all show similar corruption.

To reproduce: `./parallel -ngl 400 -np 3 -ns 10 -fa -m some-model.gguf`.

Batch sizes and backends I've tried, and whether the output is corrupted:

| `-np <n>` | has corruption |
| --- | --- |
| 1, 2 | no |
| 3, 4, 5, 6, 7, 8 | yes |
| 9, 10, ..., 25 | no |
| > 25 | didn't test |

| backend | has corruption |
| --- | --- |
| cpu | no |
| cuda w/ gtx 1080 | yes |
| rocm w/ mi60 | no |
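
For reference, here is a minimal sketch of the usage pattern the `parallel` example exercises: several independent sequences decoded through a single `llama_batch` with flash attention enabled, via the llama.cpp C API. The model path, layer count, and the single BOS token per sequence are illustrative placeholders, not the example's actual code (the real driver is `examples/parallel/parallel.cpp`).

```cpp
// Minimal sketch (not the actual parallel example) of the failing usage
// pattern: several independent sequences decoded through one llama_batch
// with flash attention enabled. The model path is a placeholder.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                    // offload all layers, as with -ngl 400

    llama_model * model = llama_load_model_from_file("some-model.gguf", mparams);
    if (model == NULL) { return 1; }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 8192;
    cparams.flash_attn = true;                    // what the -fa flag toggles

    llama_context * ctx = llama_new_context_with_model(model, cparams);

    const int n_clients = 3;                      // -np 3: three sequences in flight

    // one batch carrying one token per sequence; in the real example every
    // sequence also shares a common system prompt in the KV cache
    llama_batch batch = llama_batch_init(n_clients, 0, 1);
    for (int s = 0; s < n_clients; ++s) {
        const int i = batch.n_tokens;
        batch.token   [i]    = llama_token_bos(model); // placeholder token
        batch.pos     [i]    = 0;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = s;                 // each token belongs to its own sequence
        batch.logits  [i]    = true;
        batch.n_tokens++;
    }

    // with flash_attn enabled and a batch of 3-8 sequences, this is the kind
    // of decode step that yields corrupted logits on the GTX 1080 above
    llama_decode(ctx, batch);

    llama_batch_free(batch);
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```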

Name and Version

$ ./parallel --version
version: 3138 (704a35b1)
built with cc (Debian 13.2.0-25) 13.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

$ ./parallel -m /home/eric/mnt/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -ngl 400 -np 3 -ns 10 -fa -ngl 400
Log start
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/eric/mnt/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/Meta-Llama-3-8B-Instruct-GGUF...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/groups_merged.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 88
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      CUDA0 buffer size =  4403.49 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.96 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 2

No new questions so proceed with build-in defaults.

main: Simulating parallel requests from clients:
main: n_parallel = 3, n_sequences = 10, cont_batching = 1, system tokens = 259

main: Evaluating the system prompt ...

Processing requests ...

main: clearing the KV cache
Client   0, seq    0, started decoding ...
Client   1, seq    1, started decoding ...
Client   2, seq    2, started decoding ...
Client   0, seq   0/ 10, prompt    9 t, response    3 t, time  0.59 s, speed 20.40 t/s, cache miss 0
Input:    What is the meaning of life?
Response: This
P

Client   0, seq    3, started decoding ...
Client   0, seq   3/ 10, prompt   20 t, response    3 t, time  0.18 s, speed 125.59 t/s, cache miss 0
Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes

Client   0, seq    4, started decoding ...
Client   0, seq   4/ 10, prompt    9 t, response    4 t, time  0.21 s, speed 60.84 t/s, cache miss 0
Input:    Recommend some interesting books to read.
Response: I- S

Client   0, seq    5, started decoding ...
Client   0, seq   5/ 10, prompt   20 t, response    3 t, time  0.18 s, speed 127.55 t/s, cache miss 0
Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes

...

Client   0, seq    6, started decoding ...
Client   0, seq   6/ 10, prompt   15 t, response    3 t, time  0.18 s, speed 100.19 t/s, cache miss 0
Input:    If you could have any superpower, what would it be?
Response: I
...S

Client   0, seq    7, started decoding ...
Client   1, seq   1/ 10, prompt   12 t, response   23 t, time  1.49 s, speed 23.56 t/s, cache miss 0
Input:    What is the best way to cook a steak?
Response: The best way to cook a steak is to grill it. Here's a simple recipe to achieve a perfectly cooked steak:

Client   1, seq    8, started decoding ...
Client   0, seq   7/ 10, prompt   12 t, response    5 t, time  0.27 s, speed 64.05 t/s, cache miss 0
Input:    I want to learn how to play the piano.
Response: The [...] ... T

Client   0, seq    9, started decoding ...
Client   0, seq   9/ 10, prompt    9 t, response    3 t, time  0.17 s, speed 69.66 t/s, cache miss 0
Input:    Recommend some interesting books to read.
Response: I…H

Client   2, seq   2/ 10, prompt    9 t, response  100 t, time  4.26 s, speed 25.59 t/s, cache miss 0
Input:    What is the meaning of life?
Response: This is a question that has puzzled philosophers and scientists for centuries. There is no one definitive answer, but many people believe that the meaning of life is to find happiness, fulfillment, and purpose through our experiences and relationships. Others believe that the meaning of life is to seek out knowledge and understanding, and to use our talents and abilities to make a positive impact on the world. Ultimately, the meaning of life is a personal and subjective question, and it is up to each individual to find their own answer.

Client   1, seq   8/ 10, prompt   13 t, response   84 t, time  3.03 s, speed 32.00 t/s, cache miss 0
Input:    What is the best way to learn a new language?
Response: There are several ways to learn a new language, but the best way is through immersion. Immersion is the process of surrounding yourself with the language you want to learn, such as by living in a country where the language is spoken, listening to music and watching movies in the language, and speaking with native speakers. You can also use language learning apps, such as Duolingo, and practice with a language exchange partner.

main: clearing the KV cache

run parameters as at 2024-06-12 05:33:13

main: n_parallel = 3, n_sequences = 10, cont_batching = 1, system tokens = 259
External prompt file: used built-in defaults
Model and path used:  /home/eric/mnt/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf

Total prompt tokens:    128, speed: 28.28 t/s
Total gen tokens:       231, speed: 51.04 t/s
Total speed (AVG):           speed: 79.31 t/s
Cache misses:             0

llama_print_timings:        load time =     950.53 ms
llama_print_timings:      sample time =      23.11 ms /   241 runs   (    0.10 ms per token, 10427.03 tokens per second)
llama_print_timings: prompt eval time =    4095.15 ms /   610 tokens (    6.71 ms per token,   148.96 tokens per second)
llama_print_timings:        eval time =     253.14 ms /     8 runs   (   31.64 ms per token,    31.60 tokens per second)
llama_print_timings:       total time =    4526.35 ms /   618 tokens
ggerganov commented 3 months ago

Does it depend on the quant type? Does it happen with F16 models?

nytopop commented 3 months ago

I've tried Q4_K_M, Q8_0, and fp16. They all show the same issue.

ggerganov commented 3 months ago

Not reproduced with RTX 2060:

LLAMA_CUDA=1 make -j && ./parallel -ngl 400 -np 3 -ns 100 -fa -m models/llama-8b-v3-instruct/ggml-model-q4_k.gguf
```
I ccache found, compilation results will be cached. Disable with LLAMA_NO_CCACHE.
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS
I NVCCFLAGS: -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib
I CC: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX: c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I NVCC: Build cuda_12.2.r12.2/compiler.33191640_0
make: Nothing to be done for 'default'.
Log start
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/llama-8b-v3-instruct/ggml-model-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 281.81 MiB
llm_load_tensors: CUDA0 buffer size = 4403.49 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.96 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 258.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 2

No new questions so proceed with build-in defaults.

main: Simulating parallel requests from clients:
main: n_parallel = 3, n_sequences = 100, cont_batching = 1, system tokens = 259

main: Evaluating the system prompt ...

Processing requests ...

main: clearing the KV cache
Client 0, seq 0, started decoding ...
Client 1, seq 1, started decoding ...
Client 2, seq 2, started decoding ...
Client 2, seq 2/100, prompt 9 t, response 104 t, time 3.28 s, speed 34.47 t/s, cache miss 0
Input:    What is the meaning of life?
Response: This is a question that has been debated by philosophers, scientists, and theologians for centuries. There is no one definitive answer, as the meaning of life is a deeply personal and subjective issue. Some people believe that the meaning of life is to seek happiness, while others believe that it is to find purpose and fulfillment. Some people believe that the meaning of life is to help others, while others believe that it is to learn and grow. Ultimately, the meaning of life is a question that each individual must answer for themselves.

Client 2, seq 3, started decoding ...
Client 0, seq 0/100, prompt 9 t, response 108 t, time 3.41 s, speed 34.35 t/s, cache miss 0
Input:    What is the meaning of life?
Response: Ah, a classic question! The meaning of life is a question that has puzzled philosophers and scholars for centuries. While there is no definitive answer, many people believe that the meaning of life is to find happiness, fulfillment, and purpose. It is to live a life of integrity, compassion, and kindness towards others. It is to pursue one's passions and interests, and to make a positive impact on the world. Some people also believe that the meaning of life is to find meaning in the present moment, rather than in the past or future.

Client 0, seq 4, started decoding ...
Client 1, seq 1/100, prompt 12 t, response 142 t, time 4.41 s, speed 34.88 t/s, cache miss 0
Input:    What is the best way to cook a steak?
Response: Cooking a steak can be a bit tricky, but I can give you some general tips. First, make sure you choose a good quality steak. Then, season it with some salt, pepper, and any other seasonings you like. Next, heat up a pan with some oil in it, and add the steak. Cook it for about 3-4 minutes per side, or until it reaches your desired level of doneness. You can also use a thermometer to check the internal temperature of the steak. For medium-rare, the temperature should be around 130°F (54°C), for medium, around 140°F (60°C), and for well-done, around 160°F (71°C).

Client 1, seq 5, started decoding ...
Client 2, seq 3/100, prompt 20 t, response 82 t, time 2.48 s, speed 41.17 t/s, cache miss 0
Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes, I am familiar with the Special Theory of Relativity. It is a fundamental concept in physics that describes the relationship between space and time. In essence, the theory states that the laws of physics are the same for all observers, regardless of their relative motion. It also introduces the concept of time dilation, which is the phenomenon where time appears to slow down for an observer in motion relative to a stationary observer.

Client 2, seq 6, started decoding ...
Client 1, seq 5/100, prompt 20 t, response 75 t, time 2.27 s, speed 41.93 t/s, cache miss 0
Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes, I am familiar with the Special Theory of Relativity. The Special Theory of Relativity was developed by Albert Einstein in 1905. It posits that the laws of physics are the same for all observers, regardless of their relative motion. This theory challenged the long-held notion that time and space are absolute and fixed, and it revolutionized our understanding of the universe.

Client 1, seq 7, started decoding ...
Client 0, seq 4/100, prompt 9 t, response 111 t, time 3.34 s, speed 35.96 t/s, cache miss 0
Input:    Recommend some interesting books to read.
Response: I recommend the following books: "The Hitchhiker's Guide to the Galaxy" by Douglas Adams, "The Three-Body Problem" by Liu Cixin, "The Nightingale" by Kristin Hannah, "The Handmaid's Tale" by Margaret Atwood, and "Sapiens: A Brief History of Humankind" by Yuval Noah Harari. These books are all highly rated and have been praised for their unique perspectives and engaging writing styles. They are also relatively short, so you can easily finish them in a few days.

Client 0, seq 8, started decoding ...
Client 1, seq 7/100, prompt 12 t, response 23 t, time 0.72 s, speed 48.87 t/s, cache miss 0
Input:    I want to learn how to play the piano.
Response: Learning to play the piano can be a rewarding and enjoyable experience. Here are some steps you can follow to get started:

Client 1, seq 9, started decoding ...
Client 1, seq 9/100, prompt 9 t, response 6 t, time 0.21 s, speed 71.49 t/s, cache miss 0
Input:    Recommend some interesting books to read.
Response: Here are a few book recommendations:

Client 1, seq 10, started decoding ...
Client 2, seq 6/100, prompt 15 t, response 84 t, time 2.54 s, speed 39.03 t/s, cache miss 0
Input:    If you could have any superpower, what would it be?
Response: That is a great question! I think I would choose the power of teleportation. I would be able to instantly transport myself to any place in the world, and I would be able to help people in need by transporting them to places where they can get the help they need. I would also be able to teleport objects, which would be very useful for many tasks. It would be a very convenient and powerful superpower.

Client 2, seq 11, started decoding ...
Client 0, seq 8/100, prompt 13 t, response 87 t, time 2.61 s, speed 38.25 t/s, cache miss 0
Input:    What is the best way to learn a new language?
Response: The best way to learn a new language is through immersion and practice. Start by learning the basics of the language, such as the alphabet, grammar, and common phrases. Then, practice speaking and listening by conversing with native speakers or watching TV shows and movies in the target language. You can also use language learning apps and online resources to supplement your learning. Additionally, try to read and write in the target language as much as possible.

Client 0, seq 12, started decoding ...
Client 2, seq 11/100, prompt 10 t, response 84 t, time 2.53 s, speed 37.21 t/s, cache miss 0
Input:    How to get a job at Google?
Response: To get a job at Google, you should have a bachelor's degree in a relevant field such as computer science, engineering, or mathematics. You should also have at least 1-2 years of relevant work experience. You can check out Google's job listings and apply for positions that match your skills and experience. It's also important to note that Google is a very competitive company and only accepts a small percentage of applicants.

Client 2, seq 13, started decoding ...
Client 0, seq 12/100, prompt 11 t, response 68 t, time 2.05 s, speed 38.50 t/s, cache miss 0
Input:    Tell me an interesting fact about llamas.
Response: Here's one: Llamas are able to spit as a form of defense, but they also have a unique way of communicating with each other using a variety of sounds, including humming, whistling, and grunting. They are also known for their soft and woolly coats, which are highly prized for their warmth and softness.

Client 0, seq 14, started decoding ...
Client 2, seq 13/100, prompt 12 t, response 76 t, time 2.29 s, speed 38.40 t/s, cache miss 0
Input:    What is the best way to cook a steak?
Response: There are several ways to cook a steak, but some methods are better than others. The best way to cook a steak is to grill it. Grilling a steak allows you to achieve a nice crust on the outside while keeping the inside juicy and tender. You can also cook a steak in a pan or oven, but grilling is the best way to get that perfect sear.

Client 2, seq 15, started decoding ...
Client 1, seq 10/100, prompt 10 t, response 195 t, time 5.83 s, speed 35.14 t/s, cache miss 0
Input:    How to get a job at Google?
Response: Getting a job at Google is a challenging but rewarding experience. Here are some tips to help you increase your chances of getting hired:1. Meet the basic requirements: Google looks for candidates who have a bachelor's degree in a relevant field, such as computer science, engineering, or a related field.2. Develop a strong online presence: Google uses search engines to find candidates, so make sure your online presence is professional and showcases your skills.3. Build a strong network: Networking is key to getting a job at Google. Attend industry events, join professional organizations, and connect with people in your field.4. Show your skills: Google looks for candidates who have demonstrated their skills and abilities through projects, portfolios, and other forms of evidence.5. Be persistent: It may take several attempts before you get hired by Google, so don't be discouraged if you don't get a job right away. Keep applying and networking, and eventually, you will get your chance.

Client 1, seq 16, started decoding ...
Client 2, seq 15/100, prompt 13 t, response 25 t, time 0.79 s, speed 48.33 t/s, cache miss 0
Input:    What is the best way to learn a new language?
Response: The best way to learn a new language is through immersion and practice. Here are some tips that have been found to be helpful:

Client 2, seq 17, started decoding ...
Client 0, seq 14/100, prompt 9 t, response 103 t, time 3.11 s, speed 35.97 t/s, cache miss 0
Input:    What is the meaning of life?
Response: Ah, a question that has puzzled philosophers and theologians for centuries! While there may not be a definitive answer, I can offer some insights. The meaning of life is often subjective and personal, and can be influenced by various factors such as culture, upbringing, and experiences. Some people find meaning in their work, relationships, or spiritual practices. Others may find meaning in personal achievements, helping others, or contributing to society. Ultimately, the meaning of life is a mystery that each person must explore and discover for themselves.
```
JohannesGaessler commented 3 months ago

I believe I have found the issue; please confirm whether this fix works: https://github.com/ggerganov/llama.cpp/pull/7904

nytopop commented 3 months ago

Probably limited to cards without tensor cores, then. Looking at `ggml-cuda/fattn.cu`: https://github.com/ggerganov/llama.cpp/blob/a9cae48003dfc4fe95b8f5c81682fc6e63425235/ggml-cuda/fattn.cu#L316-L323

I see corrupt output specifically when `ggml_cuda_flash_attn_ext_vec_f32` is chosen. The tile variant seems fine.

On the ROCm machine I'm testing on, `ggml_cuda_flash_attn_ext_vec_f16` is used, and that produces normal output.
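
For readers without the source open, the kernel selection at those lines can be sketched as the stand-alone program below. This is a paraphrase for illustration only: the predicates, thresholds, and names are simplified stand-ins inferred from the linked code and the batch-size table above, and the separate ROCm/AMD branch is omitted.

```cpp
// Stand-alone paraphrase of the flash-attention kernel choice in
// ggml-cuda/fattn.cu; predicates and thresholds are simplified stand-ins,
// not the real ggml code.
#include <cstdio>

enum fattn_kernel { VEC_F32, TILE_F32, VEC_F16, TILE_F16, WMMA_F16 };

// rough stand-ins for ggml's fast_fp16_available() / fp16_mma_available();
// GTX 1080 (CC 6.1) lacks fast fp16, RTX 2060 (CC 7.5) has tensor cores
static bool fast_fp16_available(int cc) { return cc >= 60 && cc != 61; }
static bool fp16_mma_available (int cc) { return cc >= 70; }

static fattn_kernel pick_kernel(int cc, int n_tokens /* Q->ne[1] */) {
    if (!fast_fp16_available(cc)) {
        // GTX 1080 path: small batches hit the f32 vec kernel that showed the
        // corruption; larger batches fall back to the tile kernel that looked
        // fine (the <= 8 cutoff is inferred from the 3-8 vs 9-25 pattern above)
        return n_tokens <= 8 ? VEC_F32 : TILE_F32;
    }
    if (!fp16_mma_available(cc)) {
        return n_tokens <= 8 ? VEC_F16 : TILE_F16;
    }
    return WMMA_F16; // tensor-core path, e.g. the RTX 2060 that did not reproduce
}

int main() {
    printf("cc 6.1, 3 tokens -> kernel %d\n", pick_kernel(61, 3)); // VEC_F32
    printf("cc 6.1, 9 tokens -> kernel %d\n", pick_kernel(61, 9)); // TILE_F32
    printf("cc 7.5, 3 tokens -> kernel %d\n", pick_kernel(75, 3)); // WMMA_F16
    return 0;
}
```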

nytopop commented 3 months ago

> I believe I have found the issue; please confirm whether this fix works: #7904

Does seem to fix it, yes.