ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Error when offloading falcon mamba layers on GPU #9932

Open vineel96 opened 5 days ago

vineel96 commented 5 days ago

What happened?

Hello, I have encountered an error when I run llama.cpp with the Falcon Mamba model and its layers are offloaded to the GPU.

Steps to reproduce:

  1. git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
  2. mkdir build && cd build
  3. cmake .. -DGGML_CUDA=ON
  4. make GGML_CUDA=1
  5. model: falcon-mamba-7B-BF16.gguf (https://huggingface.co/tiiuae/falcon-mamba-7b-BF16-GGUF/tree/main)
  6. command: ./build/bin/llama-cli -m ../gguf_files/falcon-mamba-7B-BF16.gguf --ctx-size 2048 --n_predict 50 --prompt "There are two persons named ram and krishna" -ngl 64

Observations:

  1. When run with the -ngl option, I get a "GGML_ASSERT(ggml_is_contiguous(src0)) failed" error (see the relevant log output below; a minimal sketch of what this assertion checks follows this list).
  2. When run without the -ngl option, it runs fine, and it still appears to use the GPU (the process is visible in nvidia-smi), but generation is very slow (only ~0.8 tokens per second for eval time).
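
For context, here is a toy standalone sketch of what the contiguity assertion checks (simplified; not the actual ggml source): a tensor counts as contiguous when its byte strides describe a densely packed row-major layout, which views created by permute/transpose do not satisfy.

    /* Toy illustration (not the actual ggml implementation) of what
       GGML_ASSERT(ggml_is_contiguous(src0)) checks: the byte strides nb[]
       must describe a densely packed row-major layout for the element
       counts ne[]. Views created by permute/transpose keep their old
       strides, so they fail the check and trip the assert. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct toy_tensor {
        long   ne[4];   /* elements per dimension */
        size_t nb[4];   /* stride in bytes per dimension */
        size_t type_sz; /* bytes per element (4 for f32, 2 for f16/bf16) */
    };

    static bool toy_is_contiguous(const struct toy_tensor *t) {
        return t->nb[0] == t->type_sz &&
               t->nb[1] == t->nb[0] * (size_t) t->ne[0] &&
               t->nb[2] == t->nb[1] * (size_t) t->ne[1] &&
               t->nb[3] == t->nb[2] * (size_t) t->ne[2];
    }

    int main(void) {
        /* 8x4 f32 tensor stored densely -> contiguous */
        struct toy_tensor dense      = { {8, 4, 1, 1}, { 4, 32, 128, 128}, 4 };
        /* the same data viewed transposed (strides swapped) -> not contiguous */
        struct toy_tensor transposed = { {4, 8, 1, 1}, {32,  4, 128, 128}, 4 };
        printf("dense: %d, transposed: %d\n",
               toy_is_contiguous(&dense), toy_is_contiguous(&transposed));
        return 0;
    }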

Is there support for Falcon Mamba on GPU?

Name and Version

version: 3902 (c81f3bbb) built with cc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3) for aarch64-redhat-linux

What operating system are you seeing the problem on?

Linux

Relevant log output

./build/bin/llama-cli -m ../gguf_files/falcon-mamba-7B-BF16.gguf --ctx-size 2048 --n_predict 50  --prompt "There are two persons named ram and krishna" -ngl 64
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
build: 3902 (c81f3bbb) with cc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3) for aarch64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 31 key-value pairs and 643 tensors from ../gguf_files/falcon-mamba-7B-BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mamba
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Falcon Mamba 7b
llama_model_loader: - kv   3:                           general.basename str              = falcon-mamba
llama_model_loader: - kv   4:                         general.size_label str              = 7B
llama_model_loader: - kv   5:                            general.license str              = other
llama_model_loader: - kv   6:                       general.license.name str              = falcon-mamba-7b-license
llama_model_loader: - kv   7:                       general.license.link str              = https://falconllm.tii.ae/falcon-mamba...
llama_model_loader: - kv   8:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   9:                           general.datasets arr[str,2]       = ["tiiuae/falcon-refinedweb", "Hugging...
llama_model_loader: - kv  10:                       mamba.context_length u32              = 1048576
llama_model_loader: - kv  11:                     mamba.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  mamba.feed_forward_length u32              = 0
llama_model_loader: - kv  13:                 mamba.attention.head_count u32              = 0
llama_model_loader: - kv  14:                          mamba.block_count u32              = 64
llama_model_loader: - kv  15:                      mamba.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  16:                       mamba.ssm.inner_size u32              = 8192
llama_model_loader: - kv  17:                       mamba.ssm.state_size u32              = 16
llama_model_loader: - kv  18:                   mamba.ssm.time_step_rank u32              = 256
llama_model_loader: - kv  19:     mamba.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  20:                       mamba.ssm.dt_b_c_rms bool             = true
llama_model_loader: - kv  21:                          general.file_type u32              = 32
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = falcon
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,65024]   = [">>TITLE<<", ">>ABSTRACT<<", ">>INTR...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,65024]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,64784]   = ["Ġ t", "Ġ a", "i n", "h e", "r e",...
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 11
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 11
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  513 tensors
llama_model_loader: - type bf16:  130 tensors
llm_load_vocab: special tokens cache size = 12
llm_load_vocab: token to piece cache size = 0.3884 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = mamba
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 65024
llm_load_print_meta: n_merges         = 64784
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 1048576
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 64
llm_load_print_meta: n_head           = 0
llm_load_print_meta: n_head_kv        = 0
llm_load_print_meta: n_rot            = 0
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 0
llm_load_print_meta: n_embd_head_v    = 0
llm_load_print_meta: n_gqa            = 0
llm_load_print_meta: n_embd_k_gqa     = 0
llm_load_print_meta: n_embd_v_gqa     = 0
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 0
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = -1
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 1048576
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 4
llm_load_print_meta: ssm_d_inner      = 8192
llm_load_print_meta: ssm_d_state      = 16
llm_load_print_meta: ssm_dt_rank      = 256
llm_load_print_meta: ssm_dt_b_c_rms   = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = BF16
llm_load_print_meta: model params     = 7.27 B
llm_load_print_meta: model size       = 14.10 GiB (16.65 BPW) 
llm_load_print_meta: general.name     = Falcon Mamba 7b
llm_load_print_meta: BOS token        = 0 '>>TITLE<<'
llm_load_print_meta: EOS token        = 11 '<|endoftext|>'
llm_load_print_meta: PAD token        = 11 '<|endoftext|>'
llm_load_print_meta: LF token         = 138 'Ä'
llm_load_print_meta: EOT token        = 11 '<|endoftext|>'
llm_load_print_meta: EOG token        = 11 '<|endoftext|>'
llm_load_print_meta: max token length = 130
llm_load_tensors: ggml ctx size =    0.59 MiB
llm_load_tensors: offloading 64 repeating layers to GPU
llm_load_tensors: offloaded 64/65 layers to GPU
llm_load_tensors:        CPU buffer size = 14439.02 MiB
llm_load_tensors:      CUDA0 buffer size = 13423.00 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    38.00 MiB
llama_new_context_with_model: KV self size  =   38.00 MiB, K (f32):    6.00 MiB, V (f32):   32.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.25 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    80.56 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   168.00 MiB
llama_new_context_with_model: graph nodes  = 3334
llama_new_context_with_model: graph splits = 517
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/rocky/vineel/llama.cpp/ggml/src/ggml-cuda/norm.cu:212: GGML_ASSERT(ggml_is_contiguous(src0)) failed
[New LWP 3039501]
[New LWP 3039508]
[New LWP 3039512]
[New LWP 3039545]
[New LWP 3039546]
[New LWP 3039547]
[New LWP 3039548]
[New LWP 3039549]
[New LWP 3039550]
[New LWP 3039551]
[New LWP 3039552]
[New LWP 3039553]
[New LWP 3039554]
[New LWP 3039555]
[New LWP 3039556]
[New LWP 3039557]
[New LWP 3039558]
[New LWP 3039559]
[New LWP 3039560]
[New LWP 3039561]
[New LWP 3039562]
[New LWP 3039563]
[New LWP 3039564]
[New LWP 3039565]
[New LWP 3039566]
[New LWP 3039567]
[New LWP 3039568]
[New LWP 3039569]
[New LWP 3039570]
[New LWP 3039571]
[New LWP 3039572]
[New LWP 3039573]
[New LWP 3039574]
[New LWP 3039575]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x0000ffff70efc2c0 in __GI___wait4 (pid=<optimized out>, stat_loc=0xffffc1d61e1c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30        return SYSCALL_CANCEL (wait4, pid, stat_loc, options, usage);
#0  0x0000ffff70efc2c0 in __GI___wait4 (pid=<optimized out>, stat_loc=0xffffc1d61e1c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30        return SYSCALL_CANCEL (wait4, pid, stat_loc, options, usage);
#1  0x0000ffff7143a148 in ggml_abort () from /home/rocky/vineel/llama.cpp/build/ggml/src/libggml.so
#2  0x0000ffff714d091c in ggml_cuda_op_rms_norm(ggml_backend_cuda_context&, ggml_tensor*) () from /home/rocky/vineel/llama.cpp/build/ggml/src/libggml.so
#3  0x0000ffff714e6fdc in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/rocky/vineel/llama.cpp/build/ggml/src/libggml.so
#4  0x0000ffff714766a4 in ggml_backend_sched_graph_compute_async () from /home/rocky/vineel/llama.cpp/build/ggml/src/libggml.so
#5  0x0000ffff86dbeb7c in llama_decode () from /home/rocky/vineel/llama.cpp/build/src/libllama.so
#6  0x000000000045986c in llama_init_from_gpt_params(gpt_params&) ()
#7  0x000000000041000c in main ()
[Inferior 1 (process 3039500) detached]
Aborted (core dumped)
danbev commented 3 days ago

I think this is because of the lack of support for BF16 ops on the GPU, as mentioned in https://github.com/ggerganov/llama.cpp/issues/9881#issuecomment-2414516092. Perhaps you can try an F16 model instead.

vineel96 commented 3 days ago

@danbev, even the FP16 Falcon Mamba model suffers from the same error. LLaMA-3 FP16 and BF16 do not give the above error when using the -ngl option.

slaren commented 3 days ago

This happens because the CUDA backend does not support the norm operation on non-contiguous tensors, and it does not report this correctly in its supports_op function. This should be fixed; however, the CUDA backend also does not support the Mamba-specific operations, so there will be no benefit to offloading Mamba models until these are implemented.
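
Roughly, the missing check would look something like this (a sketch only, following ggml's naming style; not the actual ggml_backend_cuda_supports_op code or the eventual fix):

    /* Illustrative sketch only: if the CUDA backend reported non-contiguous
       norm inputs as unsupported, ggml_backend_sched would fall back to the
       CPU for those nodes instead of hitting the GGML_ASSERT in
       ggml-cuda/norm.cu. */
    #include "ggml.h"

    static bool cuda_supports_op_sketch(const struct ggml_tensor * op) {
        switch (op->op) {
            case GGML_OP_NORM:
            case GGML_OP_RMS_NORM:
                /* the CUDA norm kernels assume a packed src0 layout */
                return ggml_is_contiguous(op->src[0]);
            default:
                return true; /* other ops omitted from this sketch */
        }
    }

Either tightening the supports_op report in this way (so the node falls back to the CPU) or making the CUDA norm kernels handle non-contiguous inputs would avoid the assert.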

vineel96 commented 6 hours ago

@slaren, thanks. Can you point me to the files where the CUDA implementations are defined? Is there any plan/PR in progress for CUDA support for Mamba models? Also, does Mamba have full kernel support for the CPU-specific operations (like Mamba's parallel scan algorithm)?