ggerganov / llama.cpp

LLM inference in C/C++

Segmentation Fault on GPU #7337

Open djain-fujitsu opened 1 month ago

djain-fujitsu commented 1 month ago

When I try to run the following finetuning command on the GPU:

nohup ../build/bin/finetune --model-base llama-3b-Q5_0.gguf \
  --train-data "shakespeare.txt" \
  --save-every 1 --adam-iter 2 --batch 4 --ctx 4 \
  --lora-out ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/lora.bin \
  --checkpoint-in ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/checkpoint.gguf \
  --checkpoint-out ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/checkpoint-ITERATION.gguf \
  -ngl 33 \
  > ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/training_logs.out

I get a segmentation fault, with the nohup.out file growing without bound:

llama_model_loader: loaded meta data with 24 key-value pairs and 237 tensors from llama-3b-Q5_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = llama
llama_model_loader: - kv  1: general.name str = models
llama_model_loader: - kv  2: llama.context_length u32 = 2048
llama_model_loader: - kv  3: llama.embedding_length u32 = 3200
llama_model_loader: - kv  4: llama.block_count u32 = 26
llama_model_loader: - kv  5: llama.feed_forward_length u32 = 8640
llama_model_loader: - kv  6: llama.rope.dimension_count u32 = 100
llama_model_loader: - kv  7: llama.attention.head_count u32 = 32
llama_model_loader: - kv  8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv  9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 8
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,60820] = ["▁ t", "▁ a", "i n", "h e", "▁...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32:  53 tensors
llama_model_loader: - type q5_0: 183 tensors
llama_model_loader: - type q8_0:   1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 3200
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 26
llm_load_print_meta: n_rot = 100
llm_load_print_meta: n_embd_head_k = 100
llm_load_print_meta: n_embd_head_v = 100
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3200
llm_load_print_meta: n_embd_v_gqa = 3200
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8640
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q5_0
llm_load_print_meta: model params = 3.43 B
llm_load_print_meta: model size = 2.23 GiB (5.59 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA T4G, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.24 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors: CPU buffer size = 67.14 MiB
llm_load_tensors: CUDA0 buffer size = 2216.65 MiB
...............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 162.50 MiB
llama_new_context_with_model: KV self size = 162.50 MiB, K (f16): 81.25 MiB, V (f16): 81.25 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 68.75 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 7.26 MiB
llama_new_context_with_model: graph nodes = 838
llama_new_context_with_model: graph splits = 2
main: seed: 1715928042
main: model base = 'llama-3b-Q5_0.gguf'
main: init model
print_params: n_vocab : 32000
print_params: n_ctx : 4
print_params: n_embd : 3200
print_params: n_ff : 8640
print_params: n_head : 32
print_params: n_head_kv : 32
print_params: n_layer : 26
print_params: norm_rms_eps : 0.000001
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq : 4
print_lora_params: n_rank_wk : 4
print_lora_params: n_rank_wv : 4
print_lora_params: n_rank_wo : 4
print_lora_params: n_rank_ffn_norm : 1
print_lora_params: n_rank_ffn_gate : 4
print_lora_params: n_rank_ffn_down : 4
print_lora_params: n_rank_ffn_up : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm : 1
print_lora_params: n_rank_output : 4
main: total train_iterations 0
main: seen train_samples 0
main: seen train_tokens 0
main: completed train_epochs 0
main: lora_size = 54844064 bytes (52.3 MB)
main: opt_size = 81694048 bytes (77.9 MB)
main: opt iter 0
main: input_size = 2048096 bytes (2.0 MB)
main: compute_size = 846062208 bytes (806.9 MB)
main: evaluation order = RIGHT_TO_LEFT
main: tokenize training data from shakespeare.txt
main: sample-start:
main: include-sample-start: false
tokenize_file: total number of samples: 26826
main: number of training tokens: 26830
main: number of unique tokens: 3320
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter= 0 sample=1/26826 sched=0.000000 loss=0.000000 |----------------------------------------
[output continues with '-' repeated indefinitely]

It gets stuck printing the '-' character without making any progress, and eventually ends in a segmentation fault.

arnfaldur commented 1 month ago

The bug template says:

> Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

Why did you ignore it?

djain-fujitsu commented 1 month ago

> The bug template says:
>
> > Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.
>
> Why did you ignore it?

Thanks @arnfaldur for the reply. Here are the details about my system and the steps to reproduce the bug:

System Info:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

Caches (sum of all):
L1d: 512 KiB (8 instances)
L1i: 512 KiB (8 instances)
L2: 8 MiB (8 instances)
L3: 32 MiB (1 instance)

NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7

RAM: 16 GB

GPU(s):
Model: NVIDIA T4G
Driver Version: 545.23.08
CUDA Version: 12.3
Memory: 16 GB

llama.cpp version: Not sure, as I just followed the steps in the GitHub README.md (I'd appreciate it if someone can guide me on how to obtain it).

cmake version : 3.22.1

Steps to reproduce the bug:

  1. git clone https://github.com/ggerganov/llama.cpp.git

  2. cd ..../llama.cpp/

  3. Build the binaries:

     mkdir build
     cd build
     cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
     cmake --build . --config Release
     cd ..

  4. cd ./models

  5. Download the Shakespeare text file: wget https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt

  6. Download the GGUF file from the following link (see the note after this list if this only fetches HTML): wget -r https://huggingface.co/gultar/OpenHermes-Llama-3b-GGUF/tree/main

  7. Run the finetune command from the top of this issue:

     nohup ../build/bin/finetune --model-base llama-3b-Q5_0.gguf \
       --train-data "shakespeare.txt" \
       --save-every 1 --adam-iter 2 --batch 4 --ctx 4 \
       --lora-out ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/lora.bin \
       --checkpoint-in ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/checkpoint.gguf \
       --checkpoint-out ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/checkpoint-ITERATION.gguf \
       -ngl 33 \
       > ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/training_logs.out
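Note on step 6: wget -r against a Hugging Face /tree/ page may pull in HTML listing pages rather than just the model file. Direct file downloads on Hugging Face normally use a /resolve/main/ URL; assuming the filename used in the command above, the direct link would look something like:

wget https://huggingface.co/gultar/OpenHermes-Llama-3b-GGUF/resolve/main/llama-3b-Q5_0.gguf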

arnfaldur commented 1 month ago

> Why did you ignore it?

I got a little snarky; I'm sorry about that.

If you run the main executable or the server, it prints the build number like so:

$ ./main
Log start
main: build = 2936 (5ca49cbe)

or

$ ./server
{"tid":"124519658393600","timestamp":1716210643,"level":"INFO","function":"main","line":2943,"msg":"build info","build":2936,"commit":"5ca49cbe"}

I'm afraid I don't know much about the training logic so I can't help you there.
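If you want to gather more information in the meantime, one standard approach (just a sketch, I haven't tried it against your setup) is to run the same finetune command under gdb and grab a backtrace when it crashes, which would at least show which function is faulting:

$ gdb --args ../build/bin/finetune --model-base llama-3b-Q5_0.gguf --train-data "shakespeare.txt" --adam-iter 2 --batch 4 --ctx 4 -ngl 33
(gdb) run
# ... wait for the segmentation fault, then:
(gdb) bt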

djain-fujitsu commented 1 month ago

> > Why did you ignore it?
>
> I got a little snarky; I'm sorry about that.
>
> If you run the main executable or the server, it prints the build number like so:
>
> $ ./main
> Log start
> main: build = 2936 (5ca49cbe)
>
> or
>
> $ ./server
> {"tid":"124519658393600","timestamp":1716210643,"level":"INFO","function":"main","line":2943,"msg":"build info","build":2936,"commit":"5ca49cbe"}
>
> I'm afraid I don't know much about the training logic so I can't help you there.

It's ok, buddy. So I got this as my build number:

$ ./main
Log start
main: build = 2782 (60325fa5)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

arnfaldur commented 1 month ago

That's a fairly new build. It can't hurt to update to the latest and retry; this repo is moving very fast. It's worth a shot, though it might not help.
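For reference, pulling the latest and rebuilding should be roughly this, assuming the build directory from your steps above:

cd llama.cpp
git pull
cd build
cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
cmake --build . --config Release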

djain-fujitsu commented 1 month ago

> That's a fairly new build. It can't hurt to update to the latest and retry; this repo is moving very fast. It's worth a shot, though it might not help.

Ok, so do you mean I should give the build below a try?

$ ./main
Log start
main: build = 2936 (5ca49cbe)

arnfaldur commented 1 month ago

Yes. I mean that it's worth trying if it's not a lot of work. It's not very likely to solve the issue, but there's a chance.

djain-fujitsu commented 3 weeks ago

I tried the latest build as well, but I am still getting the same error.

djain-fujitsu commented 1 week ago

Still not solved; I am getting the same error.