Closed cabfile closed 1 month ago
I'm facing a very similar problem. Here is what I'm trying to do; it's almost a copy-paste from the original finetune README:
.\llama-b3058-bin-win-avx2-x64\finetune.exe --model-base .\models\llama3-8b-inst.gguf --checkpoint-in chk-lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.gguf --checkpoint-out chk-lora-open-llama-3b-v2-q8_0-shakespeare-ITERATION.gguf --lora-out lora-open-llama-3b-v2-q8_0-shakespeare-ITERATION.bin --train-data "shake.txt" --save-every 10 --threads 12 --adam-iter 30 --batch 4 --ctx 64 --use-checkpointing
and the result is:
...
main: number of unique tokens: 3621
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 6157072 bytes (5.9 MB)
train_opt_callback: iter= 0 sample=1/22783 sched=0.000000 loss=0.000000 |->
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:12849: ne2 == ne02
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:12849: ne2 == ne02
and then the process exits. I also tried the CLBlast build, with the same result.
The model I'm trying to finetune is this: https://huggingface.co/SanctumAI/Meta-Llama-3-8B-Instruct-GGUF version q6_k
Edit: I checked the q8 version of the same model; the result is the same.
About sample=1/22783: the counter starts at sample=0/22783, then switches to sample=1/22783, and a few seconds later it crashes as above.
I see the same issue with a few other models (phi, mistral). Here is one example; main works fine.
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ~/projects/models/meta-llama/Meta-Llama-3-8B-Instruct
python ./llama.cpp/convert-hf-to-gguf.py \
--outfile ~/projects/models/meta-llama/Meta-Llama-3-8B-Instruct/Meta-Llama-3-8B-Instruct.gguf \
~/projects/models/meta-llama/Meta-Llama-3-8B-Instruct --outtype=q8_0
./llama.cpp/main -i -m ~/projects/models/meta-llama/Meta-Llama-3-8B-Instruct/Meta-Llama-3-8B-Instruct.gguf
./llama.cpp/finetune \
--model-base ~/projects/models/meta-llama/Meta-Llama-3-8B-Instruct/Meta-Llama-3-8B-Instruct.gguf \
--checkpoint-in chk-lora-Meta-Llama-3-8B-Instruct-shakespeare-LATEST.gguf \
--checkpoint-out chk-lora-Meta-Llama-3-8B-Instruct-shakespeare-ITERATION.gguf \
--lora-out lora-Meta-Llama-3-8B-Instruct-shakespeare-ITERATION.bin \
--train-data "shakespeare.txt" \
--save-every 10 \
--threads 40 --adam-iter 30 --batch 4 --ctx 64 \
--use-checkpointing
main: seed: 1717387035
main: model base = '/home/alyas/projects/models/meta-llama/Meta-Llama-3-8B-Instruct/Meta-Llama-3-8B-Instruct.gguf'
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/alyas/projects/models/meta-llama/Meta-Llama-3-8B-Instruct/Meta-Llama-3-8B-Instruct.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 1.5928 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: CPU buffer size = 8137.64 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
main: init model
print_params: n_vocab : 128256
print_params: n_ctx : 64
print_params: n_embd : 4096
print_params: n_ff : 14336
print_params: n_head : 32
print_params: n_head_kv : 8
print_params: n_layer : 32
print_params: norm_rms_eps : 0.000010
print_params: rope_freq_base : 500000.000000
print_params: rope_freq_scale : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq : 4
print_lora_params: n_rank_wk : 4
print_lora_params: n_rank_wv : 4
print_lora_params: n_rank_wo : 4
print_lora_params: n_rank_ffn_norm : 1
print_lora_params: n_rank_ffn_gate : 4
print_lora_params: n_rank_ffn_down : 4
print_lora_params: n_rank_ffn_up : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm : 1
print_lora_params: n_rank_output : 4
main: total train_iterations 0
main: seen train_samples 0
main: seen train_tokens 0
main: completed train_epochs 0
main: lora_size = 94956320 bytes (90.6 MB)
main: opt_size = 141731824 bytes (135.2 MB)
main: opt iter 0
main: input_size = 131335200 bytes (125.3 MB)
main: compute_size = 6164070752 bytes (5878.5 MB)
main: evaluation order = RIGHT_TO_LEFT
main: tokenize training data from shakespeare.txt
main: sample-start:
main: include-sample-start: false
tokenize_file: total number of samples: 22783
main: number of training tokens: 22847
main: number of unique tokens: 3621
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 20523648 bytes (19.6 MB)
train_opt_callback: iter= 0 sample=1/22783 sched=0.000000 loss=0.000000 |->
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
Find the first commit that stops working
git reset --hard HEAD~100
HEAD is now at efc8f767 move ndk code to a new library (#6951)
With this commit it works:
...
train_opt_callback: iter= 0 sample=1/28013 sched=0.000000 loss=0.000000 |>
train_opt_callback: iter= 1 sample=5/28013 sched=0.010000 loss=9.638092
...
With HEAD~99 it does not work.
Does it work with -nkvo?
I don't think the -nkvo parameter is present in finetune. However, I recompiled everything, forcing bool no_kv_offload = true; in common.h, but it still doesn't work.
If it helps, from commit c4ec9c0d to commit 3cbd23ed the error is different: GGML_ASSERT: examples/finetune/finetune.cpp:646: false && "TODO: ggml_flash_attn_ext() not yet supported"
Any updates on this issue? I'm facing the same problem unfortunately.
if it helps, from commit c4ec9c0 to commit 3cbd23e the error is different, and it is this: GGML_ASSERT: examples/finetune/finetune.cpp:646: false && "TODO: ggml_flash_attn_ext() not yet supported"
~This is not related. A fix for the flash_attention flag (ref: 9588f196b1d7b21bdff013fcf958c249576b2619) was merged a few days ago, and the default is now false. You can use the --no-flash option.~
UPDATED: Actually, flash_attention is related to this issue. See the comments below.
I have the same issue on Linux. Llama3-finetuned models always hit this error, but inference (main -m <model_gguf>) is okay.
Only fine-tuning the open_llama_3b_v2 model works.
Find the first commit that causes GGML_ASSERT: ggml.c:12849: ne2 == ne02
I'm git-bisecting this. Quite hard to find.
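For anyone repeating the bisect, a helper along these lines could let `git bisect run` drive it automatically. This is only a sketch: `classify_log` is a hypothetical name, the patterns are copied from the log lines quoted in this thread, and treating the flash-attention TODO assert (or a failed build) as "skip" is an assumption about how one would want to handle that range of commits. `git bisect run` treats exit code 0 as good, 125 as skip, and any other code from 1 to 127 as bad.

```shell
#!/bin/sh
# classify_log FILE
# Map a captured finetune log to an exit code understood by `git bisect run`.
# Hypothetical helper; the patterns come from the logs posted in this issue.
classify_log() {
    log=$1
    if grep -q 'ne2 == ne02' "$log"; then
        return 1      # bad: the assertion from this issue fired
    elif grep -q 'train_opt_callback: iter= *1 ' "$log"; then
        return 0      # good: training got past the first iteration
    else
        return 125    # skip: build failure, flash_attn TODO assert, etc.
    fi
}
```

A wrapper script would rebuild, run finetune with its output redirected to a file, call `classify_log` on that file, and exit with its status. Checking for the assertion first means a log that contains both a progress line and the assert is still classified as bad.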
The --no-flash option also triggers the error below on efc8f767. Without that option, finetuning seems to work.
GGML_ASSERT: llama.cpp/ggml.c:12262: ne2 == ne02
The latest commits (where --no-flash is the default, according to 9588f196b1d7b21bdff013fcf958c249576b2619) produce the error below without the --no-flash option. So d48c88cbd563b6cf0ce972e2f56796896e240736 may be the cause of this issue.
GGML_ASSERT: llama.cpp/examples/finetune/finetune.cpp:646: false && "TODO: ggml_flash_attn_ext() not yet supported"
When I rebuild e84b71c2c6da6e69c8f815168ea836f9716a325e and run it, training works. But I'm not sure it works properly, because FA-related commits were being merged frequently.
cd llama.cpp
git checkout d48c88cbd563b6cf0ce972e2f56796896e240736^
rm -rf build
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build build --config Debug -j
build/bin/finetune \
--model-base $model \
--train-data shakespeare.txt \
--lora-out lora.gguf \
--seed 1
@ggerganov Check out d48c88cbd563b6cf0ce972e2f56796896e240736
With --no-flash on d48c88cbd563b6cf0ce972e2f56796896e240736:
GGML_ASSERT: llama.cpp/ggml.c:12809: ne2 == ne02
GGML_ASSERT: llama-cpp/llama.cpp/ggml.c:12809: ne2 == ne02
GGML_ASSERT: llama-cpp/llama.cpp/ggml.c:12809: ne2 == ne02
GGML_ASSERT: llama-cpp/llama.cpp/ggml.c:12809: ne2 == ne02
GGML_ASSERT: llama-cpp/llama.cpp/ggml.c:12809: ne2 == ne02
GGML_ASSERT: llama-cpp/llama.cpp/ggml.c:12809: ne2 == ne02
Without --no-flash (default) on d48c88cbd563b6cf0ce972e2f56796896e240736:
GGML_ASSERT: llama.cpp/examples/finetune/finetune.cpp:646: false && "TODO: ggml_flash_attn_ext() not yet supported"
Related to #7523
I think we should try with small base models and scale up to those that cause problems. For example, with https://huggingface.co/Maykeye/TinyLLama-v0 finetune works correctly.
@hwiorn Did you find anything? The problem seems to be only with llama3... Did it ever work with llama3 models?
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:12853: ne2 == ne02
Name and Version
What operating system are you seeing the problem on?
Windows
Relevant log output