ggerganov / llama.cpp


Bug: can't finetune #7643

Closed cabfile closed 1 month ago

cabfile commented 4 months ago

What happened?

GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:12853: ne2 == ne02

Name and Version

version: 2965 (03d8900e)
built with MSVC 19.39.33523.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

E:\slm\llama\other>finetune --model-base ..\..\tinyllama-1.1b-chat-v0.6-q4_0_2.gguf --checkpoint-in chk-piss-LATEST.gguf --checkpoint-out chk-piss-ITERATION.gguf --lora-out piss-ITERATION.bin --train-data traindata.txt --save-every 10 --threads 4 --adam-iter 30 --batch 4 --ctx 64 --use-checkpointing
main: seed: 1717079846
main: model base = '..\..\tinyllama-1.1b-chat-v0.6-q4_0_2.gguf'
llama_model_loader: loaded meta data with 21 key-value pairs and 201 tensors from ..\..\tinyllama-1.1b-chat-v0.6-q4_0_2.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type q4_0:  155 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 259.
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 1.10 B
llm_load_print_meta: model size       = 606.53 MiB (4.63 BPW)
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.10 MiB
llm_load_tensors:        CPU buffer size =   606.53 MiB
.....................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    11.00 MiB
llama_new_context_with_model: KV self size  =   11.00 MiB, K (f16):    5.50 MiB, V (f16):    5.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    66.50 MiB
llama_new_context_with_model: graph nodes  = 710
llama_new_context_with_model: graph splits = 1
main: init model
print_params: n_vocab               : 32000
print_params: n_ctx                 : 64
print_params: n_embd                : 2048
print_params: n_ff                  : 5632
print_params: n_head                : 32
print_params: n_head_kv             : 4
print_params: n_layer               : 22
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq             : 4
print_lora_params: n_rank_wk             : 4
print_lora_params: n_rank_wv             : 4
print_lora_params: n_rank_wo             : 4
print_lora_params: n_rank_ffn_norm       : 1
print_lora_params: n_rank_ffn_gate       : 4
print_lora_params: n_rank_ffn_down       : 4
print_lora_params: n_rank_ffn_up         : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm           : 1
print_lora_params: n_rank_output         : 4
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: lora_size = 28472224 bytes (27.2 MB)
main: opt_size  = 42223360 bytes (40.3 MB)
main: opt iter 0
main: input_size = 32769056 bytes (31.3 MB)
main: compute_size = 1507336544 bytes (1437.5 MB)
main: evaluation order = RIGHT_TO_LEFT
main: tokenize training data from traindata.txt
main: sample-start:
main: include-sample-start: false
tokenize_file: total number of samples: 1
main: number of training tokens: 12
main: number of unique tokens: 12
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 512240 bytes (0.5 MB)
train_opt_callback: iter=     0 sample=1/1 sched=0.000000 loss=0.000000 |->
train_opt_callback: reshuffle samples. completed epochs: 1
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:12853: ne2 == ne02
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:12853: ne2 == ne02
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:12853: ne2 == ne02
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:12853: ne2 == ne02
adrian-afl commented 4 months ago

I'm facing a very similar problem. Here is what I'm trying to do; it's almost a copy-paste from the original finetune README:

.\llama-b3058-bin-win-avx2-x64\finetune.exe --model-base .\models\llama3-8b-inst.gguf --checkpoint-in chk-lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.gguf --checkpoint-out chk-lora-open-llama-3b-v2-q8_0-shakespeare-ITERATION.gguf --lora-out lora-open-llama-3b-v2-q8_0-shakespeare-ITERATION.bin --train-data "shake.txt" --save-every 10 --threads 12 --adam-iter 30 --batch 4 --ctx 64 --use-checkpointing

and the result is:

...
main: number of unique tokens: 3621
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 6157072 bytes (5.9 MB)
train_opt_callback: iter=     0 sample=1/22783 sched=0.000000 loss=0.000000 |->
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:12849: ne2 == ne02
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:12849: ne2 == ne02

and then the process exits. I also tried the CLBlast version, with the same result.

The model I'm trying to finetune is this one: https://huggingface.co/SanctumAI/Meta-Llama-3-8B-Instruct-GGUF (q6_k version).

Edit: I checked the q8 version of the same model and the result is the same. The counter starts at sample=0/22783, then switches to sample=1/22783, and a few seconds later it crashes as above.

alyas77 commented 4 months ago

The same issue occurs with a few other models (Phi, Mistral). Here is one of the examples; running main works fine:

huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct  --local-dir ~/projects/models/meta-llama/Meta-Llama-3-8B-Instruct

python ./llama.cpp/convert-hf-to-gguf.py \
    --outfile ~/projects/models/meta-llama/Meta-Llama-3-8B-Instruct/Meta-Llama-3-8B-Instruct.gguf \
    ~/projects/models/meta-llama/Meta-Llama-3-8B-Instruct --outtype=q8_0

./llama.cpp/main -i -m ~/projects/models/meta-llama/Meta-Llama-3-8B-Instruct/Meta-Llama-3-8B-Instruct.gguf

./llama.cpp/finetune \
--model-base ~/projects/models/meta-llama/Meta-Llama-3-8B-Instruct/Meta-Llama-3-8B-Instruct.gguf \
--checkpoint-in  chk-lora-Meta-Llama-3-8B-Instruct-shakespeare-LATEST.gguf \
--checkpoint-out chk-lora-Meta-Llama-3-8B-Instruct-shakespeare-ITERATION.gguf \
--lora-out lora-Meta-Llama-3-8B-Instruct-shakespeare-ITERATION.bin \
--train-data "shakespeare.txt" \
--save-every 10 \
--threads 40 --adam-iter 30 --batch 4 --ctx 64 \
--use-checkpointing
main: seed: 1717387035
main: model base = '/home/alyas/projects/models/meta-llama/Meta-Llama-3-8B-Instruct/Meta-Llama-3-8B-Instruct.gguf'
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/alyas/projects/models/meta-llama/Meta-Llama-3-8B-Instruct/Meta-Llama-3-8B-Instruct.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 1.5928 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  8137.64 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =   258.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
main: init model
print_params: n_vocab               : 128256
print_params: n_ctx                 : 64
print_params: n_embd                : 4096
print_params: n_ff                  : 14336
print_params: n_head                : 32
print_params: n_head_kv             : 8
print_params: n_layer               : 32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 500000.000000
print_params: rope_freq_scale       : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq             : 4
print_lora_params: n_rank_wk             : 4
print_lora_params: n_rank_wv             : 4
print_lora_params: n_rank_wo             : 4
print_lora_params: n_rank_ffn_norm       : 1
print_lora_params: n_rank_ffn_gate       : 4
print_lora_params: n_rank_ffn_down       : 4
print_lora_params: n_rank_ffn_up         : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm           : 1
print_lora_params: n_rank_output         : 4
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: lora_size = 94956320 bytes (90.6 MB)
main: opt_size  = 141731824 bytes (135.2 MB)
main: opt iter 0
main: input_size = 131335200 bytes (125.3 MB)
main: compute_size = 6164070752 bytes (5878.5 MB)
main: evaluation order = RIGHT_TO_LEFT
main: tokenize training data from shakespeare.txt
main: sample-start: 
main: include-sample-start: false
tokenize_file: total number of samples: 22783
main: number of training tokens: 22847
main: number of unique tokens: 3621
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 20523648 bytes (19.6 MB)
train_opt_callback: iter=     0 sample=1/22783 sched=0.000000 loss=0.000000 |->
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
GGML_ASSERT: ggml.c:12849: ne2 == ne02
ggerganov commented 4 months ago

Find the first commit that stops working

opensignature commented 4 months ago

Find the first commit that stops working

git reset --hard HEAD~100
HEAD is now at efc8f767 move ndk code to a new library (#6951)

With this commit it works: ... train_opt_callback: iter= 0 sample=1/28013 sched=0.000000 loss=0.000000 |> train_opt_callback: iter= 1 sample=5/28013 sched=0.010000 loss=9.638092 ... With HEAD~99 it does not work.

ggerganov commented 4 months ago

Does it work with -nkvo?

opensignature commented 4 months ago

Does it work with -nkvo?

I don't think the -nkvo parameter is present in finetune. However, I recompiled everything, forcing bool no_kv_offload = true; in common.h, and it still doesn't work.
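
Roughly what I did (just a sketch; the field is the one mentioned above, and I rebuilt with CMake):

# flip the default in common.h, then rebuild everything
#   bool no_kv_offload = true;   // was false
cmake -B build
cmake --build build -j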

opensignature commented 4 months ago

If it helps: from commit c4ec9c0d to commit 3cbd23ed the error is different, and it is this: GGML_ASSERT: examples/finetune/finetune.cpp:646: false && "TODO: ggml_flash_attn_ext() not yet supported"

LucaKoval commented 4 months ago

Any updates on this issue? I'm facing the same problem unfortunately.

hwiorn commented 4 months ago

if it helps, from commit c4ec9c0 to commit 3cbd23e the error is different, and it is this: GGML_ASSERT: examples/finetune/finetune.cpp:646: false && "TODO: ggml_flash_attn_ext() not yet supported"

~This is not related. There was a fix for the flash_attention flag (ref: 9588f196b1d7b21bdff013fcf958c249576b2619) a few days ago. The default is now false. You can use the --no-flash option.~

UPDATED: Actually, flash_attention is related to this issue. See the comments below.

hwiorn commented 4 months ago

I have the same issue on Linux. Finetuning Llama 3 models always hits this error, but prediction (main -m <model_gguf>) is okay.

Only fine-tuning the open_llama_3b_v2 model works okay.

ggerganov commented 4 months ago

Find the first commit that causes GGML_ASSERT: ggml.c:12849: ne2 == ne02

hwiorn commented 4 months ago

I'm git-bisecting this. Quite hard to find.
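
For reference, this is roughly the setup (a sketch; $model and the short test flags are mine, and efc8f767 is the known-good commit from the earlier comment):

git bisect start
git bisect bad master
git bisect good efc8f767
# at each bisect step: rebuild, run a short finetune, and report the result to git
cmake -B build && cmake --build build --config Release -j
if build/bin/finetune --model-base $model --train-data shakespeare.txt \
     --lora-out lora.gguf --adam-iter 1 --ctx 64 --threads 12; then
  git bisect good
else
  git bisect bad
fi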

The --no-flash option also produces the error below on efc8f767. Without that option, finetuning seems to work.

GGML_ASSERT: llama.cpp/ggml.c:12262: ne2 == ne02

The latest commits (--no-flash is the default according to 9588f196b1d7b21bdff013fcf958c249576b2619) produce the error below without the --no-flash option. So d48c88cbd563b6cf0ce972e2f56796896e240736 may be what causes this issue.

GGML_ASSERT: llama.cpp/examples/finetune/finetune.cpp:646: false && "TODO: ggml_flash_attn_ext() not yet supported"

When I rebuild e84b71c2c6da6e69c8f815168ea836f9716a325e and run it, training works. But I'm not sure whether it works properly, because FA-related commits were merged frequently.

cd llama.cpp
# check out the parent of the suspect commit d48c88cb
git checkout d48c88cbd563b6cf0ce972e2f56796896e240736^
rm -rf build
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build build --config Debug -j
# $model points to the base GGUF being finetuned
build/bin/finetune \
  --model-base $model \
  --train-data shakespeare.txt \
  --lora-out lora.gguf \
  --seed 1

@ggerganov Check out d48c88cbd563b6cf0ce972e2f56796896e240736

hwiorn commented 4 months ago

With --no-flash on d48c88cbd563b6cf0ce972e2f56796896e240736,

GGML_ASSERT: llama.cpp/ggml.c:12809: ne2 == ne02
GGML_ASSERT: llama-cpp/llama.cpp/ggml.c:12809: ne2 == ne02
GGML_ASSERT: llama-cpp/llama.cpp/ggml.c:12809: ne2 == ne02
GGML_ASSERT: llama-cpp/llama.cpp/ggml.c:12809: ne2 == ne02
GGML_ASSERT: llama-cpp/llama.cpp/ggml.c:12809: ne2 == ne02
GGML_ASSERT: llama-cpp/llama.cpp/ggml.c:12809: ne2 == ne02

Without --no-flash (the default) on d48c88cbd563b6cf0ce972e2f56796896e240736,

GGML_ASSERT: llama.cpp/examples/finetune/finetune.cpp:646: false && "TODO: ggml_flash_attn_ext() not yet supported"

hwiorn commented 4 months ago

Related to #7523

opensignature commented 4 months ago

I think we should try small base models first and scale up to the ones that cause problems. For example, with https://huggingface.co/Maykeye/TinyLLama-v0 finetune works correctly (see the sketch below).
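
Roughly what I ran for the small-model test (a sketch; the output file names are placeholders, and the conversion/finetune flags follow the earlier commands in this thread, with f16 output just as a choice):

huggingface-cli download Maykeye/TinyLLama-v0 --local-dir models/TinyLLama-v0
python ./llama.cpp/convert-hf-to-gguf.py --outfile models/TinyLLama-v0/TinyLLama-v0.gguf \
    --outtype f16 models/TinyLLama-v0
./llama.cpp/finetune --model-base models/TinyLLama-v0/TinyLLama-v0.gguf \
    --train-data shakespeare.txt --lora-out lora-tinyllama.bin \
    --adam-iter 30 --batch 4 --ctx 64 --threads 12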

Spider-netizen commented 2 months ago

@hwiorn

Did you find anything? The problem seems to be only with Llama 3... Did it ever work with Llama 3 models?

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.