Open disperaller opened 6 months ago
This is a weird issue. I'll try it myself very soon. Could you please share your training log? Is the loss for more than 1 epoch normal?
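One way to answer the question of whether the loss past the first epoch looks normal is to average the logged training loss per epoch. The sketch below is only an illustration, not part of the long-llm code: it assumes the log is available as a list of dicts with `loss` and `epoch` keys, which is the shape of the `log_history` list that the Hugging Face `Trainer` keeps in `trainer_state.json`. The helper names `mean_loss_per_epoch` and `load_log_history` are hypothetical.

```python
import json
import math
from collections import defaultdict


def mean_loss_per_epoch(log_history):
    """Average the logged training loss within each epoch.

    Assumption: `log_history` is a list of dicts carrying "loss" and "epoch"
    keys. Entries without a "loss" key (e.g. eval or summary rows) are skipped.
    Returns a dict mapping a zero-based epoch index to the mean loss.
    """
    buckets = defaultdict(list)
    for entry in log_history:
        if "loss" not in entry or "epoch" not in entry:
            continue
        # epoch values in (0, 1] belong to the first pass (index 0),
        # values in (1, 2] to the second pass (index 1), and so on.
        epoch_index = max(math.ceil(entry["epoch"]) - 1, 0)
        buckets[epoch_index].append(entry["loss"])
    return {e: sum(v) / len(v) for e, v in sorted(buckets.items())}


def load_log_history(state_path):
    """Read the log list from a trainer_state.json-style file (assumed layout)."""
    with open(state_path) as f:
        return json.load(f)["log_history"]
```

Calling `mean_loss_per_epoch(load_log_history(path))` on the state file of a 2- or 3-epoch run would show whether the loss collapses or spikes after the first epoch boundary. A sharp drop toward zero in the second epoch would point to memorization of the training data rather than a bug in the script.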
Hi, I overwrote that log with the 1-epoch model's training log, but I believe you can reproduce the error with the following setup:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NCCL_BLOCKING_WAIT=0
export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1

output_name=l3_8b_1epoch_8gpu_t1

unsloth/bin/torchrun \
    --master_addr localhost \
    --master_port 6667 \
    --nnodes 1 \
    --node_rank 0 \
    --nproc_per_node 8 \
    train.py \
    --data_root data/long-llm \
    --output_dir model/$output_name \
    --model_name_or_path /mnt/cpfs-data/mashiyao/MODEL/LLaMa3-Instruct \
    --train_data "long-llm:gpt/one_detail_book.train.64K.json long-llm:gpt/one_detail_paper.train.64K.json long-llm:gpt/multi_detail_book.train.json long-llm:gpt/multi_detail_paper_short.train.json long-llm:gpt/multi_detail_paper_long.train.json long-llm:gpt/bio_book.train.json long-llm:longalpaca/train.json long-llm:redpajama/train.json[5000]" \
    --max_length 81920 \
    --group_by_length \
    --rope_theta 200e6 \
    --attn_impl flash_attention_2 \
    --gradient_checkpointing \
    --use_reentrant True \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --save_only_model \
    --save_strategy epoch \
    --logging_steps 5 \
    --bf16 \
    --lora_tune \
    --lora_extra_params embed_tokens \
    --load_in_4_bit \
    --chat_template llama-3
Also, I'd like to ask whether this training method can scale to the 70B Llama 3 Instruct model with the same hyperparameter settings.
I think it would work. Please report your result here if you would like to have a try :)
@namespace-Pt Hi, have you tried fine-tuning this way and encountered the same issue?
If we follow the long-llm script settings, where the parameter num_train_epochs is set to 1, the fine-tuned model shows a significant improvement over the original model. However, if we set the parameter to a value larger than 1 (I've tried 2 and 3), the resulting model is total garbage. The first picture shows the prediction for a prompt from the model trained for only 1 epoch. The second picture shows the prediction for the same prompt from the model trained for 3 epochs. Something is not right here, as I don't believe more epochs should lead to dramatically worse results.
In addition, I've tried the following:
I'd really appreciate it if someone could provide some insight on this. Thanks in advance.