FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License
7.51k stars · 540 forks

long-llm run for more than 1 epoch #855

Open disperaller opened 5 months ago

disperaller commented 5 months ago

If we follow the long-llm script settings, where num_train_epochs is set to 1, training yields a really significant improvement over the original model. However, if we raise the parameter above 1 (I've tried 2 and 3), the resulting model is total garbage. The first screenshot shows a prompt's prediction from the model trained for only 1 epoch; the second shows the same prompt's prediction from the 3-epoch model. Something is not right here, as I don't believe more epochs should lead to dramatically worse results.

*(screenshots: 1-epoch prediction vs. 3-epoch prediction)*

In addition, I've tried the following:

  1. Train a LoRA adapter on the original model for 1 epoch.
  2. Merge the adapter back into the original model; call it model 1.
  3. Train another LoRA adapter on model 1, also for 1 epoch.
  4. Merge that adapter back into model 1; call it model 2.
  5. Evaluate both model 1 and model 2: model 1 shows really good results compared to the original base model, whereas model 2 is again garbage, spitting out nonsensical, repetitive output (screenshot below).

Does anyone know why this happens? All other parameters remained the same across experiments. It seems this long-llm script only works in the 1-epoch setting, which is weird.

*(screenshot: model 2's repetitive output)*
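For reference, the merge in steps 2 and 4 can be sketched in plain NumPy. This is only an illustration of the LoRA merge arithmetic, W' = W + (alpha/r)·B·A, not the actual peft code; the dimensions, rank, and alpha below are made-up illustrative values. It shows that two sequential merges simply add two low-rank updates onto the base weight:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4          # hidden size, LoRA rank, LoRA alpha (illustrative)

W = rng.normal(size=(d, d))    # frozen base weight

# First adapter, trained on the base model, then merged -> "model 1".
A1 = rng.normal(size=(r, d))
B1 = np.zeros((d, r))          # B is initialized to zero in LoRA
B1[:, 0] = 1.0                 # pretend training moved B off zero
W1 = W + (alpha / r) * B1 @ A1

# Second adapter, trained on model 1, then merged -> "model 2".
A2 = rng.normal(size=(r, d))
B2 = rng.normal(size=(d, r))
W2 = W1 + (alpha / r) * B2 @ A2

# The two merges compound: model 2 carries both low-rank updates.
delta = W2 - W
assert np.allclose(delta, (alpha / r) * (B1 @ A1 + B2 @ A2))
```

So arithmetically the two-stage merge is well-defined; if model 2 still degrades, the problem is more likely in the second round of training itself than in the merging.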

I'd really appreciate it if someone could provide some insight on this. Thanks in advance.

namespace-Pt commented 5 months ago

This is a weird issue. I'll try it myself very soon. Could you please share your training log? Is the loss for more than 1 epoch normal?

disperaller commented 5 months ago

> This is a weird issue. I'll try it myself very soon. Could you please share your training log? Is the loss for more than 1 epoch normal?

Hi, I overwrote that log with the 1-epoch model's training log, but with the following setup I believe you can reproduce the error:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NCCL_BLOCKING_WAIT=0
export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1

output_name=l3_8b_1epoch_8gpu_t1

unsloth/bin/torchrun \
    --master_addr localhost \
    --master_port 6667 \
    --nnodes 1 \
    --node_rank 0 \
    --nproc_per_node 8 \
    train.py \
    --data_root data/long-llm \
    --output_dir model/$output_name \
    --model_name_or_path /mnt/cpfs-data/mashiyao/MODEL/LLaMa3-Instruct \
    --train_data "long-llm:gpt/one_detail_book.train.64K.json long-llm:gpt/one_detail_paper.train.64K.json long-llm:gpt/multi_detail_book.train.json long-llm:gpt/multi_detail_paper_short.train.json long-llm:gpt/multi_detail_paper_long.train.json long-llm:gpt/bio_book.train.json long-llm:longalpaca/train.json long-llm:redpajama/train.json[5000]" \
    --max_length 81920 \
    --group_by_length \
    --rope_theta 200e6 \
    --attn_impl flash_attention_2 \
    --gradient_checkpointing \
    --use_reentrant True \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --save_only_model \
    --save_strategy epoch \
    --logging_steps 5 \
    --bf16 \
    --lora_tune \
    --lora_extra_params embed_tokens \
    --load_in_4_bit \
    --chat_template llama-3
```
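One thing worth checking when only num_train_epochs changes: the Hugging Face Trainer's default learning-rate schedule (linear decay) is stretched over the total step count, so in a multi-epoch run every step, including all of epoch 1, sees a different learning rate than in the 1-epoch run. A minimal pure-Python sketch of that stretching (the step counts are illustrative, and the zero-warmup decay shape is an assumption based on the script setting no warmup flags):

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup=0):
    """Linear decay to zero after an optional warmup, mirroring the shape of
    the Trainer's default 'linear' scheduler (warmup=0 assumed here)."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup))

steps_per_epoch = 100  # illustrative; depends on dataset size and batch size

# The same step halfway through epoch 1, under different num_train_epochs:
lr_1_epoch = linear_lr(50, 1 * steps_per_epoch)  # 1-epoch run: LR already half decayed
lr_3_epoch = linear_lr(50, 3 * steps_per_epoch)  # 3-epoch run: LR still near base_lr
print(lr_1_epoch, lr_3_epoch)
```

So the multi-epoch run effectively trains hotter for longer, which could contribute to the degradation; comparing the loss curves of the two runs over their first epochs would confirm or rule this out.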

disperaller commented 4 months ago

Also, I want to ask: can this training method scale to the 70B LLaMA-3 Instruct model with the same hyper-parameter settings?

namespace-Pt commented 4 months ago

I think it would work. Please report your results here if you'd like to give it a try :)

disperaller commented 3 months ago

> This is a weird issue. I'll try it myself very soon. Could you please share your training log? Is the loss for more than 1 epoch normal?

@namespace-Pt Hi, have you tried finetuning this way and encountered the same issue?