OpenLMLab / LOMO

LOMO: LOw-Memory Optimization
MIT License

Model performance after full fine-tuning with LOMOTrainer #25

Open dat-browny opened 1 year ago

dat-browny commented 1 year ago

I have fully fine-tuned my model with LOMO. In more detail, I'm using bloomz-7b1-mt as the backbone and fine-tuning it on an Alpaca instruction dataset. I'm using my own data-processing pipeline and just replaced the Trainer with your LOMOTrainer. However, the results I get with your repository are quite bad: at inference the model generates many garbled characters, while full fine-tuning or LoRA done the normal way does not, and I'm sure those characters are not in my training dataset. I think the problem is in the optimizer and the training config. Could you take a look at my training script?

# my config: --deepspeed through --model_max_length
# your config (taken from the sample args): --do_train through --logging_steps
CUDA_VISIBLE_DEVICES=0,1 WANDB_DISABLED=True deepspeed --master_port=19121 train.py \
    --deepspeed config/ds_config.json \
    --model_name_or_path bigscience/bloomz-7b1-mt \
    --data_path /home/jovyan/vol-1/dat/data/alpaca/alpaca_vi_expert_conversation.json \
    --output_dir ~/vol-1/dat/checkpoints/bloom-v5.2-lomo-full-ft \
    --model_max_length 1024 \
    --do_train True \
    --do_eval False \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_steps 10000 \
    --save_total_limit 10 \
    --learning_rate 3e-2 \
    --weight_decay 0 \
    --warmup 0.01 \
    --lr_scheduler_type "linear" \
    --clip_grad_norm 1.0 \
    --logging_steps 100

I'm not using your train.py exactly as-is; I only modified the data-processing pipeline and replaced your 'DataArgument' and 'ModelArgument' with my own, but kept 'MyTrainingArgument' for the LOMOTrainer. And here is my DeepSpeed config file:

{
    "bf16": {
        "enabled": false
    },
    "fp16": {
        "enabled": true
    },
    "zero_allow_untested_optimizer": true,
    "zero_force_ds_cpu_optimizer": false,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
             "device": "cpu",
             "pin_memory": true
         },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e8,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": 1,
    "steps_per_print": 2000,
    "train_micro_batch_size_per_gpu": 2,
    "wall_clock_breakdown": false
}

The reason I don't add an evaluation dataset to track training right away is that instruction-following fine-tuning doesn't have a clear benchmark. For all the important settings (learning rate, weight_decay, lr_scheduler_type, ...) I took the values from your sample config files, but I see that args_lomo.yaml doesn't set the optimizer, while args_lomo_lora.yaml sets it to 'SGD'. So my understanding is: when training without LoRA you use the default optimizer of Seq2SeqArguments, which is Adam, and when switching to LOMO + LoRA it becomes SGD for the pretrained weights and AdamW for the LoRA weights. Am I understanding that right?

The final question: could you help me figure out why full fine-tuning produces so many garbled characters, and help me tune my training config? I'm using 2 A100 40GB GPUs.

QipengGuo commented 1 year ago

Hi, can you provide more training details, for example the training loss curve? If the loss jitters a lot, the best choice is probably to use a lower learning rate, say 1e-3. One possible verification is to check the data preprocessing: you can print the input_ids in the model's forward function and decode them back to natural language. Another check is to load the model and run the eval step without any tuning; this is often a good way to detect data preprocessing and generation errors.
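
A minimal sketch of that check, in case it helps (this assumes a standard Hugging Face tokenizer and a PyTorch DataLoader; `train_dataloader` is a placeholder for your own loader, not something from this repo):

from transformers import AutoTokenizer

def show_first_batch(train_dataloader, tokenizer):
    # Decode one batch back to text so you can see exactly what the model is
    # trained on, including special tokens, padding, and truncation.
    batch = next(iter(train_dataloader))
    for ids in batch["input_ids"]:
        print(tokenizer.decode(ids, skip_special_tokens=False))

# Usage sketch:
# tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-7b1-mt")
# show_first_batch(train_dataloader, tokenizer)  # train_dataloader: your own DataLoader
# If you mask prompt tokens in the labels with -100, also check that the
# unmasked part decodes to the expected response text.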

dat-browny commented 1 year ago

Hmm, I have disabled wandb, so I can't show you my loss curve. Do you think 1e-3 is still too high for a learning rate? When I fine-tune, it's usually around 1e-5. As for re-checking the preprocessing function, I will try it and update the results here.

QipengGuo commented 1 year ago

1e-5 is common for Adam. The learning-rate scale for Adam is often much smaller than for SGD, since Adam rescales its updates using its moment estimates, so its nominal learning rate does not carry over to SGD directly.
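
To make the scale difference concrete, here is a small self-contained comparison (my own illustration, not anything from the repo): Adam's per-parameter step is roughly lr * m_hat / (sqrt(v_hat) + eps), which stays on the order of lr regardless of the gradient's magnitude, while plain SGD's step is lr * grad, so SGD usually needs a much larger nominal learning rate when gradients are small.

import torch

# One scalar parameter, one step, to compare update magnitudes.
grad = torch.tensor([1e-3])  # a typical small gradient

for lr, opt_cls in [(1e-5, torch.optim.Adam), (1e-2, torch.optim.SGD)]:
    p = torch.nn.Parameter(torch.zeros(1))
    opt = opt_cls([p], lr=lr)
    p.grad = grad.clone()
    opt.step()
    print(f"{opt_cls.__name__} (lr={lr}): step size = {p.detach().abs().item():.2e}")

# Adam's first step is ~lr (1e-5) because the gradient is normalized by
# sqrt(v_hat); SGD's step is lr * grad = 1e-2 * 1e-3 = 1e-5. Similar step
# sizes, very different nominal learning rates.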

dat-browny commented 1 year ago

As I mentioned above, I see that your full fine-tuning config doesn't set the optimizer for this case. So in the end, when I do full fine-tuning without LoRA, do I have to set the optimizer to SGD, since the default is Adam? And if I set the optimizer to SGD, will training be faster?

QipengGuo commented 1 year ago

Please check this line, https://github.com/OpenLMLab/LOMO/blob/main/src/train_lomo.py#L107
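
For anyone else landing on this thread: that line is where the training script builds its optimizer itself, so the usual optim field in the training arguments is not what drives full fine-tuning. Conceptually, LOMO fuses an SGD-style update into the backward pass so a full set of gradients never has to be stored. A rough sketch of that idea (my own paraphrase of the technique, not the repo's code; attach_fused_sgd_hooks is a made-up helper name, and the real implementation also handles gradient clipping/normalization and mixed precision):

import torch

def attach_fused_sgd_hooks(model, lr):
    # Update each parameter as soon as its gradient is produced during
    # backward, then hand back a zero gradient so nothing useful piles up
    # in p.grad. This avoids keeping all gradients in memory at once.
    for p in model.parameters():
        if not p.requires_grad:
            continue
        def hook(grad, p=p):
            with torch.no_grad():
                p.add_(grad, alpha=-lr)
            return torch.zeros_like(grad)
        p.register_hook(hook)

# Usage sketch:
# attach_fused_sgd_hooks(model, lr=3e-2)
# loss = model(**batch).loss
# loss.backward()  # parameters are updated on the fly during this call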

AibibulaAtawula commented 1 year ago

@dat-browny Hello bro. I'm doing similar work to yours. Can we discuss the training details further?

dat-browny commented 1 year ago

> @dat-browny Hello bro. I'm doing similar work to yours. Can we discuss the training details further?

Okay, can you tell me what your problem is?

AibibulaAtawula commented 1 year ago

I want to use the optimizer in another project of mine, but I failed; I guess the reason is that I don't understand the related code.

dat-browny commented 1 year ago

In my experience, the problem is the learning rate. It doesn't scale the way Adam's does, so you have to choose one that suits your training target. I checked all of my checkpoints for every learning rate I tried. The best checkpoint, in my opinion, is the first-epoch checkpoint at learning rate = 3e-2; with that learning rate, all later checkpoints (epoch 2, epoch 3) generate many garbled characters, as I said in the first comment.

I have tried two other learning rates, 1e-3 and 1e-5. It seems even 1e-3 is still too small, because at the third-epoch checkpoint the model still hasn't learned the pattern of the training dataset.

In conclusion, depending on your training objective you have to choose a suitable learning rate somewhere around [1e-3, 3e-2]; you could try about 5e-3 to 7e-3. Training this task took me a lot of time, so I've dropped the idea of training with LOMO. But if you find a method that works significantly better, please suggest it so I can do the same.
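
If anyone wants to sweep the learning rate more systematically instead of one run at a time, a small driver script along these lines could help (purely illustrative; the flags just mirror the command earlier in this thread, and you would add the remaining ones from your full command):

import subprocess

# Candidate learning rates in the range discussed above.
learning_rates = ["3e-2", "7e-3", "5e-3", "1e-3"]

base_cmd = [
    "deepspeed", "--master_port=19121", "train.py",
    "--deepspeed", "config/ds_config.json",
    "--model_name_or_path", "bigscience/bloomz-7b1-mt",
    "--num_train_epochs", "1",  # one epoch is usually enough to compare loss curves
    # ...plus the remaining flags from the full command above
]

for lr in learning_rates:
    cmd = base_cmd + ["--learning_rate", lr, "--output_dir", f"checkpoints/lomo-lr-{lr}"]
    print("Launching:", " ".join(cmd))
    subprocess.run(cmd, check=True)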