TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0
602 stars 54 forks

loss does not decrease #120

Open wangfengjuan opened 1 week ago

wangfengjuan commented 1 week ago

Hello, thank you very much for sharing your work. In the TinyLLaVA_Factory-main directory, I ran bash ./scripts/train/train_phi.sh and hit a problem: the loss stays around 5 and does not decrease, and the final fine-tuned model performs poorly. The TextVQA evaluation result is only 7.85, which is very strange. I don't know what went wrong. The script I ran and the results are shown in the screenshots below. Looking forward to your reply.

[screenshots: training script, loss logs, and evaluation result]

YingHuTsing commented 1 week ago

Are these pretrain-stage losses? For the Phi-2 LLM, the final loss in the pretrain stage typically reaches about 2.5. In the screenshot you posted, the grad-norm is 0, which indicates the network is not learning: the gradients are 0 and the parameters are no longer being updated. Did you change any hyper-params in pretrain.sh/finetune.sh?
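If you want to check this on your side, a rough diagnostic like the sketch below prints which parameters are trainable and the gradient norm after a backward pass. It assumes a plain PyTorch model object (e.g. the one built in tinyllava/train/train.py) and is easiest to run in a small single-GPU debug setting, since under DeepSpeed ZeRO-3 the gradients are partitioned:

# Rough diagnostic sketch: call after loss.backward() on a small debug batch.
# Assumes a plain PyTorch model object; under DeepSpeed ZeRO-3 the gradients
# are partitioned, so run this in a single-GPU debug setting instead.
import torch

def report_trainable_and_grad_norm(model: torch.nn.Module) -> None:
    trainable = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
    print(f"trainable tensors: {len(trainable)}")
    for name, _ in trainable[:10]:  # print only the first few names
        print("  ", name)

    sq_sum = 0.0
    for _, p in trainable:
        if p.grad is not None:
            sq_sum += float(p.grad.detach().float().norm()) ** 2
    # If this prints 0.0, nothing is being updated, which matches a grad-norm of 0.
    print(f"grad norm: {sq_sum ** 0.5:.4f}")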

wangfengjuan commented 1 week ago

This is the loss in the fine-tuning phase. I only changed the batch_size in the hyperparameters. I used four 3090 GPUs. I don't know where the problem lies.

deepspeed --include localhost:0,1,2,3 --master_port 29501 tinyllava/train/train.py \
    --deepspeed ./scripts/zero3.json \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --is_multimodal True \
    --conv_version $CONV_VERSION \
    --model_name_or_path $LLM_VERSION \
    --vision_tower $VT_VERSION \
    --vision_tower2 '' \
    --connector_type $CN_VERSION \
    --mm_vision_select_layer -2 \
    --image_aspect_ratio square \
    --attn_implementation flash_attention_2 \
    --fp16 True \
    --training_recipe $TRAIN_RECIPE \
    --tune_type_llm lora \
    --tune_type_vision_tower frozen \
    --tune_vision_tower_from_layer 0 \
    --tune_type_connector full \
    --group_by_modality_length True \
    --pretrained_model_path /home/omnisky/userfile_2/wangfj/TinyLLaVA_Factory-main/checkpoints/llava_factory/tiny-llava-${LLM_VARIANT}-${VT_VARIANT}-${VERSION}-pretrain-0929 \
    --output_dir /home/omnisky/userfile_2/wangfj/TinyLLaVA_Factory-main/checkpoints/llava_factory/tiny-llava-${LLM_VARIANT}-${VT_VARIANT}-${VERSION}-finetune-0929 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length $MODEL_MAX_LENGTH \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --tokenizer_use_fast False \
    --run_name /home/omnisky/userfile_2/wangfj/TinyLLaVA_Factory-main/checkpoints/llava_factory/tiny-llava-${LLM_VARIANT}-${VT_VARIANT}-${VERSION}-finetune-0929
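For reference, the effective global batch size these flags give on my 4-GPU machine (a quick side calculation, not part of the original script):

# Effective global batch size implied by the finetune flags above;
# num_gpus comes from --include localhost:0,1,2,3.
per_device_train_batch_size = 4
num_gpus = 4
gradient_accumulation_steps = 4
print(per_device_train_batch_size * num_gpus * gradient_accumulation_steps)  # 64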

YingHuTsing commented 6 days ago

Hi. After pretraining, the initial loss in the finetune stage should start at about 2.5. It seems the problem comes from the pretraining stage. Please provide your params in pretrain.sh, and please also check whether the final loss in your pretrain stage decreased to about 2.5.
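Since you trained with --report_to tensorboard, you can read the logged pretrain loss curve back from the event files with a sketch like the one below. The log directory and the "train/loss" tag are assumptions; check what acc.Tags() reports for your run:

# Read the training loss back from the TensorBoard event files written
# during pretraining. The log directory and the "train/loss" tag are
# assumptions; adjust them to whatever acc.Tags() reports for your run.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = "checkpoints/llava_factory/tiny-llava-phi-pretrain/runs"  # hypothetical path
acc = EventAccumulator(log_dir)
acc.Reload()

print(acc.Tags()["scalars"])            # list the available scalar tags
loss_events = acc.Scalars("train/loss")
print("first logged loss:", loss_events[0].value)
print("last  logged loss:", loss_events[-1].value)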

wangfengjuan commented 4 days ago


Thank you for your reply. The pre-training parameter settings are as follows. The pre-training loss is also stuck around 5 and has not decreased.

deepspeed --include localhost:0,1,2,3 --master_port 29502 tinyllava/train/train.py \
    --deepspeed ./scripts/zero2.json \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --is_multimodal True \
    --conv_version pretrain \
    --model_name_or_path $LLM_VERSION \
    --vision_tower $VT_VERSION \
    --vision_tower2 $VT_VERSION2 \
    --connector_type $CN_VERSION \
    --mm_vision_select_layer -2 \
    --image_aspect_ratio square \
    --attn_implementation flash_attention_2 \
    --fp16 True \
    --training_recipe $TRAIN_RECIPE \
    --tune_type_llm frozen \
    --tune_type_vision_tower frozen \
    --tune_vision_tower_from_layer 0 \
    --tune_type_connector full \
    --output_dir /home/omnisky/userfile_2/wangfj/TinyLLaVA_Factory-main/checkpoints/llava_factory/tiny-llava-${LLM_VARIANT}-${VT_VARIANT}-${VERSION}-pretrain-0929 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-1 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length $MODEL_MAX_LENGTH \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --tokenizer_use_fast False \
    --run_name /home/omnisky/userfile_2/wangfj/TinyLLaVA_Factory-main/checkpoints/llava_factory/tiny-llava-${LLM_VARIANT}-${VT_VARIANT}-${VERSION}-pretrain-0929

YingHuTsing commented 2 days ago

Hi, your learning rate in the pretrain stage is too large. Please set learning_rate to 1e-3.

And are you sure per_device_train_batch_size can be set to 32? I also ran your scripts on a machine with four 3090 GPUs; I had to decrease per_device_train_batch_size to 16 and increase gradient_accumulation_steps to 4 to avoid OOM.
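Note that with four GPUs this keeps the effective global batch size the same as your original setting, as a quick check shows:

# Effective global batch size before and after the adjustment (4 GPUs).
num_gpus = 4
original = 32 * num_gpus * 2   # per_device_train_batch_size 32, grad_accum 2
adjusted = 16 * num_gpus * 4   # per_device_train_batch_size 16, grad_accum 4
print(original, adjusted)      # both 256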

wangfengjuan commented 1 day ago

Thank you very much for your reply. per_device_train_batch_size had to be set to 4 when running on a machine with four 3090 GPUs; otherwise it OOMs. I'll try again with a different learning rate. Thank you!