Open · white2018 opened this issue 2 months ago
Hi~ The default is to run with eight GPUs. If you use two GPUs, you need to set --nproc_per_node to 2.
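Roughly, a 2-GPU setup would look like the sketch below. This assumes the usual InternVL-style launcher variables (GPUS, BATCH_SIZE, PER_DEVICE_BATCH_SIZE, GRADIENT_ACC); the batch-size and port values are only illustrative, so adjust them to your own script:

```shell
# Sketch of a 2-GPU launch; only the GPU count / visible devices change from the 8-GPU default.
# BATCH_SIZE, PER_DEVICE_BATCH_SIZE and MASTER_PORT are illustrative values, not prescriptions.
GPUS=2
MASTER_PORT=20135
BATCH_SIZE=16                                                # target global batch size
PER_DEVICE_BATCH_SIZE=2
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS))  # = 4 with two GPUs

CUDA_VISIBLE_DEVICES=0,1 torchrun \
  --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 \
  --nproc_per_node=${GPUS} --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_finetune.py \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  ...   # remaining finetune arguments unchanged
```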
The training command looks like:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 \
  --nproc_per_node=2 --master_port=20135 \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path models/OpenGVLab/InternVL2-2B \
  --conv_style internlm2-chat \
  --output_dir minimonkey_chat_lora \
  --meta_path shell/data/train-finetune.json \
  --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 \
  --down_sample_ratio 0.5 --drop_path_rate 0.0 \
  --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 \
  --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 \
  --per_device_train_batch_size 2 --gradient_accumulation_steps 4 \
  --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 \
  --learning_rate 4e-6 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type cosine \
  --logging_steps 1 --max_seq_length 4096 --do_train True --grad_checkpoint True \
  --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 \
  --deepspeed zero_stage1_config.json --report_to tensorboard
which leads to a core dump; the crash snapshot is as follows:
Do you encounter this issue when using zero_stage3_config.json?
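If you want to try that, only the --deepspeed flag needs to change, assuming a zero_stage3_config.json is available next to the stage-1 config (ZeRO-3 additionally shards the model parameters across the two GPUs, so it uses less memory per GPU):

```shell
# Same launch as before; only the DeepSpeed config is swapped.
torchrun --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 \
  --nproc_per_node=2 --master_port=20135 \
  internvl/train/internvl_chat_finetune.py \
  ... \
  --deepspeed "zero_stage3_config.json" \
  --report_to "tensorboard"
```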
@Yuliang-Liu Nice work! I ran into a finetuning issue, as follows.
Two NVIDIA A800 80G GPUs are used during training. The finetune script looks like:

CUDA_VISIBLE_DEVICES=$gpu torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=0.0.0.0 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "models/OpenGVLab/InternVL2-2B" \
  --conv_style "internlm2-chat" \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "shell/data/train-finetune.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 6 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.0 \
  --freeze_llm True \
  --freeze_mlp True \
  --freeze_backbone True \
  --use_llm_lora 16 \
  --vision_select_layer -1 \
  --dataloader_num_workers 4 \
  --bf16 True \
  --num_train_epochs 1 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --learning_rate 4e-6 \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 4096 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard"
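For reference, the shell variables in this run expand to the values below (the same ones visible in the expanded command traced earlier in this thread), so the effective global batch size is 2 GPUs × 2 per device × 4 accumulation steps = 16:

```shell
# Variable values for this run (matching the traced command above).
gpu=0,1
GPUS=2
MASTER_PORT=20135
OUTPUT_DIR=minimonkey_chat_lora
PER_DEVICE_BATCH_SIZE=2
GRADIENT_ACC=4
```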
Could you please give me some clues to fix it? Thanks a lot!