Open · white2018 opened this issue 2 months ago
Hi~ The default is to run with eight GPUs. If you use two GPUs, you need to set --nproc_per_node to 2.
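Roughly, a 2-GPU setup would look like the sketch below. This assumes the usual InternVL-style launcher variables (GPUS, BATCH_SIZE, PER_DEVICE_BATCH_SIZE, GRADIENT_ACC); the batch-size and port values are only illustrative, so adjust them to your own script:

```shell
# Sketch of a 2-GPU launch; only the GPU count / visible devices change from the 8-GPU default.
# BATCH_SIZE, PER_DEVICE_BATCH_SIZE and MASTER_PORT are illustrative values, not prescriptions.
GPUS=2
MASTER_PORT=20135
BATCH_SIZE=16                                                # target global batch size
PER_DEVICE_BATCH_SIZE=2
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS))  # = 4 with two GPUs

CUDA_VISIBLE_DEVICES=0,1 torchrun \
  --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 \
  --nproc_per_node=${GPUS} --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_finetune.py \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  ...   # remaining finetune arguments unchanged
```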
The training command looks like:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 \
  --nproc_per_node=2 --master_port=20135 \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path models/OpenGVLab/InternVL2-2B \
  --conv_style internlm2-chat \
  --output_dir minimonkey_chat_lora \
  --meta_path shell/data/train-finetune.json \
  --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 \
  --down_sample_ratio 0.5 --drop_path_rate 0.0 \
  --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 \
  --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 \
  --per_device_train_batch_size 2 --gradient_accumulation_steps 4 \
  --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 \
  --learning_rate 4e-6 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type cosine \
  --logging_steps 1 --max_seq_length 4096 --do_train True --grad_checkpoint True \
  --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 \
  --deepspeed zero_stage1_config.json --report_to tensorboard
which leads to a core dump; the crash snapshot is as follows:
Do you encounter this issue when using zero_stage3_config.json?
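If you want to try that, only the --deepspeed flag needs to change, assuming a zero_stage3_config.json is available next to the stage-1 config (ZeRO-3 additionally shards the model parameters across the two GPUs, so it uses less memory per GPU):

```shell
# Same launch as before; only the DeepSpeed config is swapped.
torchrun --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 \
  --nproc_per_node=2 --master_port=20135 \
  internvl/train/internvl_chat_finetune.py \
  ... \
  --deepspeed "zero_stage3_config.json" \
  --report_to "tensorboard"
```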
@Yuliang-Liu Nice work! I ran into a finetuning issue, as follows.
Two NVIDIA A800 80G GPUs are used during training. The finetune script looks like:

CUDA_VISIBLE_DEVICES=$gpu torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=0.0.0.0 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "models/OpenGVLab/InternVL2-2B" \
  --conv_style "internlm2-chat" \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "shell/data/train-finetune.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 6 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.0 \
  --freeze_llm True \
  --freeze_mlp True \
  --freeze_backbone True \
  --use_llm_lora 16 \
  --vision_select_layer -1 \
  --dataloader_num_workers 4 \
  --bf16 True \
  --num_train_epochs 1 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --learning_rate 4e-6 \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 4096 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard"
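For reference, the shell variables in this run expand to the values below (the same ones visible in the expanded command traced earlier in this thread), so the effective global batch size is 2 GPUs × 2 per device × 4 accumulation steps = 16:

```shell
# Variable values for this run (matching the traced command above).
gpu=0,1
GPUS=2
MASTER_PORT=20135
OUTPUT_DIR=minimonkey_chat_lora
PER_DEVICE_BATCH_SIZE=2
GRADIENT_ACC=4
```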
Could you please give me some clues to fix it? Thanks a lot!