I also encountered the same problem.
This issue has been fixed by commit a57c093. You can pull the latest version of the fine-tuning code to check whether the problem still exists.
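For reference, one way to update a local clone and confirm that the fix commit is present (the repository directory and branch name below are assumptions; adjust them to your checkout):

```bash
# Update a local clone and verify that the fix commit is an ancestor of HEAD.
# "InternLM-XComposer" and "main" are assumed names; adjust to your setup.
cd InternLM-XComposer
git fetch origin
git checkout main
git pull --ff-only origin main
git merge-base --is-ancestor a57c093 HEAD && echo "commit a57c093 is included"
```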
Hi, when I use 2 machines with 8xA100 each, the loss still becomes 0:
{'loss': 1.7172, 'learning_rate': 7.874015748031497e-08, 'epoch': 0.0}
{'loss': 1.7636, 'learning_rate': 1.5748031496062994e-07, 'epoch': 0.0}
{'loss': 1.7636, 'learning_rate': 1.5748031496062994e-07, 'epoch': 0.0}
{'loss': 1.7367, 'learning_rate': 2.362204724409449e-07, 'epoch': 0.0}
{'loss': 1.7367, 'learning_rate': 2.362204724409449e-07, 'epoch': 0.0}
{'loss': 1.7638, 'learning_rate': 3.149606299212599e-07, 'epoch': 0.0}
{'loss': 1.7638, 'learning_rate': 3.149606299212599e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 3.937007874015748e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 3.937007874015748e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 4.724409448818898e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 4.724409448818898e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 5.511811023622048e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 5.511811023622048e-07, 'epoch': 0.0}
I launch training with the following script:
#!/bin/bash
# NCCL settings: verbose logging, peer-to-peer and InfiniBand transports disabled, sockets over eth0
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
DIR=`pwd`
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
# torchrun rendezvous arguments; WORLD_SIZE, RANK, MASTER_ADDR and MASTER_PORT are expected from the environment
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $WORLD_SIZE \
--node_rank $RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
--model_name_or_path ./internlm-xcomposer2-vl-7b \
--data_path $DATA \
--img_size 490 \
--bf16 True \
--fix_vit False \
--fix_sampler False \
--use_lora False \
--output_dir output/xxx \
--num_train_epochs 1 \
--batch_size 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 15 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--max_length 4096 \
--deepspeed ds_config_zero2.json \
--gradient_checkpointing True
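The run points DeepSpeed at ds_config_zero2.json. For context, below is a minimal sketch of what a ZeRO-2 config commonly looks like with the HF Trainer "auto" placeholders; it is illustrative only and may differ from the file actually shipped in the repo:

```bash
# Illustrative ZeRO-2 config (HF Trainer "auto" style); written to an example
# filename so it does not clobber the repo's real ds_config_zero2.json.
cat > ds_config_zero2.example.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "allgather_bucket_size": 5e8,
    "reduce_bucket_size": 5e8
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF
```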
@yuhangzang @myownskyW7 Looking forward to your reply, thank you!
This is a bug related to DeepSpeed ZeRO-2; you can refer to this issue: https://github.com/haotian-liu/LLaVA/issues/1231
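As a hypothetical debugging step (an assumption on my part, not a confirmed fix), it may also help to record the DeepSpeed and PyTorch versions on every node, since reports like the one linked above tie the zeroed-out loss to the ZeRO-2 path in specific environments:

```bash
# Hypothetical debugging step: print the DeepSpeed and torch versions on each node.
python -c "import deepspeed, torch; print('deepspeed', deepspeed.__version__, '| torch', torch.__version__)"
```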
When I'm training internlm-xcomposer2-7b:
![image](https://github.com/InternLM/InternLM-XComposer/assets/84087448/7d1965d0-ede2-49c6-bebe-a6e4b7532abc)
When I'm training internlm-xcomposer2-vl-7b:
![image](https://github.com/InternLM/InternLM-XComposer/assets/84087448/889f213e-6660-4694-9634-3e0f05b1843e)