#!/bin/bash

export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=$(pwd)

export MODEL="/workspace/model_weight/internlm-xcomposer2-vl-7b"
export DATA="data.txt"

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --img_size 490 \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler False \
    --use_lora False \
    --output_dir output/test \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 1024 \
    --deepspeed ds_config_zero2.json \
    --gradient_checkpointing True

The training data was constructed in the specified format.

Training log:
{'loss': 0.0, 'learning_rate': 8.333333333333333e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.6666666666666667e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 2.5e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 3.3333333333333333e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 4.166666666666667e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 5e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 5.833333333333334e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 8.333333333333334e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.166666666666666e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 1e-05, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.999981599807402e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.99992639936503e-06, 'epoch': 0.04}
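The learning-rate column in this log can be checked against the launch flags. The sketch below assumes a linear warmup followed by cosine decay, which is what `--lr_scheduler_type "cosine"` with `--warmup_ratio 0.01` suggests. `WARMUP_STEPS` is read off the log (the rate peaks at the 12th entry), and `TOTAL_STEPS = 1170` is an inference from the logged values, not a number stated anywhere in the thread. If the sketch reproduces the log, the scheduler is behaving as configured and the zero loss has to come from somewhere else.

```python
import math

PEAK_LR = 1e-5       # --learning_rate from the launch script
WARMUP_STEPS = 12    # read off the log: the rate peaks at the 12th logged step
TOTAL_STEPS = 1170   # assumption: inferred from the logged values, not stated in the thread


def lr_at(step):
    """Learning rate after `step` optimizer steps: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Compare with the 1st, 6th, 12th, 13th and 14th entries of the log.
for step in (1, 6, 12, 13, 14):
    print(step, lr_at(step))
```

Under the same assumption, 1170 steps over 3 epochs is 390 steps per epoch, which matches the `epoch` column (for example, step 12 is 12/390, about 0.03).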

yuhangzang commented 6 months ago

This issue may be caused by this line; please check your data format to avoid it.

xienan0326 commented 6 months ago


How should I change the training data format?

{
  "id": "0",
  "image": [
    "/workspace/VL-Data/images/wKjBzVc_ukiAa5D4AATqsCZTbKg837.jpg"
  ],
  "conversations": [
    { "from": "user", "value": " 图中是什么？" },
    { "from": "assistant", "value": "图中是NBA球星勒布朗.詹姆斯。" },
    { "from": "user", "value": "介绍一下" },
    { "from": "assistant", "value": "..." }
  ]
}

xienan0326 commented 6 months ago


Doesn’t it support multiple rounds of dialogue with one picture?

yuhangzang commented 6 months ago

Can you comment out this line and these two lines, then retry to see whether the issue persists?

xienan0326 commented 6 months ago

Still getting the error.

I changed batch_size to 1 and changed the training data to:

{
  "id": "0",
  "image": ["/workspace/VL-Data/images/wKjBzVc_ukiAa5D4AATqsCZTbKg837.jpg"],
  "conversations": [
    { "from": "user", "value": " 图中是什么？" },
    { "from": "assistant", "value": "图中是NBA球星勒布朗.詹姆斯。" }
  ]
}

Debug prints:
text: ['[UNUSED_TOKEN_146]user\n 图中是什么？[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n图中是NBA球星勒布朗.詹姆斯。[UNUSED_TOKEN_145]\n']
len(batch['image']): 1
len(batch['text_input']): 1
batch['data_type']: ['multi']

WeiminLee commented 4 months ago


Doesn't the "value" field need a placeholder?
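One way to test the placeholder question before launching a run is a small format check on each sample. The sketch below is only an illustration: `check_sample` is a hypothetical helper, the rule "one placeholder per image in the user turns" is an assumption rather than something confirmed in this thread, and the placeholder string `<ImageHere>` is likewise assumed and should be checked against the repository's fine-tuning documentation.

```python
def check_sample(sample, placeholder):
    """Return a list of problems found in one training sample (empty if none)."""
    problems = []
    images = sample.get("image", [])
    turns = sample.get("conversations", [])
    if not turns:
        problems.append("no conversations")
    # Turns are expected to alternate user, assistant, user, assistant, ...
    for i, turn in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected role {expected!r}, got {turn.get('from')!r}")
    if turns and turns[-1].get("from") != "assistant":
        problems.append("last turn is not from the assistant")
    # Assumed rule: each image needs exactly one placeholder in the user turns.
    n_placeholders = sum(
        turn.get("value", "").count(placeholder)
        for turn in turns
        if turn.get("from") == "user"
    )
    if n_placeholders != len(images):
        problems.append(f"{len(images)} image(s) but {n_placeholders} placeholder(s)")
    return problems


# The single-round sample posted in this thread, copied as shown.
sample = {
    "id": "0",
    "image": ["/workspace/VL-Data/images/wKjBzVc_ukiAa5D4AATqsCZTbKg837.jpg"],
    "conversations": [
        {"from": "user", "value": " 图中是什么？"},
        {"from": "assistant", "value": "图中是NBA球星勒布朗.詹姆斯。"},
    ],
}

print(check_sample(sample, "<ImageHere>"))  # ['1 image(s) but 0 placeholder(s)']
```

The sample as it appears here has one image and no placeholder text in the user turn, so under the assumed rule the check flags it.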

nzomi commented 3 months ago

I encountered a similar issue before, which was resolved by using a larger max_length.
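One plausible reading of why a larger max_length helps, not confirmed in this thread: if the prompt and image positions are masked out of the loss and the sequence is truncated before the answer tokens, no supervised token is left in the sample. The sketch below illustrates that with made-up numbers. The sequence layout (1200 masked positions followed by 20 answer tokens) is hypothetical, and `-100` is assumed as the ignore value because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def supervised_tokens_after_truncation(labels, max_length):
    """Count label positions that still contribute to the loss after truncation."""
    return sum(1 for label in labels[:max_length] if label != IGNORE_INDEX)


# Hypothetical sample: 1200 masked prompt/image positions, then 20 answer tokens.
labels = [IGNORE_INDEX] * 1200 + list(range(20))

print(supervised_tokens_after_truncation(labels, 1024))  # 0
print(supervised_tokens_after_truncation(labels, 2048))  # 20
```

How a training loop reports a step with zero supervised tokens depends on its loss implementation, so this only shows the condition worth checking, not that it is the cause here.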

InternLM / InternLM-XComposer

Loss is always 0 #213
