Closed deepakHonakeri05 closed 1 week ago
Could you please share your full scripts. This bug is usually caused by loading mistralai/Mistral-7B-Instruct-v0.2
via Videollama2LlamaForCausalLM
.
This is the fine-tuning script I'm using
`#!/bin/bash
ARG_WORLD_SIZE=${1:-1} ARG_NPROC_PER_NODE=${2:-8} ARG_MASTER_ADDR="127.0.0.1" ARG_MASTER_PORT=16666 ARG_RANK=0
if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then WORLD_SIZE=$ARG_WORLD_SIZE NPROC_PER_NODE=$ARG_NPROC_PER_NODE fi if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then MASTER_ADDR=$ARG_MASTER_ADDR MASTER_PORT=$ARG_MASTER_PORT RANK=$ARG_RANK fi
echo "WORLD_SIZE: $WORLD_SIZE" echo "NPROC_PER_NODE: $NPROC_PER_NODE"
GLOBAL_BATCH_SIZE=128 LOCAL_BATCH_SIZE=4 GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE$NPROC_PER_NODE$LOCAL_BATCH_SIZE)]
export TRANSFORMERS_OFFLINE=1 export WANDB_PROJECT=videollama2_vllava RUN_NAME=videollama2_vllava_lora DATA_DIR=datasets OUTP_DIR=work_dirs
torchrun --nnodes $WORLD_SIZE \ --nproc_per_node $NPROC_PER_NODE \ --master_addr=$MASTER_ADDR \ --master_port=$MASTER_PORT \ --node_rank $RANK \ videollama2/train_flash_attn.py \ --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \ --deepspeed scripts/zero3.json \ --version v1_mistral \ --vision_tower openai/clip-vit-large-patch14-336 \ --mm_projector_type stc_connector \ --model_name_or_path DAMO-NLP-SG/VideoLLaMA2-7B-16F \ --data_path ${DATA_DIR}/custom_sft/custom.json \ --data_folder ${DATA_DIR}/custom_sft/ \ --pretrain_mm_mlp_adapter DAMO-NLP-SG/VideoLLaMA2-7B-16F-Base/mm_projector.bin \ --mm_vision_select_layer -2 \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --image_aspect_ratio pad \ --num_frames 8 \ --bf16 True \ --tf32 True \ --fp16 False \ --output_dir ${OUTP_DIR}/${WANDBPROJECT}/finetune${RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size $LOCAL_BATCH_SIZE \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 500 \ --save_total_limit 99 \ --learning_rate 2e-5 \ --weight_decay 0. \ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --model_max_length 2048 \ --gradient_checkpointing True \ --dataloader_num_workers 4 \ --report_to tensorboard \ --run_name $RUN_NAME \ `
It seems that you wish to continue fine-tuning the existing model. Please git pull origin main:main
and use such a scripts command:
#!/bin/bash
# Environment Variables
ARG_WORLD_SIZE=${1:-1}
ARG_NPROC_PER_NODE=${2:-8}
ARG_MASTER_ADDR="127.0.0.1"
ARG_MASTER_PORT=16666
ARG_RANK=0
# Multiple conditions
if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
WORLD_SIZE=$ARG_WORLD_SIZE
NPROC_PER_NODE=$ARG_NPROC_PER_NODE
fi
if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
MASTER_ADDR=$ARG_MASTER_ADDR
MASTER_PORT=$ARG_MASTER_PORT
RANK=$ARG_RANK
fi
echo "WORLD_SIZE: $WORLD_SIZE"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"
# Training Arguments
GLOBAL_BATCH_SIZE=128
LOCAL_BATCH_SIZE=4
GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
# Log Arguments
export TRANSFORMERS_OFFLINE=1
export WANDB_PROJECT=videollama2_vllava
RUN_NAME=videollama2_vllava_lora_debug
DATA_DIR=datasets
OUTP_DIR=work_dirs
torchrun --nnodes $WORLD_SIZE \
--nproc_per_node $NPROC_PER_NODE \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
--node_rank $RANK \
videollama2/train_flash_attn.py \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed scripts/zero3.json \
--version v1_mistral \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_projector_type stc_connector \
--model_name_or_path DAMO-NLP-SG/VideoLLaMA2-7B-16F \
--data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
--data_folder ${DATA_DIR}/videollava_sft/ \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--num_frames 16 \
--bf16 True \
--tf32 True \
--fp16 False \
--output_dir ${OUTP_DIR}/${WANDB_PROJECT}/finetune_${RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size $LOCAL_BATCH_SIZE \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 99 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--report_to tensorboard \
--run_name $RUN_NAME \
Thank you very much for the updated script file. It worked!