DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

AttributeError: 'MistralConfig' object has no attribute 'attention_bias' while fine-tuning lora.sh #40

Closed deepakHonakeri05 closed 1 week ago

deepakHonakeri05 commented 1 week ago
`root@ad966f70d032:/workspace/upvllama/VideoLLaMA2#` sh scripts/custom/finetune_lora.sh
[2024-07-08 09:54:08,665] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-08 09:54:09,069] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-08 09:54:09,069] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-07-08 09:54:09,437] [INFO] [partition_parameters.py:349:__exit__] finished initializing model - num_params = 1, num_elems = 0.13B
Traceback (most recent call last):
  File "/workspace/upvllama/VideoLLaMA2/videollama2/train_flash_attn.py", line 12, in <module>
    train(attn_implementation="flash_attention_2")
  File "/workspace/upvllama/VideoLLaMA2/./videollama2/train.py", line 716, in train
    model = Videollama2LlamaForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3550, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 509, in wrapper
    f(module, *args, **kwargs)
  File "/workspace/upvllama/VideoLLaMA2/./videollama2/model/language_model/videollama2_llama.py", line 46, in __init__
    self.model = Videollama2LlamaModel(config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 509, in wrapper
    f(module, *args, **kwargs)
  File "/workspace/upvllama/VideoLLaMA2/./videollama2/model/language_model/videollama2_llama.py", line 38, in __init__
    super(Videollama2LlamaModel, self).__init__(config)
  File "/workspace/upvllama/VideoLLaMA2/./videollama2/model/videollama2_arch.py", line 33, in __init__
    super(Videollama2MetaModel, self).__init__(config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 509, in wrapper
    f(module, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 927, in __init__
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 927, in <listcomp>
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 509, in wrapper
    f(module, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 700, in __init__
    self.self_attn = LLAMA_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 509, in wrapper
    f(module, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 413, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 509, in wrapper
    f(module, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 287, in __init__
    self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 263, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'MistralConfig' object has no attribute 'attention_bias'
[2024-07-08 09:54:12,633] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 17716) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
videollama2/train_flash_attn.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-08_09:54:12
  host      : ad966f70d032
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 17716)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I'm currently using the latest version of the code along with the required versions of the Python packages mentioned in the README.
I'm fine-tuning on a custom dataset with LoRA, and my model path is DAMO-NLP-SG/VideoLLaMA2-7B-16F.
clownrat6 commented 1 week ago

Could you please share your full script? This bug is usually caused by loading mistralai/Mistral-7B-Instruct-v0.2 via Videollama2LlamaForCausalLM.
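
For context, the traceback reduces to a config/class mismatch: the checkpoint resolves to a MistralConfig, but the Llama-based wrapper builds LlamaAttention layers, whose constructor reads config.attention_bias, a field that LlamaConfig defines and MistralConfig (at the transformers version used here) does not. A minimal sketch using only the plain Hugging Face config classes, not the VideoLLaMA2 code:

# Minimal sketch of the mismatch (plain transformers config classes only).
# LlamaAttention reads config.attention_bias, which LlamaConfig defines but
# MistralConfig does not at the transformers version pinned by this repo,
# hence the AttributeError in the traceback above.
from transformers import LlamaConfig, MistralConfig

print(hasattr(LlamaConfig(), "attention_bias"))    # True
print(hasattr(MistralConfig(), "attention_bias"))  # False -> the failing lookup

Presumably the updated code on main routes Mistral-based checkpoints through the matching Mistral model classes instead, which is why pulling the latest code (next comment) resolves this.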

deepakHonakeri05 commented 1 week ago

This is the fine-tuning script I'm using:

#!/bin/bash

# Environment Variables
ARG_WORLD_SIZE=${1:-1}
ARG_NPROC_PER_NODE=${2:-8}
ARG_MASTER_ADDR="127.0.0.1"
ARG_MASTER_PORT=16666
ARG_RANK=0

# Multiple conditions
if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
    WORLD_SIZE=$ARG_WORLD_SIZE
    NPROC_PER_NODE=$ARG_NPROC_PER_NODE
fi
if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
    MASTER_ADDR=$ARG_MASTER_ADDR
    MASTER_PORT=$ARG_MASTER_PORT
    RANK=$ARG_RANK
fi

echo "WORLD_SIZE: $WORLD_SIZE"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"

# Training Arguments
GLOBAL_BATCH_SIZE=128
LOCAL_BATCH_SIZE=4
GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]

# Log Arguments
export TRANSFORMERS_OFFLINE=1
export WANDB_PROJECT=videollama2_vllava
RUN_NAME=videollama2_vllava_lora
DATA_DIR=datasets
OUTP_DIR=work_dirs

torchrun --nnodes $WORLD_SIZE \
    --nproc_per_node $NPROC_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --node_rank $RANK \
    videollama2/train_flash_attn.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed scripts/zero3.json \
    --version v1_mistral \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type stc_connector \
    --model_name_or_path DAMO-NLP-SG/VideoLLaMA2-7B-16F \
    --data_path ${DATA_DIR}/custom_sft/custom.json \
    --data_folder ${DATA_DIR}/custom_sft/ \
    --pretrain_mm_mlp_adapter DAMO-NLP-SG/VideoLLaMA2-7B-16F-Base/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --num_frames 8 \
    --bf16 True \
    --tf32 True \
    --fp16 False \
    --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/finetune_${RUN_NAME} \
    --num_train_epochs 1 \
    --per_device_train_batch_size $LOCAL_BATCH_SIZE \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 99 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --report_to tensorboard \
    --run_name $RUN_NAME \

clownrat6 commented 1 week ago

It seems that you wish to continue fine-tuning the existing model. Please git pull origin main:main and use the following script:

#!/bin/bash

# Environment Variables
ARG_WORLD_SIZE=${1:-1}
ARG_NPROC_PER_NODE=${2:-8}
ARG_MASTER_ADDR="127.0.0.1"
ARG_MASTER_PORT=16666
ARG_RANK=0

# Multiple conditions
if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
    WORLD_SIZE=$ARG_WORLD_SIZE
    NPROC_PER_NODE=$ARG_NPROC_PER_NODE
fi
if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
    MASTER_ADDR=$ARG_MASTER_ADDR
    MASTER_PORT=$ARG_MASTER_PORT
    RANK=$ARG_RANK
fi

echo "WORLD_SIZE: $WORLD_SIZE"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"

# Training Arguments
GLOBAL_BATCH_SIZE=128
LOCAL_BATCH_SIZE=4
GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]

# Log Arguments
export TRANSFORMERS_OFFLINE=1
export WANDB_PROJECT=videollama2_vllava
RUN_NAME=videollama2_vllava_lora_debug
DATA_DIR=datasets
OUTP_DIR=work_dirs

torchrun --nnodes $WORLD_SIZE \
    --nproc_per_node $NPROC_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --node_rank $RANK \
    videollama2/train_flash_attn.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed scripts/zero3.json \
    --version v1_mistral \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type stc_connector \
    --model_name_or_path DAMO-NLP-SG/VideoLLaMA2-7B-16F \
    --data_path   ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
    --data_folder ${DATA_DIR}/videollava_sft/ \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --num_frames 16 \
    --bf16 True \
    --tf32 True \
    --fp16 False \
    --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/finetune_${RUN_NAME} \
    --num_train_epochs 1 \
    --per_device_train_batch_size $LOCAL_BATCH_SIZE \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 99 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --report_to tensorboard \
    --run_name $RUN_NAME \
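
For reference, the script's first two positional arguments override the single-node defaults, e.g. `sh scripts/custom/finetune_lora.sh 1 8` runs 1 node with 8 processes per node; WORLD_SIZE, NPROC_PER_NODE, MASTER_ADDR, MASTER_PORT, and RANK can also be preset as environment variables, since the script only falls back to the ARG_* values when they are unset.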
deepakHonakeri05 commented 1 week ago

Thank you very much for the updated script file. It worked!