DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

how to finetune Videollama2 chat models using QLoRA and LoRA. #58

Open thisurawz1 opened 3 months ago

thisurawz1 commented 3 months ago

how to finetune Videollama2 chat models using QLoRA and LoRA.

... --data_path datasets/custom_sft/custom.json --data_folder datasets/custom_sft/ --pretrain_mm_mlp_adapter CONNECTOR_DOWNLOAD_PATH (e.g., DAMO-NLP-SG/VideoLLaMA2-7B-Base) ...

Here you have mentioned only the Base models, not the chat checkpoints.

sjghh commented 1 month ago

Has this been implemented already? Thank you for your response.

thisurawz1 commented 1 month ago

In the QLoRA script you can change the --pretrain_mm_mlp_adapter path to the chat checkpoint as shown below. You also have to include your Hugging Face read token so that the Mistral model can be downloaded.

#!/bin/bash

# Log in to Hugging Face so the gated Mistral model can be downloaded:
# either run `huggingface-cli login` once beforehand, or export a read token
# here (HF_TOKEN is picked up by recent versions of huggingface_hub).
export HF_TOKEN="your_read_token"

# Environment Variables
ARG_WORLD_SIZE=${1:-1}
ARG_NPROC_PER_NODE=${2:-8} # GPUs per node; the 8 after ":-" is only the default, so either pass the GPU count as the second argument or change the default (e.g. ${2:-3} for 3 GPUs)
ARG_MASTER_ADDR="127.0.0.1"
ARG_MASTER_PORT=16666
ARG_RANK=0

# Multiple conditions
if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
    WORLD_SIZE=$ARG_WORLD_SIZE
    NPROC_PER_NODE=$ARG_NPROC_PER_NODE
fi
if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
    MASTER_ADDR=$ARG_MASTER_ADDR
    MASTER_PORT=$ARG_MASTER_PORT
    RANK=$ARG_RANK
fi

echo "WORLD_SIZE: $WORLD_SIZE"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"

# Training Arguments
GLOBAL_BATCH_SIZE=128
LOCAL_BATCH_SIZE=4
GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
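# e.g. with the defaults above (WORLD_SIZE=1, NPROC_PER_NODE=8, LOCAL_BATCH_SIZE=4):
#   128 / (1 * 8 * 4) = 4 gradient-accumulation steps per optimizer update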

# Log Arguments
export TRANSFORMERS_OFFLINE=1 # set this to 0 if you get download errors and the models still need to be fetched from the Hub
export WANDB_PROJECT=videollama2
RUN_NAME=downstream_sft_settings_qlora
DATA_DIR=datasets
OUTP_DIR=work_dirs

torchrun --nnodes $WORLD_SIZE \
    --nproc_per_node $NPROC_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --node_rank $RANK \
    videollama2/train_flash_attn.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 --bits 4 \
    --deepspeed scripts/zero2.json \
    --model_type videollama2 \
    --model_path mistralai/Mistral-7B-Instruct-v0.2 \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type stc_connector \
    --pretrain_mm_mlp_adapter DAMO-NLP-SG/VideoLLaMA2-7B \
    --data_path   ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
    --data_folder ${DATA_DIR}/videollava_sft/ \
    --mm_vision_select_layer -2 \
    --image_aspect_ratio pad \
    --num_frames 8 \
    --bf16 True \
    --tf32 True \
    --fp16 False \
    --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/finetune_${RUN_NAME} \
    --num_train_epochs 1 \
    --per_device_train_batch_size $LOCAL_BATCH_SIZE \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 99 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --report_to tensorboard \
    --run_name $RUN_NAME
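
Once training finishes, a quick sanity check is to load the saved checkpoint with the same model_init() call that inference_test.py uses in the traceback further down. This is only a sketch, not a verified snippet: the output path is the one --output_dir above writes to, and the import is assumed to come from videollama2/__init__.py.

import sys
sys.path.append("./")  # assumes this is run from the VideoLLaMA2 repo root

from videollama2 import model_init  # model_init lives in videollama2/__init__.py per the traceback below

# Folder written by --output_dir in the script above
model_path = "work_dirs/videollama2/finetune_downstream_sft_settings_qlora"

# model_init returns (model, processor, tokenizer) in this codebase
model, processor, tokenizer = model_init(model_path)
print(type(model).__name__, type(tokenizer).__name__)
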
jhw0510 commented 3 weeks ago

(Quoting @thisurawz1's QLoRA script above.)

I get an error on the line --pretrain_mm_mlp_adapter DAMO-NLP-SG/VideoLLaMA2-7B \, and the error is as follows:

Traceback (most recent call last):
  File "/data/hao/ChatGLM/VideoLLaMA2-main/videollama2/train.py", line 575, in <module>
    train()
  File "/data/hao/ChatGLM/VideoLLaMA2-main/videollama2/train.py", line 496, in train
    model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
  File "/data/hao/ChatGLM/VideoLLaMA2-main/./videollama2/model/videollama2_arch.py", line 82, in initialize_vision_modules
    mm_projector_weights = load_mm_projector(pretrain_mm_mlp_adapter)
  File "/data/hao/ChatGLM/VideoLLaMA2-main/./videollama2/model/projector.py", line 59, in load_mm_projector
    snapshot_download(repo_id=model_path, cache_dir=cache_dir, token=token)
  File "/data/hao/anaconda3/envs/videollama2/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/data/hao/anaconda3/envs/videollama2/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/data/hao/ChatGLM/VideoLLaMA2-main/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F/'. Use repo_type argument if needed.

I looked at the source code and found that it will look for 'mm_projector.bin'. The specific code is here (videollama2/model/projector.py):

def load_mm_projector(model_path, cache_dir=None, token=None):
    if os.path.exists(os.path.join(model_path, 'mm_projector.bin')):
        is_local = True
        folder = model_path
    else:
        is_local = False
        folder = parse_snapshot_folder(model_path, cache_dir=cache_dir, repo_type="model")
        if not os.path.exists(os.path.join(folder, 'mm_projector.bin')):
            # downloading from remote repo
            from huggingface_hub import snapshot_download
            snapshot_download(repo_id=model_path, cache_dir=cache_dir, token=token)

However, there is no 'mm_projector.bin' in DAMO-NLP-SG/VideoLLaMA2-7B.

I copied the 'mm_projector.bin' from DAMO-NLP-SG/VideoLLaMA2-7B-Base and it seems to work. I guess it's because the Chat model doesn't change the parameters of mm_projector.bin.
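
A minimal sketch of that workaround, assuming (as above) that the Base repo's projector weights are compatible with the chat checkpoint; the helper name and the local target folder are hypothetical:

import os
import shutil
from huggingface_hub import hf_hub_download

def copy_base_projector(chat_dir, token=None):
    # Download only mm_projector.bin from the Base repository ...
    src = hf_hub_download(
        repo_id="DAMO-NLP-SG/VideoLLaMA2-7B-Base",
        filename="mm_projector.bin",
        token=token,
    )
    # ... and place it next to the chat checkpoint, where load_mm_projector()
    # checks os.path.join(model_path, 'mm_projector.bin') before downloading.
    dst = os.path.join(chat_dir, "mm_projector.bin")
    shutil.copy(src, dst)
    return dst

# e.g. with the chat checkpoint downloaded to a local folder:
# copy_base_projector("checkpoints/VideoLLaMA2-7B", token="your_read_token")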

But I encountered a new problem:

Traceback (most recent call last):
  File "/data/hao/ChatGLM/VideoLLaMA2-main/inference_test.py", line 18, in <module>
    model, processor, tokenizer = model_init(model_path)
  File "/data/hao/ChatGLM/VideoLLaMA2-main/videollama2/__init__.py", line 17, in model_init
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, **kwargs)
  File "/data/hao/ChatGLM/VideoLLaMA2-main/videollama2/model/__init__.py", line 165, in load_pretrained_model
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, token=token)
  File "/data/hao/anaconda3/envs/videollama2/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 878, in from_pretrained
    tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
  File "/data/hao/anaconda3/envs/videollama2/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 731, in __getitem__
    model_type = self._reverse_config_mapping[key.__name__]
KeyError: 'Videollama2Qwen2Config'
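
The KeyError suggests the installed transformers AutoTokenizer does not know the custom 'Videollama2Qwen2Config' class. A quick diagnostic sketch (not a fix, just a way to see which backbone the checkpoint declares; the local checkpoint path is the one from the traceback):

import json
import os

ckpt = "/data/hao/ChatGLM/VideoLLaMA2-main/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F"  # local checkpoint folder from the traceback
with open(os.path.join(ckpt, "config.json")) as f:
    cfg = json.load(f)

# The VideoLLaMA2.1 chat checkpoint is Qwen2-based (hence Videollama2Qwen2Config),
# while the training script above points --model_path at mistralai/Mistral-7B-Instruct-v0.2,
# so the two sides may simply disagree on the backbone.
print(cfg.get("model_type"), cfg.get("architectures"))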

ZHANGH83 commented 1 week ago

@jhw0510 I'm also confused: the chat version does not have an 'mm_projector.bin' file, so it seems we cannot finetune the chat version. And if we copy the '.bin' file from the Base version, isn't the model then being finetuned with the Base version's projector? Do you have any updates, please?

LiangMeng89 commented 2 days ago

(Quoting @ZHANGH83's question above.)

Hello, I'm a PhD student at ZJU and I also use VideoLLaMA2 in my own research. We have created a WeChat group to discuss VideoLLaMA2 issues and help each other; would you like to join us? Please contact me: WeChat LiangMeng19357260600, phone +86 19357260600, e-mail liangmeng89@zju.edu.cn.

LiangMeng89 commented 2 days ago

(Quoting @jhw0510's comment above in full: the QLoRA script, both error tracebacks, and the mm_projector.bin workaround, followed by the same WeChat group invitation as in the previous comment.)

jhw0510 commented 11 hours ago

(Quoting @ZHANGH83's question above.)

Yes, I found it strange too. After I finished testing I ran into yet another problem (I've forgotten what it was), and the model I barely managed to train still produced garbled output; my guess is that the projector and the model don't match. So I switched frameworks, and I'm now using llava-next-video.

jhw0510 commented 11 hours ago

(Quoting @LiangMeng89's comment above in full, including the QLoRA script, the error tracebacks, and the WeChat group invitation.)

Sure! Sorry, I only just saw this. Although I've switched frameworks, they are all based on llava overall, so I think there is a lot in common.