DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Apache License 2.0

How to do inference with the fine-tuned weights / model #83

Closed thisurawz1 closed 1 month ago

thisurawz1 commented 2 months ago

I have already fine-tuned VideoLLaMA2 on a custom dataset using QLoRA and got the files shown in the screenshot above. Now, how can I run inference with those weights? How can I use the fine-tuned weights/model with the inference script you provided?

Looking forward to a solution as soon as possible. Thank you.

```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video Inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.

    # Image Inference
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.

    model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)


if __name__ == "__main__":
    inference()
```

clownrat6 commented 2 months ago

Yes, you can. The latest commit supports directly loading a LoRA model.
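For anyone landing here later, a minimal sketch of what that looks like, pieced together from the working snippets shared further down in this thread (the checkpoint path and example video are placeholders; substitute your own fine-tuning output directory):

```python
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

disable_torch_init()

# Placeholder path: the directory produced by your LoRA/QLoRA fine-tuning run.
model_path = 'work_dirs/videollama2/finetune_downstream_sft_settings_qlora'

# According to this thread, model_init can load that directory directly
# (the loader reports loading and merging the LoRA weights).
model, processor, tokenizer = model_init(model_path)

modal = 'video'
modal_path = 'assets/cat_and_chicken.mp4'  # any local video file
instruct = 'Describe what happens in the video.'

output = mm_infer(processor[modal](modal_path), instruct, model=model,
                  tokenizer=tokenizer, do_sample=False, modal=modal)
print(output)
```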

thisurawz1 commented 2 months ago

Can you share the script for it, please? Do we just have to change the current model path to the LoRA path? I did that, but it didn't work at all.

thisurawz1 commented 2 months ago

Can you share the exact script for running inference with the LoRA weights, please?

thisurawz1 commented 2 months ago

> Yes, you can. The latest commit supports directly loading a LoRA model.

Can you share a script showing how to load the LoRA model directly? I already finished the fine-tuning and got the files shown in the attached screenshots, but I don't know how to run inference with them.

LiangMeng89 commented 1 month ago

> > Yes, you can. The latest commit supports directly loading a LoRA model.
>
> Can you share a script showing how to load the LoRA model directly? I already finished the fine-tuning and got the files shown in the attached screenshots, but I don't know how to run inference with them.

Hello! I have the same problem. Have you solved it?

ffcarina commented 1 month ago

> Can you share a script showing how to load the LoRA model directly? I already finished the fine-tuning and got those files, but I don't know how to run inference with them.

@thisurawz1 Through the following code, I successfully loaded the LoRA fine-tuned model for inference. Hope this helps you.

```python
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

disable_torch_init()

modal = 'video'
modal_path = 'VideoLLaMA2/videollama2/serve/examples/sample_demo_1.mp4'
instruct = 'What is the baby wearing and what is he doing?'
model_path = 'VideoLLaMA2/work_dirs/videollama2/finetune_downstream_sft_settings_qlora_MESC'  # your model dir

model, processor, tokenizer = model_init(model_path)
output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
```
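Note that this snippet only stores the reply in `output`; add a print to see it:

```python
print(output)  # the model's textual answer
```

For image inference, the same call works after switching `modal` to `'image'` and pointing `modal_path` at an image file.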
thisurawz1 commented 1 month ago

> > Yes, you can. The latest commit supports directly loading a LoRA model.
> >
> > Can you share a script showing how to load the LoRA model directly? I already finished the fine-tuning and got the files shown in the attached screenshots, but I don't know how to run inference with them.
>
> Hello! I have the same problem. Have you solved it?

Yes. You have to update the VideoLLaMA2 repository to the latest commit and then use the following script. You only have to change the model path in the original inference script; that's all.

```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video Inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.

    # Image Inference
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.

    model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
    # Fine-tuned model inference (only need to replace model_path):
    # model_path = 'work_dirs/videollama2/finetune_downstream_sft_settings_qlora'  # your fine-tuned weights directory
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)


if __name__ == "__main__":
    inference()
```
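One caveat about the script above: the image-inference block re-assigns `modal`, `modal_path`, and `instruct` after the video block, so as written the script answers the image query; comment out whichever block you do not need. If you run this often, a small refactor (just a sketch under the same API, not part of the original script) keeps the checkpoint and modality explicit:

```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def run_inference(model_path, modal, modal_path, instruct):
    """Load a base or fine-tuned checkpoint and answer a single query."""
    disable_torch_init()
    model, processor, tokenizer = model_init(model_path)
    return mm_infer(processor[modal](modal_path), instruct,
                    model=model, tokenizer=tokenizer, do_sample=False, modal=modal)


if __name__ == "__main__":
    # Placeholder path: replace with your own fine-tuned weights directory.
    answer = run_inference('work_dirs/videollama2/finetune_downstream_sft_settings_qlora',
                           'video', 'assets/cat_and_chicken.mp4',
                           'What animals are in the video, and what are they doing?')
    print(answer)
```
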
thisurawz1 commented 1 month ago

> @thisurawz1 Through the following code, I successfully loaded the LoRA fine-tuned model for inference. Hope this helps you.

Thank you so much

LiangMeng89 commented 4 weeks ago

> Yes. You have to update the VideoLLaMA2 repository to the latest commit and then use the following script. You only have to change the model path in the original inference script; that's all.

Thank you, I will try this.

LiangMeng89 commented 3 weeks ago

> Yes, you can. The latest commit supports directly loading a LoRA model.

Dear author, I applied your LoRA checkpoint folder structure and the loading example code (https://github.com/DAMO-NLP-SG/VideoLLaMA2/issues/36) to my QLoRA fine-tuned inference script on my own experimental video data, but it still raises errors. The old inference code from the README works; I only dropped your code into mine. Please help me!

1: My QLoRA fine-tuned inference code:

```python
import torch
import transformers

import sys
sys.path.append('./')

from videollama2.conversation import conv_templates
from videollama2.constants import DEFAULT_MMODAL_TOKEN, MMODAL_TOKEN_INDEX
from videollama2.mm_utils import get_model_name_from_path, tokenizer_MMODAL_token, process_video, process_image
from videollama2.model.builder import load_pretrained_model


def inference():
    # Video Inference
    paths = ['./datasets/test_data/videos/video_202.mp4']
    questions = ['****************************hidden********************************']
    # Reply:
    modal_list = ['video']

    # Image Inference
    # paths = ['assets/sora.png']
    # questions = ['What is the woman wearing, what is she doing, and how does the image feel?']
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.
    # modal_list = ['image']

    # 1. Initialize the model.
    model_path = './checkpoints/VideoLLaMA2-7B-qlora'   # ./checkpoints/VideoLLaMA2-7B
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, './checkpoints/Mistral-7B-Instruct-v0.2', model_name)  # None
    model = model.to('cuda:0')
    conv_mode = 'llama2'

    # 2. Visual preprocess (load & transform image or video).
    if modal_list[0] == 'video':
        tensor = process_video(paths[0], processor, model.config.image_aspect_ratio).to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["VIDEO"]
        modal_token_index = MMODAL_TOKEN_INDEX["VIDEO"]
    else:
        tensor = process_image(paths[0], processor, model.config.image_aspect_ratio)[0].to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["IMAGE"]
        modal_token_index = MMODAL_TOKEN_INDEX["IMAGE"]
    tensor = [tensor]

    # 3. Text preprocess (tag process & generate prompt).
    question = default_mm_token + "\n" + questions[0]
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_MMODAL_token(prompt, tokenizer, modal_token_index, return_tensors='pt').unsqueeze(0).to('cuda:0')

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images_or_videos=tensor,
            modal_list=modal_list,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
        )

    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    print(outputs[0])


if __name__ == "__main__":
    inference()
```

2: Terminal errors:

```
(videollama2) lm@SR6430G23:~/videollama2/VideoLLaMA2$ /home/lm/anaconda3/envs/videollama2/bin/python inference.py
200
Loading VideoLLaMA from base model...
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.36s/it]
Some weights of Videollama2MistralForCausalLM were not initialized from the model checkpoint at ./checkpoints/Mistral-7B-Instruct-v0.2 and are newly initialized: ['model.mm_projector.readout.0.bias', 'model.mm_projector.readout.0.weight', ..., 'model.mm_projector.sampler.0.bias', 'model.mm_projector.sampler.0.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading additional VideoLLaMA weights...
Loading LoRA weights...
Merging LoRA weights...
Model is loaded...
Loading VideoLLaMA 2 from base model...
You are using a model of type mistral to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 166, in <module>
    inference()
  File "inference.py", line 127, in inference
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, './checkpoints/Mistral-7B-Instruct-v0.2', model_name)  # None
  File "/home/lm/videollama2/VideoLLaMA2/videollama2/model/builder.py", line 140, in load_pretrained_model
    model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 889, in _load_state_dict_into_meta_model
    hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 190, in create_quantized_param
    raise ValueError(
ValueError: Supplied state dict for model.layers.0.mlp.down_proj.weight does not contain `bitsandbytes__*` and possibly other `quantized_stats` components.
```
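For what it's worth, the approach that other commenters in this thread reported working skips `load_pretrained_model` (and the separate base-model path) entirely and just hands the fine-tuned checkpoint directory to `model_init`. A minimal sketch, reusing the paths from the script above (the instruction string is a placeholder since the original question is hidden, and whether this avoids the bitsandbytes error will depend on your checkpoint and environment):

```python
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

disable_torch_init()

# The QLoRA checkpoint directory from the script above.
model_path = './checkpoints/VideoLLaMA2-7B-qlora'

model, processor, tokenizer = model_init(model_path)

modal = 'video'
modal_path = './datasets/test_data/videos/video_202.mp4'
instruct = 'Describe the video.'  # placeholder for the hidden question

output = mm_infer(processor[modal](modal_path), instruct, model=model,
                  tokenizer=tokenizer, do_sample=False, modal=modal)
print(output)
```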

LiangMeng89 commented 1 week ago

> @thisurawz1 Through the following code, I successfully loaded the LoRA fine-tuned model for inference. Hope this helps you.

Hello! I'm a PhD student at ZJU, and I also use VideoLLaMA2 in my own research. We have created a WeChat group to discuss VideoLLaMA2 issues and help each other; would you like to join us? Please contact me: WeChat ID LiangMeng19357260600, phone number +86 19357260600, e-mail liangmeng89@zju.edu.cn.