DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Apache License 2.0

Segmentation fault with the provided inference code #66

Open yankee624 opened 3 months ago

yankee624 commented 3 months ago

I installed the repo with the provided script:

git clone https://github.com/DAMO-NLP-SG/VideoLLaMA2
cd VideoLLaMA2
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation

The commit I used was 0b1a87fa7ed056f3e5bdfbf12dcf7979c1ac10f8, which is a recent one.
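
In case it helps with reproducing this, the relevant versions can be printed with a few standard torch / transformers attributes (this snippet is only for reporting the environment, it is not part of the repo code):

import torch
import transformers

# Basic environment info for the report; all of these are standard attributes.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA (build):", torch.version.cuda)
print("GPU count:", torch.cuda.device_count())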

Then, without changing any code, I ran the following inference code provided by the repo:

import torch
import transformers
import sys
sys.path.append('./')
from videollama2.conversation import conv_templates, SeparatorStyle
from videollama2.constants import DEFAULT_MMODAL_TOKEN, MMODAL_TOKEN_INDEX
from videollama2.mm_utils import get_model_name_from_path, tokenizer_MMODAL_token, KeywordsStoppingCriteria, process_video, process_image
from videollama2.model.builder import load_pretrained_model


def inference():
    # Image Inference
    paths = ['assets/sora.png']
    questions = ['What is the woman wearing, what is she doing, and how does the image feel?']
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.
    modal_list = ['image']
    # 1. Initialize the model.
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B-Base'
    model_path = "DAMO-NLP-SG/VideoLLaMA2-7B"
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name)
    model = model.to('cuda:0')
    conv_mode = 'llama2'
    # 2. Visual preprocess (load & transform image or video).
    if modal_list[0] == 'video':
        tensor = process_video(paths[0], processor, model.config.image_aspect_ratio).to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["VIDEO"]
        modal_token_index = MMODAL_TOKEN_INDEX["VIDEO"]
    else:
        tensor = process_image(paths[0], processor, model.config.image_aspect_ratio)[0].to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["IMAGE"]
        modal_token_index = MMODAL_TOKEN_INDEX["IMAGE"]
    tensor = [tensor]
    # 3. Text preprocess (tag process & generate prompt).
    question = default_mm_token + "\n" + questions[0]
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_MMODAL_token(prompt, tokenizer, modal_token_index, return_tensors='pt').unsqueeze(0).to('cuda:0')
    # 4. Generate a response according to visual signals and prompts. 
    stop_str = conv.sep if conv.sep_style in [SeparatorStyle.SINGLE] else conv.sep2
    # keywords = ["<s>", "</s>"]
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images_or_videos=tensor,
            modal_list=modal_list,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria],
        )
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    print(outputs[0])


if __name__ == "__main__":
    inference()

I get the following output:

Loading VideoLLaMA 2 from base model...
You are using a model of type mixtral to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 29.97it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 19/19 [00:03<00:00,  6.20it/s]
Some weights of Videollama2MistralForCausalLM were not initialized from the model checkpoint at mistralai/Mixtral-8x7B-Instruct-v0.1 and are newly initialized: ['model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.10.mlp.down_proj.weight', 'model.layers.10.mlp.gate_proj.weight', 'model.layers.10.mlp.up_proj.weight', 'model.layers.11.mlp.down_proj.weight', 'model.layers.11.mlp.gate_proj.weight', 'model.layers.11.mlp.up_proj.weight', 'model.layers.12.mlp.down_proj.weight', 'model.layers.12.mlp.gate_proj.weight', 'model.layers.12.mlp.up_proj.weight', 'model.layers.13.mlp.down_proj.weight', 'model.layers.13.mlp.gate_proj.weight', 'model.layers.13.mlp.up_proj.weight', 'model.layers.14.mlp.down_proj.weight', 'model.layers.14.mlp.gate_proj.weight', 'model.layers.14.mlp.up_proj.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.15.mlp.gate_proj.weight', 'model.layers.15.mlp.up_proj.weight', 'model.layers.16.mlp.down_proj.weight', 'model.layers.16.mlp.gate_proj.weight', 'model.layers.16.mlp.up_proj.weight', 'model.layers.17.mlp.down_proj.weight', 'model.layers.17.mlp.gate_proj.weight', 'model.layers.17.mlp.up_proj.weight', 'model.layers.18.mlp.down_proj.weight', 'model.layers.18.mlp.gate_proj.weight', 'model.layers.18.mlp.up_proj.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.19.mlp.gate_proj.weight', 'model.layers.19.mlp.up_proj.weight', 'model.layers.2.mlp.down_proj.weight', 'model.layers.2.mlp.gate_proj.weight', 'model.layers.2.mlp.up_proj.weight', 'model.layers.20.mlp.down_proj.weight', 'model.layers.20.mlp.gate_proj.weight', 'model.layers.20.mlp.up_proj.weight', 'model.layers.21.mlp.down_proj.weight', 'model.layers.21.mlp.gate_proj.weight', 'model.layers.21.mlp.up_proj.weight', 'model.layers.22.mlp.down_proj.weight', 'model.layers.22.mlp.gate_proj.weight', 'model.layers.22.mlp.up_proj.weight', 'model.layers.23.mlp.down_proj.weight', 'model.layers.23.mlp.gate_proj.weight', 'model.layers.23.mlp.up_proj.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.24.mlp.gate_proj.weight', 'model.layers.24.mlp.up_proj.weight', 'model.layers.25.mlp.down_proj.weight', 'model.layers.25.mlp.gate_proj.weight', 'model.layers.25.mlp.up_proj.weight', 'model.layers.26.mlp.down_proj.weight', 'model.layers.26.mlp.gate_proj.weight', 'model.layers.26.mlp.up_proj.weight', 'model.layers.27.mlp.down_proj.weight', 'model.layers.27.mlp.gate_proj.weight', 'model.layers.27.mlp.up_proj.weight', 'model.layers.28.mlp.down_proj.weight', 'model.layers.28.mlp.gate_proj.weight', 'model.layers.28.mlp.up_proj.weight', 'model.layers.29.mlp.down_proj.weight', 'model.layers.29.mlp.gate_proj.weight', 'model.layers.29.mlp.up_proj.weight', 'model.layers.3.mlp.down_proj.weight', 'model.layers.3.mlp.gate_proj.weight', 'model.layers.3.mlp.up_proj.weight', 'model.layers.30.mlp.down_proj.weight', 'model.layers.30.mlp.gate_proj.weight', 'model.layers.30.mlp.up_proj.weight', 'model.layers.31.mlp.down_proj.weight', 'model.layers.31.mlp.gate_proj.weight', 'model.layers.31.mlp.up_proj.weight', 'model.layers.4.mlp.down_proj.weight', 'model.layers.4.mlp.gate_proj.weight', 'model.layers.4.mlp.up_proj.weight', 'model.layers.5.mlp.down_proj.weight', 'model.layers.5.mlp.gate_proj.weight', 'model.layers.5.mlp.up_proj.weight', 'model.layers.6.mlp.down_proj.weight', 'model.layers.6.mlp.gate_proj.weight', 
'model.layers.6.mlp.up_proj.weight', 'model.layers.7.mlp.down_proj.weight', 'model.layers.7.mlp.gate_proj.weight', 'model.layers.7.mlp.up_proj.weight', 'model.layers.8.mlp.down_proj.weight', 'model.layers.8.mlp.gate_proj.weight', 'model.layers.8.mlp.up_proj.weight', 'model.layers.9.mlp.down_proj.weight', 'model.layers.9.mlp.gate_proj.weight', 'model.layers.9.mlp.up_proj.weight', 'model.mm_projector.readout.0.bias', 'model.mm_projector.readout.0.weight', 'model.mm_projector.readout.2.bias', 'model.mm_projector.readout.2.weight', 'model.mm_projector.s1.b1.conv1.bn.bias', 'model.mm_projector.s1.b1.conv1.bn.weight', 'model.mm_projector.s1.b1.conv1.conv.weight', 'model.mm_projector.s1.b1.conv2.bn.bias', 'model.mm_projector.s1.b1.conv2.bn.weight', 'model.mm_projector.s1.b1.conv2.conv.weight', 'model.mm_projector.s1.b1.conv3.bn.bias', 'model.mm_projector.s1.b1.conv3.bn.weight', 'model.mm_projector.s1.b1.conv3.conv.weight', 'model.mm_projector.s1.b1.downsample.bn.bias', 'model.mm_projector.s1.b1.downsample.bn.weight', 'model.mm_projector.s1.b1.downsample.conv.weight', 'model.mm_projector.s1.b1.se.fc1.bias', 'model.mm_projector.s1.b1.se.fc1.weight', 'model.mm_projector.s1.b1.se.fc2.bias', 'model.mm_projector.s1.b1.se.fc2.weight', 'model.mm_projector.s1.b2.conv1.bn.bias', 'model.mm_projector.s1.b2.conv1.bn.weight', 'model.mm_projector.s1.b2.conv1.conv.weight', 'model.mm_projector.s1.b2.conv2.bn.bias', 'model.mm_projector.s1.b2.conv2.bn.weight', 'model.mm_projector.s1.b2.conv2.conv.weight', 'model.mm_projector.s1.b2.conv3.bn.bias', 'model.mm_projector.s1.b2.conv3.bn.weight', 'model.mm_projector.s1.b2.conv3.conv.weight', 'model.mm_projector.s1.b2.se.fc1.bias', 'model.mm_projector.s1.b2.se.fc1.weight', 'model.mm_projector.s1.b2.se.fc2.bias', 'model.mm_projector.s1.b2.se.fc2.weight', 'model.mm_projector.s1.b3.conv1.bn.bias', 'model.mm_projector.s1.b3.conv1.bn.weight', 'model.mm_projector.s1.b3.conv1.conv.weight', 'model.mm_projector.s1.b3.conv2.bn.bias', 'model.mm_projector.s1.b3.conv2.bn.weight', 'model.mm_projector.s1.b3.conv2.conv.weight', 'model.mm_projector.s1.b3.conv3.bn.bias', 'model.mm_projector.s1.b3.conv3.bn.weight', 'model.mm_projector.s1.b3.conv3.conv.weight', 'model.mm_projector.s1.b3.se.fc1.bias', 'model.mm_projector.s1.b3.se.fc1.weight', 'model.mm_projector.s1.b3.se.fc2.bias', 'model.mm_projector.s1.b3.se.fc2.weight', 'model.mm_projector.s1.b4.conv1.bn.bias', 'model.mm_projector.s1.b4.conv1.bn.weight', 'model.mm_projector.s1.b4.conv1.conv.weight', 'model.mm_projector.s1.b4.conv2.bn.bias', 'model.mm_projector.s1.b4.conv2.bn.weight', 'model.mm_projector.s1.b4.conv2.conv.weight', 'model.mm_projector.s1.b4.conv3.bn.bias', 'model.mm_projector.s1.b4.conv3.bn.weight', 'model.mm_projector.s1.b4.conv3.conv.weight', 'model.mm_projector.s1.b4.se.fc1.bias', 'model.mm_projector.s1.b4.se.fc1.weight', 'model.mm_projector.s1.b4.se.fc2.bias', 'model.mm_projector.s1.b4.se.fc2.weight', 'model.mm_projector.s2.b1.conv1.bn.bias', 'model.mm_projector.s2.b1.conv1.bn.weight', 'model.mm_projector.s2.b1.conv1.conv.weight', 'model.mm_projector.s2.b1.conv2.bn.bias', 'model.mm_projector.s2.b1.conv2.bn.weight', 'model.mm_projector.s2.b1.conv2.conv.weight', 'model.mm_projector.s2.b1.conv3.bn.bias', 'model.mm_projector.s2.b1.conv3.bn.weight', 'model.mm_projector.s2.b1.conv3.conv.weight', 'model.mm_projector.s2.b1.se.fc1.bias', 'model.mm_projector.s2.b1.se.fc1.weight', 'model.mm_projector.s2.b1.se.fc2.bias', 'model.mm_projector.s2.b1.se.fc2.weight', 'model.mm_projector.s2.b2.conv1.bn.bias', 
'model.mm_projector.s2.b2.conv1.bn.weight', 'model.mm_projector.s2.b2.conv1.conv.weight', 'model.mm_projector.s2.b2.conv2.bn.bias', 'model.mm_projector.s2.b2.conv2.bn.weight', 'model.mm_projector.s2.b2.conv2.conv.weight', 'model.mm_projector.s2.b2.conv3.bn.bias', 'model.mm_projector.s2.b2.conv3.bn.weight', 'model.mm_projector.s2.b2.conv3.conv.weight', 'model.mm_projector.s2.b2.se.fc1.bias', 'model.mm_projector.s2.b2.se.fc1.weight', 'model.mm_projector.s2.b2.se.fc2.bias', 'model.mm_projector.s2.b2.se.fc2.weight', 'model.mm_projector.s2.b3.conv1.bn.bias', 'model.mm_projector.s2.b3.conv1.bn.weight', 'model.mm_projector.s2.b3.conv1.conv.weight', 'model.mm_projector.s2.b3.conv2.bn.bias', 'model.mm_projector.s2.b3.conv2.bn.weight', 'model.mm_projector.s2.b3.conv2.conv.weight', 'model.mm_projector.s2.b3.conv3.bn.bias', 'model.mm_projector.s2.b3.conv3.bn.weight', 'model.mm_projector.s2.b3.conv3.conv.weight', 'model.mm_projector.s2.b3.se.fc1.bias', 'model.mm_projector.s2.b3.se.fc1.weight', 'model.mm_projector.s2.b3.se.fc2.bias', 'model.mm_projector.s2.b3.se.fc2.weight', 'model.mm_projector.s2.b4.conv1.bn.bias', 'model.mm_projector.s2.b4.conv1.bn.weight', 'model.mm_projector.s2.b4.conv1.conv.weight', 'model.mm_projector.s2.b4.conv2.bn.bias', 'model.mm_projector.s2.b4.conv2.bn.weight', 'model.mm_projector.s2.b4.conv2.conv.weight', 'model.mm_projector.s2.b4.conv3.bn.bias', 'model.mm_projector.s2.b4.conv3.bn.weight', 'model.mm_projector.s2.b4.conv3.conv.weight', 'model.mm_projector.s2.b4.se.fc1.bias', 'model.mm_projector.s2.b4.se.fc1.weight', 'model.mm_projector.s2.b4.se.fc2.bias', 'model.mm_projector.s2.b4.se.fc2.weight', 'model.mm_projector.sampler.0.bias', 'model.mm_projector.sampler.0.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You shouldn't move a model when it is dispatched on multiple devices.
Segmentation fault (core dumped)

The same thing happens with other model versions, such as Mixtral 8x7B or the Base models. How should I fix it?
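
The only hint I see is the "You shouldn't move a model when it is dispatched on multiple devices" warning, which comes from the model = model.to('cuda:0') call after the checkpoint has been sharded across GPUs. Below is a minimal sketch of the workaround I plan to try, assuming load_pretrained_model follows the LLaVA-style builder signature and forwards a device_map keyword to from_pretrained (I have not confirmed this fixes the segfault):

from videollama2.mm_utils import get_model_name_from_path
from videollama2.model.builder import load_pretrained_model

model_path = "DAMO-NLP-SG/VideoLLaMA2-7B"
model_name = get_model_name_from_path(model_path)

# Hypothetical workaround: pin the whole model to GPU 0 so accelerate does not
# split it across devices; the later .to('cuda:0') then becomes a no-op.
# Assumes the builder accepts a device_map keyword (LLaVA-style loaders do);
# if it does not, load_pretrained_model would need to expose one.
tokenizer, model, processor, context_len = load_pretrained_model(
    model_path, None, model_name, device_map={"": 0}
)

# If the model was still dispatched across GPUs, transformers records the
# layout in hf_device_map; None means no dispatch happened.
print(getattr(model, "hf_device_map", None))

If it still segfaults with everything on one GPU, that would at least rule out multi-device dispatch as the cause.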

wenrousha commented 3 months ago

I also had this problem. Have you solved it yet?

yankee624 commented 3 months ago

@wenrousha Nope

mhardik003 commented 2 months ago

Same error here. Please let me know if anyone has solved it.
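
In the meantime, a quick way to see where the crash happens is to enable Python's built-in faulthandler at the top of the inference script (a debugging aid only, not a fix):

import faulthandler

# Dump the Python-level stack of every thread if the process receives a fatal
# signal such as SIGSEGV, which narrows down which call triggers the crash.
faulthandler.enable()

The same effect can be had by running the script with python -X faulthandler.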