LLaVA-VL / LLaVA-NeXT

The response issue between 0.5 and 7b #165

Open shihy2988 opened 3 weeks ago

shihy2988 commented 3 weeks ago

The 0.5b response is normal, but the 7b response is wrong.

With the same image, the only thing I changed in the code is the model path:

pretrained = "/home/shihongyu/MMLM_models/lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"

but the responses differ.

0.5b response: ['The image shows a conveyor belt in an industrial setting, likely part of a factory or processing plant. The conveyor is moving materials along its path, and there are some mechanical components visible on the left side of the frame.']

7b response: ['!']

xsgldhy commented 3 weeks ago

I meet the same problem when using qwen2-7b-ov for inference; how can I fix this? In the LLaVA WeChat group, someone said it may be due to the version of the transformers library. One person updated transformers to 4.41.2, after which the response changed to [''], which is still not correct.

xsgldhy commented 3 weeks ago

I checked the output tensors of the decoder layers and found that, after the last decoder layer, the features of the last token were all NaNs (see the attached screenshot). This is very weird. Because of this problem, I failed to reproduce the results on the video benchmarks. Does anyone know how to solve it? @ZhangYuanhan-AI @Luodian
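
In case it helps anyone reproduce the check, here is a rough sketch of what I did: register forward hooks on the decoder layers and report the first one whose output hidden states contain NaNs. The attribute path model.model.layers is an assumption based on the Qwen2 backbone; adjust it for your build.

import torch

# Rough sketch: hook every decoder layer and report the first one whose
# output hidden states contain NaNs. Assumes the decoder layers are reachable
# as model.model.layers (Qwen2-style backbone); adjust the path if needed.
def register_nan_hooks(model):
    handles = []
    for idx, layer in enumerate(model.model.layers):
        def hook(module, inputs, outputs, idx=idx):
            hidden = outputs[0] if isinstance(outputs, tuple) else outputs
            if torch.isnan(hidden).any():
                print(f"NaNs appear in the output of decoder layer {idx}")
        handles.append(layer.register_forward_hook(hook))
    return handles

# handles = register_nan_hooks(model)
# model.generate(...)            # run the failing prompt
# for h in handles: h.remove()   # remove the hooks afterwards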

Luodian commented 3 weeks ago

Here's my reproduced result with ov-7b (screenshots attached).

My code is:

from operator import attrgetter
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

import torch
import cv2
import numpy as np
from PIL import Image
import requests
import copy
import warnings
from decord import VideoReader, cpu

warnings.filterwarnings("ignore")
# Load the OneVision model
pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation="sdpa")

model.eval()

# Function to extract frames from video
def load_video(video_path, max_frames_num):
    if type(video_path) == str:
        vr = VideoReader(video_path, ctx=cpu(0))
    else:
        vr = VideoReader(video_path[0], ctx=cpu(0))
    total_frame_num = len(vr)
    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
    frame_idx = uniform_sampled_frames.tolist()
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames  # (frames, height, width, channels)

# Load and process video
video_path = "jobs.mp4"
video_frames = load_video(video_path, 16)
print(video_frames.shape) # (16, 1024, 576, 3)
image_tensors = []
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors.append(frames)

# Prepare conversation input
conv_template = "qwen_1_5"
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]

# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])

Maybe you installed the wrong version of llava, e.g. LLaVA 1.5 instead of our current version? You can try cleaning your conda environment and reinstalling llava.

CyrusCY commented 3 weeks ago

I've met the same response issue with 7b, but after I changed the torch version to 2.1.2 and the transformers version to 4.40.0, the response shows correctly (a quick runtime version check is sketched after the output below).

Loaded LLaVA model: lmms-lab/llava-onevision-qwen2-7b-ov
Loading vision tower: google/siglip-so400m-patch14-384
Model Class: LlavaQwenForCausalLM
['The image shows a radar chart, also known as a spider chart or a star chart, which is used to compare multiple quantitative variables. Each axis represents a different variable, and the values are plotted along these axes. The chart is color-coded to represent different categories or models, with lines connecting the points for each category.\n\nIn this particular radar chart, it appears to be comparing various models or systems across several performance metrics. The labels on the axes suggest that the metrics being compared could include aspects like "MMB-Vet," "Llava-1.5," "VQA," "GQA," "SQA-IMG," "TextVQA," "MME," "BLIP-2," "InstructBLIP," "Pope," and "Qwen-VL-Chat." The numbers along the axes likely represent scores or measurements of performance for each model in those respective categories.\n\nThe purpose of such a chart is to provide a visual summary of how each model performs relative to the others across all the measured attributes. It\'s a useful tool for quickly assessing strengths and weaknesses of different models or systems at a glance.']
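
As a quick sanity check, this prints the versions the running interpreter actually imports (standard version attributes only, nothing LLaVA-specific), which can differ from what pip list shows when several environments are installed:

import torch
import transformers

# Versions seen by the running interpreter; the comments reflect my working setup.
print("torch        :", torch.__version__)         # 2.1.2
print("CUDA (torch) :", torch.version.cuda)
print("transformers :", transformers.__version__)  # 4.40.0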

CyrusCY commented 3 weeks ago

Maybe you could use the same requirements.txt as in the main branch.

rookiez7 commented 2 weeks ago

I've met the same response issue with 7b, but after I changed the torch version to 2.1.2 and the transformers version to 4.40.0, the response shows correctly.

When I use the model llava-onevision-qwen2-0.5b-si the response is correct, but when I change the model to llava-onevision-qwen2-7b-si I get the same response ['']. Can you give some advice? These are my torch and transformers versions:

open_clip_torch           2.26.1
torch                     2.1.2
torchvision               0.16.2
hf_transfer               0.1.8
transformers              4.40.0.dev0

CyrusCY commented 2 weeks ago

Just did a quick run, should be ok with si as well.

Loaded LLaVA model: lmms-lab/llava-onevision-qwen2-7b-si
Loading vision tower: google/siglip-so400m-patch14-384
Model Class: LlavaQwenForCausalLM
["This image is a radar chart that compares the performance of different models on various metrics. The models being compared are BLIP-2, InstructBLIP, and Qwen-VL-Chat. The metrics being evaluated include VQA (Visual Question Answering), QA (Question Answering), GQA (General Question Answering), SQA-IMG (Specific Question Answering with Image), and POPE (Product of Perceptual and Language Evaluation). Each model's performance is represented by a line graph, and the highest score for each metric is indicated by a red dot."]

My environment:

Package                  Version
------------------------ ------------
accelerate               0.29.3
certifi                  2024.7.4
charset-normalizer       3.3.2
einops                   0.6.1
filelock                 3.15.4
fsspec                   2024.6.1
huggingface-hub          0.24.6
idna                     3.7
Jinja2                   3.1.4
llava                    1.7.0.dev0
MarkupSafe               2.1.5
mpmath                   1.3.0
networkx                 3.3
numpy                    1.22.0
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        9.1.0.70
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.6.20
nvidia-nvtx-cu12         12.1.105
packaging                24.1
pillow                   10.3.0
pip                      22.0.2
psutil                   6.0.0
PyYAML                   6.0.2
regex                    2024.7.24
requests                 2.32.3
safetensors              0.4.4
setuptools               59.6.0
sympy                    1.13.2
tokenizers               0.19.1
torch                    2.1.2+cu121
torchaudio               2.1.2+cu121
torchvision              0.16.2+cu121
tqdm                     4.66.5
transformers             4.40.0
triton                   2.1.0
typing_extensions        4.12.2
urllib3                  2.2.2
wheel                    0.37.1
shihy2988 commented 2 weeks ago

I've met the same response issue with 7b, but after I changed the torch version to 2.1.2 and the transformers version to 4.40.0, the response shows correctly.

You are right, thanks, I have solved it.

rookiez7 commented 2 weeks ago

I changed my transformers from 4.40.0.dev0 to 4.40.0, but the result is still ['']. I have no idea how to solve this. Sad...

shihy2988 commented 2 weeks ago

I changed my transformers from 4.40.0.dev0 to 4.40.0, but the result is still ['']. I have no idea how to solve this. Sad...

I re-downloaded the project and only changed the torch and transformers versions, and now it works. Maybe you should compare your environment against the requirements.txt.
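
For example, a rough sketch that reports mismatches against a pinned requirements.txt (it only handles simple package==version lines; URLs, markers, and version ranges are skipped):

from importlib.metadata import version, PackageNotFoundError

def check_requirements(path="requirements.txt"):
    # Compare installed package versions against "package==version" pins.
    for raw in open(path):
        line = raw.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, wanted = line.partition("==")
        try:
            installed = version(name)
        except PackageNotFoundError:
            print(f"{name}: not installed (requirements pin {wanted})")
            continue
        if installed != wanted:
            print(f"{name}: installed {installed}, requirements pin {wanted}")

check_requirements()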

rookiez7 commented 2 weeks ago

I re-downloaded this repo and tried transformers versions 4.40.0.dev, 4.40.0, and 4.41.2; the result is still ['']. Some things I did: all the weights I use are local weights. Below are my changes.

  1. Meta-Llama-3-8B-Instruct: in llava/conversation.py, line 387, tokenizer = AutoTokenizer.from_pretrained("local_path/LLaVA-NeXT/Meta-Llama-3-8B-Instruct")

  2. siglip-so400m-patch14-384: in llava-onevision-qwen2-7b-si/config.json, line 176, I set the vision tower path to "local_path/siglip-so400m-patch14-384". Then I got an error about a mismatch, which I fixed using https://github.com/LLaVA-VL/LLaVA-NeXT/issues/148#issuecomment-2298549964

The 0.5b model then works fine, but the 7b model's result is always ['']. Below is the output from the 7b model:

(llava) root@sugon:~/work/project/LLaVA-NeXT# python demo_single_image.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loaded LLaVA model: /root/work/project/LLaVA-NeXT_bak/llava-onevision-qwen2-7b-si
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.
Loading vision tower: /root/work/project/LLaVA-NeXT/siglip-so400m-patch14-384
Loading checkpoint shards: 100%
Model Class: LlavaQwenForCausalLM
['']

ThunderVVV commented 2 weeks ago

Updating to torch 2.1.2 works for me. When I used torch 2.0.1 before, 7b answered with many "!!!..."

rookiez7 commented 2 weeks ago

Updating to torch 2.1.2 works for me. When I used torch 2.0.1 before, 7b answered with many "!!!..."

That probably won't work for me; this is my torch version:

(llava) root@a123:~# pip list|grep torch
open_clip_torch           2.26.1
torch                     2.1.2
torchvision               0.16.2

jaca-pereira commented 2 weeks ago

I have tried all the torch and transformers versions suggested, with both the ov and si versions of the 7B model, and in every case I simply get "!". If I turn on do_sample and use a temperature of 0.1, I get an error:

File "/user/home/.conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2829, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0

There seems to be no way to run the 7B model, yet the 0.5B model runs just fine. I would really appreciate an explanation or a fix for this problem, as I can't find one.
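
For what it's worth, the multinomial error looks like a downstream symptom of NaN/Inf logits rather than a sampling problem. A single forward pass makes that easy to confirm; this is a rough sketch that reuses the variables from the generation example above and assumes forward() accepts the same keyword arguments as generate() (drop any it does not):

import torch

# One forward pass on the failing inputs: if the logits already contain
# NaN/Inf, torch.multinomial is bound to fail later during sampling.
with torch.inference_mode():
    out = model(input_ids, images=image_tensors, image_sizes=image_sizes,
                modalities=["video"])
last_logits = out.logits[:, -1, :]
print("NaNs in logits:", torch.isnan(last_logits).any().item())
print("Infs in logits:", torch.isinf(last_logits).any().item())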