Open shihy2988 opened 3 weeks ago
I hit the same problem when running inference with qwen2-7b-ov. How can this be fixed? In the LLaVA WeChat group, someone suggested it may be due to the version of the transformers library. One user updated transformers to 4.41.2, and the response then changed to [''], which is still not correct.
I checked the output tensors of the decoder layers and found that after the last decoder layer, the features of the last token were all NaN, which is very weird. Because of this problem, I failed to reproduce the results on the video benchmarks. Does anyone know how to solve this problem? @ZhangYuanhan-AI @Luodian
Here's my reproduced result with ov-7b
My code is:
from operator import attrgetter
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
import torch
import cv2
import numpy as np
from PIL import Image
import requests
import copy
import warnings
from decord import VideoReader, cpu
warnings.filterwarnings("ignore")
# Load the OneVision model
pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation="sdpa")
model.eval()
# Function to extract frames from video
def load_video(video_path, max_frames_num):
    if type(video_path) == str:
        vr = VideoReader(video_path, ctx=cpu(0))
    else:
        vr = VideoReader(video_path[0], ctx=cpu(0))
    total_frame_num = len(vr)
    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
    frame_idx = uniform_sampled_frames.tolist()
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames  # (frames, height, width, channels)
# Load and process video
video_path = "jobs.mp4"
video_frames = load_video(video_path, 16)
print(video_frames.shape) # (16, 1024, 576, 3)
image_tensors = []
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors.append(frames)
# Prepare conversation input
conv_template = "qwen_1_5"
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]
# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])
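One way to narrow down where the NaN values described above first appear is to attach forward hooks to every submodule and record which ones produce non-finite outputs. The following is a minimal sketch, not part of the LLaVA-NeXT code base: `watch_nonfinite` is a hypothetical helper name, and it relies only on the standard `torch.nn.Module.register_forward_hook` API.

```python
from contextlib import contextmanager

import torch
from torch import nn


@contextmanager
def watch_nonfinite(model: nn.Module):
    """Record names of submodules whose output contains NaN or Inf.

    Hypothetical debugging helper. Names are stored in the order in
    which the offending modules first ran, without repeats.
    """
    bad = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            # Outputs may be a tensor, a tuple/list, or a dict-like object.
            if isinstance(output, dict):
                tensors = list(output.values())
            elif isinstance(output, (tuple, list)):
                tensors = list(output)
            else:
                tensors = [output]
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    if name not in bad:
                        bad.append(name)
                    break
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is ""
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        yield bad
    finally:
        # Always detach the hooks so later runs are unaffected.
        for handle in handles:
            handle.remove()
```

With the script above, this could be used as `with watch_nonfinite(model) as bad: model.generate(...)`. The first entry in `bad` would then be the earliest submodule whose output contained NaN or Inf, which may help tell a problem in the vision tower apart from one in the decoder layers.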
Maybe you installed the wrong version of llava, e.g. llava 1.5 instead of our current version? You can try cleaning your conda environment and reinstalling llava.
I've hit the same response issue with 7b. But after I changed the torch version to 2.1.2 and the transformers version to 4.40.0, the response shows correctly.
Loaded LLaVA model: lmms-lab/llava-onevision-qwen2-7b-ov
Loading vision tower: google/siglip-so400m-patch14-384
Model Class: LlavaQwenForCausalLM
['The image shows a radar chart, also known as a spider chart or a star chart, which is used to compare multiple quantitative variables. Each axis represents a different variable, and the values are plotted along these axes. The chart is color-coded to represent different categories or models, with lines connecting the points for each category.\n\nIn this particular radar chart, it appears to be comparing various models or systems across several performance metrics. The labels on the axes suggest that the metrics being compared could include aspects like "MMB-Vet," "Llava-1.5," "VQA," "GQA," "SQA-IMG," "TextVQA," "MME," "BLIP-2," "InstructBLIP," "Pope," and "Qwen-VL-Chat." The numbers along the axes likely represent scores or measurements of performance for each model in those respective categories.\n\nThe purpose of such a chart is to provide a visual summary of how each model performs relative to the others across all the measured attributes. It\'s a useful tool for quickly assessing strengths and weaknesses of different models or systems at a glance.']
Maybe you could use the same requirements.txt as in the main branch.
When I use the model llava-onevision-qwen2-0.5b-si, the response is correct, but when I change the model to llava-onevision-qwen2-7b-si, I get the same response ['']. Can you give some advice? These are my torch and transformers versions:
open_clip_torch 2.26.1
torch 2.1.2
torchvision 0.16.2
hf_transfer 0.1.8
transformers 4.40.0.dev0
Just did a quick run, should be ok with si as well.
Loaded LLaVA model: lmms-lab/llava-onevision-qwen2-7b-si
Loading vision tower: google/siglip-so400m-patch14-384
Model Class: LlavaQwenForCausalLM
["This image is a radar chart that compares the performance of different models on various metrics. The models being compared are BLIP-2, InstructBLIP, and Qwen-VL-Chat. The metrics being evaluated include VQA (Visual Question Answering), QA (Question Answering), GQA (General Question Answering), SQA-IMG (Specific Question Answering with Image), and POPE (Product of Perceptual and Language Evaluation). Each model's performance is represented by a line graph, and the highest score for each metric is indicated by a red dot."]
My environment:
Package Version
------------------------ ------------
accelerate 0.29.3
certifi 2024.7.4
charset-normalizer 3.3.2
einops 0.6.1
filelock 3.15.4
fsspec 2024.6.1
huggingface-hub 0.24.6
idna 3.7
Jinja2 3.1.4
llava 1.7.0.dev0
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.3
numpy 1.22.0
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.20
nvidia-nvtx-cu12 12.1.105
packaging 24.1
pillow 10.3.0
pip 22.0.2
psutil 6.0.0
PyYAML 6.0.2
regex 2024.7.24
requests 2.32.3
safetensors 0.4.4
setuptools 59.6.0
sympy 1.13.2
tokenizers 0.19.1
torch 2.1.2+cu121
torchaudio 2.1.2+cu121
torchvision 0.16.2+cu121
tqdm 4.66.5
transformers 4.40.0
triton 2.1.0
typing_extensions 4.12.2
urllib3 2.2.2
wheel 0.37.1
You are right, thanks. I have solved it.
I changed my transformers from 4.40.0.dev0 to 4.40.0, but the result is still [''], and I have no idea how to solve this.
sad...
I re-downloaded the project and only changed the torch and transformers versions, and it works. Maybe you should compare your env against requirements.txt.
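For comparing an environment against the versions reported as working in this thread (torch 2.1.2 and transformers 4.40.0), a small script along these lines may help. It is only a sketch using the standard-library `importlib.metadata`; `REPORTED_WORKING`, `base_version`, and `compare_env` are hypothetical names, and the pins are taken from the comments above rather than from the repository's requirements file.

```python
from importlib import metadata

# Versions reported as working in this thread (assumption: not read
# from the repository's requirements file).
REPORTED_WORKING = {"torch": "2.1.2", "transformers": "4.40.0"}


def base_version(version):
    """Drop a local build suffix such as '+cu121'."""
    return version.split("+", 1)[0]


def compare_env(expected, get_version=metadata.version):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        matches = installed is not None and base_version(installed) == wanted
        report[package] = (installed, matches)
    return report
```

Calling `compare_env(REPORTED_WORKING)` inside the conda environment returns one entry per package. A development build such as 4.40.0.dev0 is reported as a mismatch on purpose, since the exact-string comparison does not treat it as equal to 4.40.0.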
I re-downloaded this repo and tried transformers versions 4.40.0.dev, 4.40.0, and 4.41.2; the result is still [''].
Some things I did:
All the weights I use are local weights. Below are my changes.
Meta-Llama-3-8B-Instruct: llava/conversation.py, line 387,
tokenizer=AutoTokenizer.from_pretrained("local_path/LLaVA-NeXT/Meta-Llama-3-8B-Instruct")
siglip-so400m-patch14-384: llava-onevision-qwen2-7b-si/config.json, line 176,
ision_tower": "local_path/siglip-so400m-patch14-384",
Then I got an error about a mismatch, which I fixed using https://github.com/LLaVA-VL/LLaVA-NeXT/issues/148#issuecomment-2298549964
After that, the 0.5b model works fine, but the 7b model result is always ['']. Below is the result from the 7b model:
(llava) root@sugon:~/work/project/LLaVA-NeXT# python demo_single_image.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loaded LLaVA model: /root/work/project/LLaVA-NeXT_bak/llava-onevision-qwen2-7b-si
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield
Loading vision tower: /root/work/project/LLaVA-NeXT/siglip-so400m-patch14-384
Loading checkpoint shards: 100%|___________________________________________________________________________________________________________________
Model Class: LlavaQwenForCausalLM
['']
Updating to torch2.1.2 works for me. When I used torch2.0.1 before, 7b answered with many "!!!..."
Updating to torch 2.1.2 probably does not work for me; this is my torch version:
(llava) root@a123:~# pip list|grep torch
open_clip_torch 2.26.1
torch 2.1.2
torchvision 0.16.2
I have tried all the torch and transformers versions suggested, with both the ov and si versions of the 7B model, and in all scenarios I simply get "!". If I turn on do_sample and use a temperature of 0.1, I get an error:
  File "/user/home/.conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2829, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
There seems to be no solution for running the 7B model, but the 0.5B model runs just fine... I would really appreciate an explanation or a fix for this problem, as I can't find one.
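The `torch.multinomial` error above suggests that the logits already contain NaN before sampling. A possible way to confirm this, sketched here under assumptions, is to ask `generate` for per-step scores and look for the first step that contains NaN. `first_nan_step` is a hypothetical helper; `output_scores` and `return_dict_in_generate` are standard Hugging Face generation options, but whether the LLaVA `generate` wrapper forwards them unchanged is an assumption.

```python
import torch


def first_nan_step(scores):
    """Return the index of the first generation step whose scores contain NaN.

    `scores` is a sequence of tensors, one per generated token, as found
    in the `scores` field of a generate() output when output_scores=True.
    Returns None if no step contains NaN. Only NaN is checked, because
    -inf can appear legitimately after logits processing.
    """
    for step, step_scores in enumerate(scores):
        if torch.isnan(step_scores).any():
            return step
    return None


# Assumed usage with the script from the first post (not verified here):
# out = model.generate(input_ids, images=image_tensors, image_sizes=image_sizes,
#                      do_sample=False, max_new_tokens=16, modalities=["video"],
#                      output_scores=True, return_dict_in_generate=True)
# print(first_nan_step(out.scores))
```

If this reports step 0, the NaN values are present from the very first forward pass, which would point at the prefill (image features or decoder weights and dtype) rather than at the sampling settings.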
The 0.5b response is normal but the 7b response is wrong.
With the same image, the only code I changed is:
pretrained = "/home/shihongyu/MMLM_models/lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
but the responses differ.
0.5b response: ['The image shows a conveyor belt in an industrial setting, likely part of a factory or processing plant. The conveyor is moving materials along its path, and there are some mechanical components visible on the left side of the frame.']
7b response: ['!']