Open · AlphaNext opened this issue 4 months ago
I ran into this problem too. It happens because inference.py passes neither the attention mask nor the pad token id. Following the `infer` function in videollama2/__init__.py, the code can be changed to:
```python
input_ids = tokenizer_MMODAL_token(prompt, tokenizer, modal_index, return_tensors='pt').unsqueeze(0).cuda()
attention_masks = input_ids.ne(tokenizer.pad_token_id).long().cuda()

# 3. generate response according to visual signals and prompts.
stop_str = conv.sep if conv.sep_style in [SeparatorStyle.SINGLE] else conv.sep2
# keywords = ["<s>", "</s>"]
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

do_sample = True
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_masks,
        images_or_videos=tensor,
        modal_list=modals,
        do_sample=do_sample,
        temperature=0.2 if do_sample else 0.0,
        max_new_tokens=1024,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        pad_token_id=tokenizer.eos_token_id,
    )
```
I tested this on my side and it works.
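One caveat with the snippet above: `tokenizer.pad_token_id` is often `None` for LLaMA-based tokenizers, in which case `input_ids.ne(tokenizer.pad_token_id)` raises an error. A minimal sketch of a guard (not part of the original fix; `tokenizer` and `input_ids` are the variables from the snippet above):

```python
# Guard against a missing pad token (assumption: tokenizer and input_ids come
# from the snippet above). For a single, unpadded prompt an all-ones mask is
# equivalent to masking out pad tokens.
if tokenizer.pad_token_id is None:
    attention_masks = torch.ones_like(input_ids)
else:
    attention_masks = input_ids.ne(tokenizer.pad_token_id).long()
```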
The inference code I used is below; I only changed `questions` and the video `paths`. The actual question was: "Generate a brief and accurate for this video"
```python
import torch
import transformers

import sys
sys.path.append('./')
from videollama2.conversation import conv_templates
from videollama2.constants import DEFAULT_MMODAL_TOKEN, MMODAL_TOKEN_INDEX
from videollama2.mm_utils import get_model_name_from_path, tokenizer_MMODAL_token, process_video, process_image
from videollama2.model.builder import load_pretrained_model


def inference():
    # Video Inference
    paths = ['assets/cat_and_chicken.mp4']
    questions = ['What animals are in the video, what are they doing, and how does the video feel?']
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.
    modal_list = ['video']

    # Image Inference
    paths = ['assets/sora.png']
    questions = ['What is the woman wearing, what is she doing, and how does the image feel?']
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.
    modal_list = ['image']

    # 1. Initialize the model.
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name)
    model = model.to('cuda:0')
    conv_mode = 'llama_2'

    # 2. Visual preprocess (load & transform image or video).
    if modal_list[0] == 'video':
        tensor = process_video(paths[0], processor, model.config.image_aspect_ratio).to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["VIDEO"]
        modal_token_index = MMODAL_TOKEN_INDEX["VIDEO"]
    else:
        tensor = process_image(paths[0], processor, model.config.image_aspect_ratio)[0].to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["IMAGE"]
        modal_token_index = MMODAL_TOKEN_INDEX["IMAGE"]
    tensor = [tensor]

    # 3. text preprocess (tag process & generate prompt).
    question = default_mm_token + "\n" + questions[0]
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_MMODAL_token(prompt, tokenizer, modal_token_index, return_tensors='pt').unsqueeze(0).to('cuda:0')

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images_or_videos=tensor,
            modal_list=modal_list,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
        )

    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    print(outputs[0])


if __name__ == "__main__":
    inference()
```
The suspicious log that is printed repeatedly looks like this:
```
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
```
(the warning above is printed over and over again during generation)
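For reference, this warning comes from `model.generate()` being called with only `input_ids`: transformers then has to infer the attention mask and falls back to `eos_token_id` (2) as the pad token, as the log itself says. A minimal sketch of the corresponding change for the script above (same variable names; the mask is built with `torch.ones_like` here since the single prompt is unpadded, a slight variation on the fix quoted earlier in the thread):

```python
# Pass an explicit attention mask and pad_token_id so generate() does not
# have to guess them; this is what silences the warning above.
attention_masks = torch.ones_like(input_ids)  # single unpadded prompt -> all ones

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_masks,
        images_or_videos=tensor,
        modal_list=modal_list,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=1024,
        use_cache=True,
        pad_token_id=tokenizer.eos_token_id,
    )
```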
Hello, I'm a PhD student from ZJU and I also use VideoLLaMA2 in my own research. We have created a WeChat group to discuss VideoLLaMA2 issues and help each other; would you like to join us? Please contact me: WeChat number == LiangMeng19357260600, phone number == +86 19357260600, e-mail == liangmeng89@zju.edu.cn.