OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

InternVideo2_chat_8B_HD cannot load llm properly? #154

Closed: sszzsupersupersupersuper closed this issue 2 months ago

sszzsupersupersupersuper commented 2 months ago

I was trying to run the demo from the model card at https://huggingface.co/OpenGVLab/InternVideo2_chat_8B_HD,

but I got the warning: "Some weights of the model checkpoint at my_local_model_path/ were not used when initializing InternVideo2_VideoChat2: ['lm.base_model.model.lm_head.weight', 'lm.base_m..."

and the demo's output ended up being nothing but "\\" characters. I could not find anything that might cause this issue. Has anyone hit the same problem? How did you solve it?

yinanhe commented 2 months ago

If it is convenient, please tell me your version of peft. We recommend using 0.5.0
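
A quick way to check the installed version (a minimal sketch; it just reads the standard `__version__` attribute):

import peft

# Warn if the installed peft version is not the recommended one.
if peft.__version__ != "0.5.0":
    print(f"Found peft {peft.__version__}; this model's code expects 0.5.0. "
          "Try: pip install peft==0.5.0")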

sszzsupersupersupersuper commented 2 months ago

Thanks for the quick response! Yes, I was using peft 0.5.0. The cause turned out to be the path to the pretrained BERT model: after I changed it to a local directory, it worked like a charm. Thanks again!
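
For anyone else hitting this, a minimal sketch of the workaround using the standard Hub download API. The `bert-base-uncased` repo id and the config field your model reads are assumptions here; check your model's config for the BERT path it actually uses:

from huggingface_hub import snapshot_download

# Download the BERT checkpoint into a local directory so the model can
# load it from disk instead of resolving a remote path at runtime.
local_bert = snapshot_download(
    repo_id="bert-base-uncased",      # assumed checkpoint; match your config
    local_dir="./bert-base-uncased",
)
print(f"BERT checkpoint saved to {local_bert}")
# Then point the BERT path in the model's config at "./bert-base-uncased"
# before calling AutoModel.from_pretrained.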

Divyanshupy commented 2 months ago

Hey, I am facing the same issue, and the response comes back as null characters. What did you change, @sszzsupersupersupersuper? The code is below:


import os

# Read the Hugging Face token from the environment; fall back to a
# manually pasted token if HF_TOKEN is not set.
try:
    token = os.environ['HF_TOKEN']
except KeyError:
    print("paste your hf token here!")
    token = "entertoken"
    os.environ['HF_TOKEN'] = token
import torch
# import gradio as gr
# from gradio.themes.utils import colors, fonts, sizes

from transformers import AutoTokenizer, AutoModel

# ========================================
#             Model Initialization
# ========================================

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    trust_remote_code=True,
    use_fast=False,
    token=token)
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True)
if torch.cuda.is_available():
    model = model.cuda()

from decord import VideoReader, cpu
from PIL import Image
import numpy as np
import decord
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.transforms import PILToTensor
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
decord.bridge.set_bridge("torch")

# ========================================
#          Define Utils
# ========================================
def get_index(num_frames, num_segments):
    # Uniformly sample num_segments frame indices by taking the midpoint
    # of each equal-length segment of the video.
    seg_size = float(num_frames - 1) / num_segments
    start = int(seg_size / 2)
    offsets = np.array([
        start + int(np.round(seg_size * idx)) for idx in range(num_segments)
    ])
    return offsets

def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
    decord.bridge.set_bridge("torch")
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    num_frames = len(vr)
    frame_indices = get_index(num_frames, num_segments)

    # ImageNet normalization statistics.
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)

    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(resolution),
        transforms.Normalize(mean, std)
    ])

    frames = vr.get_batch(frame_indices)
    frames = frames.permute(0, 3, 1, 2)
    frames = transform(frames)

    T_, C, H, W = frames.shape

    if return_msg:
        fps = float(vr.get_avg_fps())
        sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
        # " " should be added in the start and end
        msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
        return frames, msg
    else:
        return frames

video_path = "example1.mp4"
# sample uniformly 8 frames from the video
video_tensor = load_video(video_path, num_segments=8, return_msg=False)
video_tensor = video_tensor.to(model.device)

chat_history= []
response, chat_history = model.chat(
    tokenizer, '', 'describe the action step by step.',
    media_type='video', media_tensor=video_tensor,
    chat_history=chat_history, return_history=True,
    generation_config={'do_sample': False})
print(response)
# The video shows a woman performing yoga on a rooftop with a beautiful view of the mountains in the background. She starts by standing on her hands and knees, then moves into a downward dog position, and finally ends with a standing position. Throughout the video, she maintains a steady and fluid movement, focusing on her breath and alignment. The video is a great example of how yoga can be practiced in different environments and how it can be a great way to connect with nature and find inner peace.

response, chat_history = model.chat(
    tokenizer, '', 'What is man wearing', media_type='video',
    media_tensor=video_tensor, chat_history=chat_history,
    return_history=True, generation_config={'do_sample': False})
print(response)
# The woman in the video is wearing a black tank top and grey yoga pants.
yinanhe commented 2 months ago

Hi @Divyanshupy, please check your peft version and make sure it is 0.5.0.

Divyanshupy commented 2 months ago

Wow!! It worked like a charm. Thank you again for the great work!!