VectorSpaceLab / Video-XL

🔥🔥 First-ever hour-scale video understanding models
Apache License 2.0

How should batch inference be set up for the demo? #22

Open ShbGao-ProMax opened 10 hours ago

ShbGao-ProMax commented 10 hours ago

I have a batch of videos I want to run inference on.

I wrote a batch script following the demo's pattern, but after a few rounds of output the model throws an error:

```
return forward_call(*args, **kwargs)
  File "/mnt/VLM/Video_XL/videoxl/videoxl/model/language_model/llava_qwen.py", line 152, in forward
    q_embed = (q * q_cos) + (rotate_half(q) * q_sin)
RuntimeError: The size of tensor a (0) must match the size of tensor b (46454) at non-singleton dimension 2
```

I am running inference on a single A6000. I noticed that even though the input video differs each round, the result is identical, and after a few rounds it errors out.

I suspect the history is not being cleared after each inference. Could you offer some help?

My batch demo script is as follows:

```python
# Imports as in the official demo (restored here so the script is self-contained).
import os

import numpy as np
import torch
from decord import VideoReader, cpu

from videoxl.constants import IMAGE_TOKEN_INDEX, TOKEN_PERFRAME
from videoxl.mm_utils import tokenizer_image_token, transform_input_id
from videoxl.model.builder import load_pretrained_model

if __name__ == '__main__':
    model_path = "/mnt/checkpoint/VideoXL/VideoXL_weight_8"
    video_folder = "/mnt/dataset/test_video_clip/1"

    video_list = [video_folder + '/' + i for i in os.listdir(video_folder)]

    max_frames_num = 100  # you can raise this to several thousand as long as your GPU memory can handle it :)
    gen_kwargs = {"do_sample": False, "temperature": 1, "top_p": None,
                  "num_beams": 1, "use_cache": False, "max_new_tokens": 1024}
    tokenizer, model, image_processor, _ = load_pretrained_model(
        model_path, None, "llava_qwen", device_map="cuda:0")

    model.config.beacon_ratio = [8]  # delete this line for random compression among {2, 4, 8} ratios

    for video_path in video_list:
        print(video_path)
        # video input
        prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nPlease describe the video.<|im_end|>\n<|im_start|>assistant\n"
        input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
        vr = VideoReader(video_path, ctx=cpu(0))
        total_frame_num = len(vr)
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frames = vr.get_batch(frame_idx).asnumpy()
        video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16)

        beacon_skip_first = (input_ids == IMAGE_TOKEN_INDEX).nonzero(as_tuple=True)[1].item()
        num_tokens = TOKEN_PERFRAME * max_frames_num
        beacon_skip_last = beacon_skip_first + num_tokens

        with torch.inference_mode():
            output_ids = model.generate(input_ids, images=[video_tensor], modalities=["video"],
                                        beacon_skip_first=beacon_skip_first,
                                        beacon_skip_last=beacon_skip_last, **gen_kwargs)

            if IMAGE_TOKEN_INDEX in input_ids:
                transform_input_ids = transform_input_id(input_ids, num_tokens, model.config.vocab_size - 1)
                output_ids = output_ids[:, transform_input_ids.shape[1]:]
                outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
        print("#####################################")
        print(outputs)
        print("#####################################")
```
ShbGao-ProMax commented 9 hours ago

I tried reloading the model every round and it went a bit more smoothly, but after inferring a dozen or so videos I hit `TypeError: sequence item 109: expected str instance, NoneType found`. Do you have any suggestions?

shuyansy commented 7 hours ago

Hello, you can add this line inside the for loop:

```python
model.memory.reset()
```

Reloading the model every round should work in theory, so it may be a problem with that particular data item. Could you test the problematic item on its own?
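The symptom reported above (identical outputs for different videos, then a shape mismatch) is what per-sequence state surviving across loop iterations looks like. The sketch below is a minimal stand-in, not the real Video-XL classes: `BeaconMemory` and `MockModel` are hypothetical mocks used only to illustrate why calling `reset()` at the top of each iteration isolates the videos from one another.

```python
class BeaconMemory:
    """Mock of a stateful cache that persists across generate() calls."""
    def __init__(self):
        self.cached_tokens = []

    def reset(self):
        self.cached_tokens = []


class MockModel:
    """Stand-in for a model whose forward pass appends to persistent memory."""
    def __init__(self):
        self.memory = BeaconMemory()

    def generate(self, tokens):
        # State from previous videos leaks into this call unless reset() is called.
        self.memory.cached_tokens.extend(tokens)
        return list(self.memory.cached_tokens)


model = MockModel()
videos = [["a1", "a2"], ["b1", "b2"]]

# Without reset: the second video's output still contains the first video's state.
stale = [model.generate(v) for v in videos]
assert stale[1] == ["a1", "a2", "b1", "b2"]

# With reset at the top of each iteration, every video is processed independently.
clean = []
for v in videos:
    model.memory.reset()
    clean.append(model.generate(v))
assert clean[1] == ["b1", "b2"]
```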