AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

inference on video #182

HeChengHui opened this issue 7 months ago

HeChengHui commented 7 months ago

I am trying to use a video frame as the input. However, I found that the code takes an image path as the argument fed to the runner. Is it possible to pass the frame directly instead of having to save each frame as an image first and then use it as input?

KingBoyAndGirl commented 7 months ago

I have the same need. Have you solved it?

LLH-Harward commented 7 months ago

I have the same question. Is there any solution available? Thank you.

tomgotjack commented 6 months ago

@LLH-Harward @KingBoyAndGirl I work around this problem with the code below: I create a temporary file path tmp_filename, so the runner still receives a file path without my having to manage saved frames on disk by hand.

Read the video frames with OpenCV, which yields each frame as a NumPy array, then convert the NumPy array to a PIL image object:

    import tempfile
    import cv2
    from PIL import Image

    # OpenCV frames are BGR; convert to RGB before handing to PIL.
    pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    with tempfile.NamedTemporaryFile(delete=False, suffix='.png') as tmp_file:
        # Save the frame to a temporary PNG file.
        pil_image.save(tmp_file, format='PNG')
        tmp_filename = tmp_file.name
    texts = [[t.strip()] for t in text.split(',')] + [[' ']]
    data_info = dict(img_id=0, img_path=tmp_filename, texts=texts)
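
The round trip through a PNG on disk can also be avoided entirely by swapping the file-based loading transform in the test pipeline for an in-memory one. A minimal sketch, assuming an mmdet 3.x-style config (cfg, text, and frame are placeholders for the objects discussed above); mmdet ships a LoadImageFromNDArray transform that reads the array from the img key instead of img_path:

    from mmcv.transforms import Compose

    # Swap the file-based loader for the in-memory variant so each
    # frame can be passed directly as a NumPy array under 'img'.
    test_pipeline_cfg = cfg.test_dataloader.dataset.pipeline
    test_pipeline_cfg[0].type = 'mmdet.LoadImageFromNDArray'
    test_pipeline = Compose(test_pipeline_cfg)

    texts = [[t.strip()] for t in text.split(',')] + [[' ']]
    data_info = dict(img=frame, img_id=0, texts=texts)  # frame: BGR ndarray from OpenCV
    data_info = test_pipeline(data_info)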

wondervictor commented 6 months ago

Hi all (@HeChengHui, @KingBoyAndGirl, @LLH-Harward, @tomgotjack), the latest update supports video inference. You can give it a try! See demo/video_demo.py.
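
For reference, the script takes the config, checkpoint, video path, and comma-separated prompt positionally, as in the command quoted later in this thread (paths below are placeholders):

    python demo/video_demo.py path/to/config.py path/to/weights.pth input.mp4 "person,car" --out output.mp4 --score-thr 0.3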

tomgotjack commented 6 months ago

@wondervictor I ran the deploy/onnx_demo.py you provided. When execution reaches

    for frame in track_iter_progress(video_reader):

the following error is raised:

    Traceback (most recent call last):
      File "E:\YOLO\YOLO-World\video_demo.py", line 148, in <module>
        main()
      File "E:\YOLO\YOLO-World\video_demo.py", line 113, in main
        for frame in track_iter_progress(video_reader):
      File "D:\miniconda3\envs\yolo\lib\site-packages\mmengine\utils\progressbar.py", line 240, in track_iter_progress
        raise TypeError(
    TypeError: "tasks" must be a tuple object or a sequence object, but got <class 'mmcv.video.io.VideoReader'>

I replaced it with:

    for frame in video_reader:

and the code ran successfully, but very slowly: for a 4-minute, 5690-frame 1080P video, inference took 2054.75 seconds, i.e. about 34 minutes. Is there any way to improve efficiency?

LLH-Harward commented 6 months ago

You can change it like this:

    import sys
    frames = [frame for frame in video_reader]
    for frame in track_iter_progress(frames, file=sys.stdout):

My 10-second video took 116 seconds to run.
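
Materializing every frame in a list works, but it holds the entire video in memory. mmengine's track_iter_progress also accepts a (tasks, total) tuple, which is what the TypeError above hints at, so a sketch that keeps the progress bar while still streaming frames would be:

    import sys
    from mmengine.utils import track_iter_progress

    # Pass (iterable, total) so the progress bar knows the length
    # without loading all frames into memory first.
    for frame in track_iter_progress((video_reader, len(video_reader)), file=sys.stdout):
        ...  # run inference on `frame`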

Running the v2-x model provided by the inference library is fast, but it does not support the newly released weights.

tomgotjack commented 6 months ago

@LLH-Harward Hi, how do you run the v2-x model provided by the inference library? I am using my own fine-tuned model here.

LLH-Harward commented 6 months ago

Hi, you can refer to this Space, which implements it with supervision + inference. However, I have not found a way to load my own fine-tuned model with inference; if you find one, please let me know as well. https://huggingface.co/spaces/SkalskiP/YOLO-World/tree/main

tomgotjack commented 6 months ago

@LLH-Harward Thanks, I will take a look at that later. For now I have built a simple interface that can load a video or use the webcam, although the resolution is only 240P. The result looks like this: https://www.bilibili.com/video/BV14T421X72d/?spm_id_from=333.1365.list.card_archive.click&vd_source=0c335752a9ae5c749d91670cca8575ac

LLH-Harward commented 6 months ago

Got it. May I ask why the resolution is currently limited to 240P?

tomgotjack commented 6 months ago

Inference speed depends on image resolution. In my tests, a 240P frame takes 0.09 s to infer, while a 1080P frame takes 0.33 s. Since I want to use the webcam, inference has to run in real time, which makes speed critical. At 240P I can manage roughly 10 frames per second, and combined with frame skipping that gives a barely watchable result; at higher resolutions it is too laggy to watch. My GPU is a 2060; with a better GPU, inference would be faster and the resolution could be raised.
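
As a rough illustration of that trade-off, downscaling each captured frame and skipping frames before inference might look like this (capture index, target size, and skip interval are arbitrary placeholders):

    import cv2

    cap = cv2.VideoCapture(0)          # default webcam
    skip = 3                           # run inference on every 3rd frame
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        if idx % skip:
            continue                   # drop frames to keep up with real time
        small = cv2.resize(frame, (426, 240))  # ~240P input for faster inference
        # ... run YOLO-World inference on `small` here ...
    cap.release()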

LLH-Harward commented 6 months ago

Understood, thanks.

wondervictor commented 6 months ago

@LLH-Harward @tomgotjack I will provide some solutions later to optimize the speed of this part.

LLH-Harward commented 6 months ago

OK, thank you very much. I also have a new question: when I run inference with the command below, the detections in the resulting output.mp4 do not follow my input text prompt "person,book,laptop,bottle,ipad,pen,phone,bag"; instead it seems to use LVIS or Objects365 categories (many detections come out as the lamp class).

Where is the problem? Do I need to modify the relevant JSON in the config file to make YOLO-World run inference with the text I provide?

    python video_demo.py D:\YOLO-World-master\configs\pretrain\yolo_world_v2_x_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py D:\YOLO-World-master\pretrained_weights\yolo_world_v2_x_obj365v1_goldg_cc3mlite_pretrain_1280ft-14996a36.pth D:\YOLO-World-master\data\demo_cut.mp4 "person,book,laptop,bottle,ipad,pen,phone,bag" --out output.mp4 --score-thr 0.3

tomgotjack commented 6 months ago

@LLH-Harward I ran into the same problem, but I was so busy measuring speed that I forgot about it. I tested a video with two classes, person and car; the detected objects were correct, but the output labels were ["person"] and ["bicycle"]. Those happen to be the first two of the 80 COCO classes, so you might look for the cause there.

LLH-Harward commented 6 months ago

It seems to be a problem with the visualizer: the visualizer takes its dataset_meta from the checkpoint.

    visualizer = VISUALIZERS.build(model.cfg.visualizer)
    # the dataset_meta is loaded from the checkpoint and
    # then passed to the model in init_detector
    visualizer.dataset_meta = model.dataset_meta

And in mmdet\visualization\local_visualizer.py I see:

    classes = self.dataset_meta.get('classes', None)

So the classes used by the visualizer are taken directly from the pretrained dataset meta rather than from the given texts. @wondervictor @tomgotjack
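
If that diagnosis is correct, one untested workaround is to overwrite the classes entry in the visualizer's dataset_meta with the words from the prompt before drawing, so the labels match the input texts (texts here is the prompt list built the same way as earlier in this thread):

    # texts is e.g. [['person'], ['book'], ..., [' ']]; drop the trailing
    # padding entry and use the prompt words as the class names.
    visualizer.dataset_meta = dict(classes=[t[0] for t in texts[:-1]])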

wondervictor commented 6 months ago

The Visualizer can be changed; also, I plan to stop using the Visualizer in an upcoming update.

LLH-Harward commented 6 months ago

OK, looking forward to your next update!

HeChengHui commented 5 months ago

@wondervictor Thank you for adding video support!

I have questions regarding model.reparameterize(texts):

  1. Do I have to run this command for every frame, or does the model stay configured for that text?
  2. If I have two classes to detect, A & B, do I have to run that command each time I detect a different class?
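
Judging from how demo/video_demo.py drives it, reparameterize(texts) appears to bake the text embeddings into the model once, so a plausible pattern is to call it a single time with every class of interest and only call it again when the class list changes. A minimal sketch (not verified against the internals; run_inference is a hypothetical per-frame helper):

    # Encode all classes of interest once, before the frame loop.
    texts = [['A'], ['B'], [' ']]
    model.reparameterize(texts)

    for frame in frames:
        # No per-frame reparameterize call should be needed; the embeddings
        # stay in place until reparameterize is called with a different list.
        results = run_inference(model, frame)  # hypothetical helper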