LLH-Harward opened 5 months ago
Could you provide more details/clues about why the HF version (L-640) of YOLO-World performs better than the GitHub version (X-1280)? BTW, the Hugging Face demo only uses L-640.
Thank you for your response. I apologize if I wasn't clear earlier. I meant to point out that on Hugging Face, the models with 1280 input seem more effective at detecting small objects. While Roboflow Inference and Supervision do support video processing, they currently only offer base models such as v2-l and v2-x, without access to the other 1280 models. Could you kindly tell me whether there is a way to use custom weights (for instance, weights obtained from training) directly for video inference?
Sure, I've seen many requests for video inference, so I'll raise its priority. I'll let you know once it's done; it shouldn't take long.
Thank you so much.
Hi @LLH-Harward, the latest update adds support for video inference. Please check demo/video_demo.py
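For anyone who wants to run their own checkpoint on a video, below is a condensed, unofficial sketch of the kind of per-frame loop demo/video_demo.py implements. All paths are placeholders, and the pipeline handling and texts format are assumptions paraphrased from the repo's demo scripts, so treat this as an outline only and check demo/video_demo.py for the authoritative version.

```python
# Rough outline of per-frame inference with a custom checkpoint.
# Paths are placeholders; pipeline handling is an assumption based on
# the repo's demo scripts, not the exact code of video_demo.py.
import mmcv
import torch
from mmcv.transforms import Compose
from mmdet.apis import init_detector

config_file = "path/to/your_config.py"          # e.g. an x-1280 pretrain config
checkpoint = "path/to/your_custom_weights.pth"  # custom-trained weights work too
model = init_detector(config_file, checkpoint, device="cuda:0")

# Open-vocabulary class prompts; the demo scripts append a blank entry at the end.
texts = [[t.strip()] for t in "people,laptop,book".split(",")] + [[" "]]

# Rebuild the test pipeline so it accepts in-memory frames instead of files.
cfg = model.cfg.copy()
cfg.test_dataloader.dataset.pipeline[0].type = "mmdet.LoadImageFromNDArray"
test_pipeline = Compose(cfg.test_dataloader.dataset.pipeline)

video_reader = mmcv.VideoReader("input.mp4")
for frame in video_reader:
    data = test_pipeline(dict(img=frame, img_id=0, texts=texts))
    batch = dict(inputs=data["inputs"][None],
                 data_samples=[data["data_samples"]])
    with torch.no_grad():
        result = model.test_step(batch)[0]
    # result.pred_instances holds the bboxes / labels / scores for this frame
```

Since init_detector accepts any compatible checkpoint, the same loop also answers the custom-weights question above.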
Thank you so much! I'll try it later.
Hello, when I used video_demo.py for inference, the following error occurred: it reports that "data/coco/lvis/lvis_v1_minival_inserted_image_name.json" does not exist. I found the relevant entry in the model's configuration file "yolo_world_v2_x_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py". How can I solve this problem? Could you give some guidance? My environment: torch+cu118==2.1.1, torchvision+cu118==0.16.1, mmcv==2.0.0rc4, mmdet==3.0.0, mmengine==0.10.3, mmyolo==0.6.0
BUG: inputs: python video_demo.py D:\YOLO-World-master\configs\pretrain\yolo_world_v2_x_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py D:\YOLO-World-master\pretrained_weights\yolo_world_v2_x_obj365v1_goldg_cc3mlite_pretrain_1280ft-14996a36.pth D:\YOLO-World-master\result.mp4 "people,laptop,book,bottle,pen,phone" --out out111
bin C:\Users\714\AppData\Roaming\Python\Python39\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll
Loads checkpoint by local backend from path: D:\YOLO-World-master\pretrained_weights\yolo_world_v2_x_obj365v1_goldg_cc3mlite_pretrain_1280ft-14996a36.pth
Traceback (most recent call last):
File "D:\YOLO-World-master\demo\video_demo.py", line 109, in
An update: after I supplied the json file at the path required by the configuration file "yolo_world_v2_x_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py", and changed "for frame in track_iter_progress(video_reader):" in video_demo.py to "for frame in video_reader:", the code now runs normally and produces results (see the excerpt below).
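In code, the change is just the loop header; this excerpt paraphrases the relevant part of demo/video_demo.py, with the surrounding code omitted:

```python
import mmcv

video_reader = mmcv.VideoReader("result.mp4")

# before: wrapped in the progress helper
# for frame in track_iter_progress(video_reader):
# after: plain iteration works in this environment
for frame in video_reader:
    pass  # run inference on `frame` and write the visualized frame out
```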
However, the running speed is quite slow. Is it because the mmcv framework loads slowly?
The visualization takes time to draw the detected objects on each frame.
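To confirm how each frame's latency splits between the model forward pass and the drawing step, a hypothetical timing helper like the following could be dropped into the loop (the wrapped calls are placeholders, not the demo's actual function names):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Prints the wall-clock time spent inside the `with` block.
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# Inside the frame loop of video_demo.py:
# with timed("inference"):
#     result = run_inference(model, frame)   # placeholder call
# with timed("visualization"):
#     drawn = draw_result(frame, result)     # placeholder call
```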
OK, thank you for your help!
Hello, thank you for your outstanding work! I would like to perform video inference directly with YOLO-World. I have used Roboflow Inference and Supervision, but they only provide a few benchmark models, such as l, x, v2-x, and v2-l. For my purposes, the YOLO-World Hugging Face models (https://huggingface.co/spaces/stevengrove/YOLO-World) perform better than the standard inference ones, e.g. "yolo_world/v2-x". I would like to use the weights from Hugging Face, such as x-1280, for inference. Could you please provide the necessary support? Or is it possible to input videos directly? I would greatly appreciate it.