AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

How to detect only specified classes by specifying the text #181

Open Hongyuan-Liu opened 4 months ago

Hongyuan-Liu commented 4 months ago

I exported the ONNX model with the following command:

    python deploy/export_onnx.py \
        ./configs/pretrain/yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py \
        ./weights/yolo_world_v2_l_obj365v1_goldg_cc3mlite_pretrain-ca93cd1f.pth \
        --custom-text data/texts/obj365v1_class_texts.json \
        --opset 12

The inference code is as follows:

    import copy

    import cv2
    import numpy as np
    import onnxruntime


    def letterbox(src_image, dst_width=640, dst_height=640, color=114):
        src_height, src_width, _ = src_image.shape
        padding_x = 0
        padding_y = 0
        scale = 1.0

        value = (color, color, color)

        scale_width = src_width / dst_width
        scale_height = src_height / dst_height

        if scale_width >= scale_height:
            scale = scale_width
            tmp_h = scale * dst_height
            padding_y = int(abs((tmp_h - src_height) / 2.0))
            padding_image = cv2.copyMakeBorder(src_image, padding_y, padding_y, 0, 0,
                                               cv2.BORDER_CONSTANT, value)
        else:  # scale_height > scale_width
            scale = scale_height
            tmp_w = scale * dst_width
            padding_x = int(abs((tmp_w - src_width) / 2.0))
            padding_image = cv2.copyMakeBorder(src_image, 0, 0, padding_x, padding_x,
                                               cv2.BORDER_CONSTANT, value)

        dst_image = cv2.resize(padding_image, (dst_width, dst_height),
                               interpolation=cv2.INTER_LINEAR)

        return dst_image, padding_x, padding_y, scale

    def convert_result(src_image, boxes, offset_x, offset_y, scale_val):
        src_width = src_image.shape[1]
        src_height = src_image.shape[0]

        new_boxes = []
        for box in boxes:
            x1, y1, x2, y2 = box[0], box[1], box[2], box[3]

            # map from the 640x640 letterboxed space back to the source image
            x1 = int(x1 * scale_val - offset_x)
            y1 = int(y1 * scale_val - offset_y)
            x2 = int(x2 * scale_val - offset_x)
            y2 = int(y2 * scale_val - offset_y)

            # clamp to the image borders
            x1 = min(max(x1, 1), src_width - 1)
            y1 = min(max(y1, 1), src_height - 1)
            x2 = min(max(x2, 1), src_width - 1)
            y2 = min(max(y2, 1), src_height - 1)

            new_boxes.append([x1, y1, x2, y2])

        return new_boxes

    def draw_result(src_image, labels, bboxes, scores):
        src_image_h, src_image_w, _ = src_image.shape
        for label, box, score in zip(labels, bboxes, scores):
            if label == -1:
                continue
            x1, y1, x2, y2 = list(map(int, box))
            np.random.seed(int(label) + 2000)
            box_color = (np.random.randint(0, 255), np.random.randint(0, 255),
                         np.random.randint(0, 255))
            cv2.rectangle(src_image, (x1, y1), (x2, y2), box_color,
                          max(int((src_image_w + src_image_h) / 1000), 2), cv2.LINE_AA)
            content = str(label) + ' ' + '{0:.3f}'.format(score)
            font_scale = round(0.002 * ((x2 - x1) + (y2 - y1)) / 2) + 1
            text_size = cv2.getTextSize(content, 0, fontScale=font_scale / 3, thickness=1)[0]
            cv2.rectangle(src_image, (x1 + 2, y1 + 2),
                          (x1 + text_size[0] + 3, y1 + text_size[1] + 5),
                          (0, 0, 0), cv2.FILLED, cv2.LINE_AA)
            cv2.putText(src_image, content, (x1 + 1, y1 + 16), 0, font_scale / 3,
                        [255, 255, 255], thickness=1, lineType=cv2.LINE_AA)

        return src_image

    if __name__ == '__main__':
        # onnx_file = 'yolow-l.onnx'
        onnx_file = './work_dir/yolo_world_v2_l_obj365v1_goldg_cc3mlite_pretrain-ca93cd1f.onnx'
        image_file = 'data/images/bus.jpg'
        save_image_file = './result.jpg'

        src_image = cv2.imread(image_file)
        output_image = copy.deepcopy(src_image)
        image, offset_x, offset_y, scale_val = letterbox(src_image)

        image = image.astype(np.float32) / 255.0
        image = np.transpose(image, (2, 0, 1))
        image = np.expand_dims(image, axis=0)

        session = onnxruntime.InferenceSession(onnx_file)
        input_name = session.get_inputs()[0].name
        output_names = [o.name for o in session.get_outputs()]
        outputs = session.run(output_names, {input_name: image})

        print(outputs)

        num_dets = outputs[0][0][0]
        bboxes = outputs[1][0]
        scores = outputs[2][0]
        labels = outputs[3][0]

        bboxes = convert_result(output_image, bboxes, offset_x, offset_y, scale_val)
        result_image = draw_result(output_image, labels, bboxes, scores)
        cv2.imwrite(save_image_file, result_image)

Image result: (result image attached)

What I'd like to know is: how can I specify the text so that only the specified classes are detected, instead of getting results for every class? Does the official ONNX export currently only support outputting all classes? In a real project, how should this be applied? Should the results be filtered to keep only the specified classes?
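If filtering is the intended approach, I imagine it would look something like this (a minimal sketch based on my script above; the class indices here are made up and would have to be looked up in the vocabulary JSON used at export time):

    import numpy as np

    # hypothetical: indices of the classes of interest within the export-time
    # vocabulary (e.g. obj365v1_class_texts.json)
    wanted_class_ids = {0, 5, 56}

    keep = np.array([int(l) in wanted_class_ids for l in labels], dtype=bool)
    kept_bboxes = [b for b, k in zip(bboxes, keep) if k]
    kept_scores = scores[keep]
    kept_labels = labels[keep]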

wufei-png commented 4 months ago

Take a look at the reparameterize function.
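Roughly, the idea (paraphrasing, not the exact code in this repository) is to encode the class texts once, offline, and write the resulting embeddings into the model, so the exported graph only needs the image input. A sketch, assuming reparameterize takes the nested class-text list from the JSON:

    import json

    # sketch only: `model` stands for the loaded YOLO-World detector; check
    # the exact reparameterize signature against the repository code
    with open('data/texts/obj365v1_class_texts.json') as f:
        texts = json.load(f)   # nested list, one inner list of phrases per class

    model.reparameterize(texts)   # bake the text embeddings into the model
    # after this, the ONNX export only needs a dummy image as input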

Hongyuan-Liu commented 4 months ago

I saw it. That function feeds the text information into the model in advance. Can it also be called during ONNX inference?

wufei-png commented 4 months ago

No, it can't. You would have to modify the code. By default, the exported ONNX model only takes the image as input; you would need to add the text prompt as an input to achieve what you described:

This function feeds the text information into the model in advance, so can it also be called during ONNX inference?

Hongyuan-Liu commented 4 months ago

So when exporting to ONNX, fake_input would include both the image and the text, and both would be fed in together for the export:

    torch.onnx.export(
        deploy_model,
        fake_input,
        f,
        input_names=['images'],
        output_names=output_names,
        opset_version=args.opset)

But why doesn't the official export do it this way?

wondervictor commented 4 months ago

Hi @Hongyuan-Liu, currently, we only support exporting ONNX with a customized vocabulary for downstream detection tasks without training. Supporting text prompts for the ONNX model requires the text encoder. Ideally, we can export a larger ONNX model with a text encoder, but currently, it's not in the plan.

wondervictor commented 4 months ago

@Hongyuan-Liu, you can try to export the ONNX model with CLIP. There are some open-source works which export CLIP to ONNX. We would appreciate it if you finish it with a kind pull request.
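As a starting point, a minimal sketch of exporting just the CLIP text encoder to ONNX with Hugging Face transformers might look like the following (the wrapper, file names, and dynamic axes are assumptions, not code from this repository):

    import torch
    from transformers import CLIPTokenizer, CLIPTextModelWithProjection

    class TextEncoder(torch.nn.Module):
        """Thin wrapper so the exported graph returns a plain tensor."""
        def __init__(self, model):
            super().__init__()
            self.model = model

        def forward(self, input_ids, attention_mask):
            out = self.model(input_ids=input_ids, attention_mask=attention_mask)
            return out.text_embeds  # (num_texts, embed_dim)

    name = 'openai/clip-vit-base-patch32'
    tokenizer = CLIPTokenizer.from_pretrained(name)
    encoder = TextEncoder(CLIPTextModelWithProjection.from_pretrained(name)).eval()

    dummy = tokenizer(['person', 'bus'], padding=True, return_tensors='pt')
    torch.onnx.export(
        encoder,
        (dummy['input_ids'], dummy['attention_mask']),
        'clip_text_encoder.onnx',
        input_names=['input_ids', 'attention_mask'],
        output_names=['text_embeds'],
        dynamic_axes={'input_ids': {0: 'num_texts', 1: 'seq_len'},
                      'attention_mask': {0: 'num_texts', 1: 'seq_len'},
                      'text_embeds': {0: 'num_texts'}},
        opset_version=12)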

Hongyuan-Liu commented 4 months ago

Based on what you said, my understanding is: this project does detection over a fixed custom vocabulary. When deploying, all the custom class texts are loaded into the model via reparameterize, and then the ONNX is exported, so the ONNX model supports detection of exactly those classes. Is that right? If so, I could also specify my own custom vocabulary, not limited to the predefined coco_class_texts.json, lvis_v1_base_class_captions.json, lvis_v1_class_texts.json, and obj365v1_class_texts.json. In theory that should also work, right?

wondervictor commented 4 months ago

Yes. In general, a detector can only detect the classes it was trained on, and after deployment it is limited to those classes as well. With YOLO-World, you can specify the classes you want to detect (a custom vocabulary), reparameterize them into the model, convert it to ONNX, and deploy it. The custom vocabulary is not limited to the classes in those JSON files; you can write your own JSON to define any classes you like.
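Assuming the custom-text JSON uses the same nested-list format as the bundled files (one inner list of phrases per class), a minimal custom vocabulary could be generated like this (the class names and file path are only examples):

    import json

    # hypothetical custom vocabulary, mirroring the format of files such as
    # obj365v1_class_texts.json
    custom_texts = [["safety helmet"], ["forklift"], ["person"]]

    with open('data/texts/custom_class_texts.json', 'w') as f:
        json.dump(custom_texts, f)

    # then export with:
    #   python deploy/export_onnx.py <config> <checkpoint> \
    #       --custom-text data/texts/custom_class_texts.json --opset 12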

Hongyuan-Liu commented 4 months ago

OK, thanks.

wufei-png commented 3 months ago

Hi @Hongyuan-Liu, currently, we only support exporting ONNX with a customized vocabulary for downstream detection tasks without training. Supporting text prompts for the ONNX model requires the text encoder. Ideally, we can export a larger ONNX model with a text encoder, but currently, it's not in the plan.

For this model, yolo_world_l_t2i_bn_2e-4_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py, the text encoder is:

        text_model=dict(
            type='HuggingCLIPLanguageBackbone',
            model_name='openai/clip-vit-base-patch32',
            frozen_modules=['all']))

Adding this text encoder to the ONNX export does not seem difficult. Was it left out of the plan because the feature is less important, or because of the model size? Inference time is probably another reason. But it would align with the provided demo: you could specify the text for inference and change it for the next inference, which would be more dynamic (screenshot attached); a rough sketch is below.
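For illustration, such a dynamic-vocabulary pipeline could look roughly like this (purely hypothetical: it assumes a detector ONNX exported with an extra text-embedding input named 'texts', which the official export does not currently produce, plus the CLIP text encoder ONNX sketched earlier):

    import numpy as np
    import onnxruntime as ort

    # hypothetical file names and input names
    text_sess = ort.InferenceSession('clip_text_encoder.onnx')
    det_sess = ort.InferenceSession('yolo_world_with_text_input.onnx')

    def detect(image_tensor, input_ids, attention_mask):
        # input_ids / attention_mask: int64 numpy arrays from the CLIP tokenizer
        # encode the current vocabulary on the fly, then run the detector
        (text_embeds,) = text_sess.run(None, {'input_ids': input_ids,
                                              'attention_mask': attention_mask})
        text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
        return det_sess.run(None, {'images': image_tensor,
                                   'texts': text_embeds[None]})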