THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0
5.28k stars 435 forks source link

GLM-4V-9B Bounding Box #497

Closed Stanleyluuuu closed 2 months ago

Stanleyluuuu commented 2 months ago

System Info / 系統信息

Hi,

I'm using GLM-4v-9B to develop a feature that allows users to input an image and receive the corresponding bounding box. For example, the prompt might be: "Is there any person fall down? Give me the bounding box in (x1, y1, x2, y2) format if exists."

However, I noticed that the bounding box does not fully enclose the person who has fallen. Could you provide any guidance or instructions regarding the bounding box output?

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

  1. Input an image and give the prompt "Is there any person fall down? Give me the bounding box in (x1, y1, x2, y2) format if exists."
  2. Plot the output bounding box on the image.

Expected behavior / 期待表现

I expect to understand how to guide the model to output bounding box coordinate in the format I want.

zRzRzRzRzRzRzR commented 2 months ago

This model hasn’t been trained for grounding, so it doesn’t effectively output bounding boxes (bbx) for grounding tasks.

A good suggestion would be to fine-tune the model using a labeled dataset, like the one you mentioned with bbx, to improve its grounding capabilities. However, this process can be complex, particularly in terms of preparing the dataset, which poses a significant challenge.

Stanleyluuuu commented 2 months ago

OK, I understand. Thanks for the clear explanation.