Open ditengm opened 12 months ago
Try the prompt "Describe the image in detail." Or just try our new model, Qwen-VL-plus, described in the README.
@ditengm If you don't want any box-like annotations in the response, you can reliably get the cleaned text with the following post-processing.
import re

# Example raw model output containing <ref>/<box> grounding annotations
response = '<ref> Two apples</ref><box>(302,257),(582,671)</box><box>(603,252),(878,642)</box> and<ref> a bowl</ref><box>(2,269),(304,674)</box>'
# Keep the text inside <ref>...</ref> and drop the <box>/<quad> coordinates that follow it
clean_response = re.sub(r'<ref>(.*?)</ref>(?:<box>.*?</box>)*(?:<quad>.*?</quad>)*', r'\1', response).strip()
print(clean_response)
# prints: Two apples and a bowl
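If a response can also contain `<box>` or `<quad>` spans that are not directly preceded by a `<ref>` tag, the single substitution above would leave them in place. Here is a minimal sketch of a more defensive variant. The helper name `strip_grounding_tags` is hypothetical (it is not part of Qwen-VL), and it assumes the tag format shown in the example response.

```python
import re

# Hypothetical helper, not part of Qwen-VL; assumes the <ref>/<box>/<quad>
# tag format shown in the example response above.
_REF = re.compile(r'<ref>(.*?)</ref>', re.DOTALL)
_COORDS = re.compile(r'<(box|quad)>.*?</\1>', re.DOTALL)


def strip_grounding_tags(response: str) -> str:
    """Return the response text with grounding annotations removed."""
    text = _REF.sub(r'\1', response)          # keep the referenced phrase
    text = _COORDS.sub('', text)              # drop every coordinate span
    return re.sub(r'\s+', ' ', text).strip()  # collapse leftover whitespace


print(strip_grounding_tags(
    '<ref> Two apples</ref><box>(302,257),(582,671)</box> and'
    '<ref> a bowl</ref><box>(2,269),(304,674)</box> on a table'
    '<box>(0,0),(999,999)</box>.'
))
# prints: Two apples and a bowl on a table.
```

Splitting the work into two passes keeps each pattern simple, at the cost of also collapsing any intentional runs of whitespace in the description.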
Start Date
12.7.2023
Implementation PR
-
Reference Issues
What prompt is needed to ensure that the model does not return detected objects?
Summary
I have tried several prompts to get the model to simply describe the objects in an image.
What prompt do I need so that the model returns a detailed description instead of detection output?
MODEL_NAME = '4bit/Qwen-VL-Chat-Int4'
Basic Example
Examples:
text_1 = 'You can write only in English. Step by step describe the all objects (environment, emotions, devices and other things) in the image'
text_2 = 'You can write only in English. Step by step describe the all (environment, emotions, devices and other things) in the image'
text_3 = 'You can write only in English. Step by step describe it'
text_4 = 'You can only write in English. Describe everything (environment, emotions, device, etc.) in the image step by step and in detail.'
With every one of these prompts, the model returns detection output for 7 of the 9 photos. I don't need this; I just want an answer with no detection annotations in it.
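To measure how often a prompt triggers detection output (such as the 7 of 9 reported here), one could check each response for grounding tags. This is a minimal sketch, assuming detection output always uses the `<ref>`, `<box>`, or `<quad>` tags seen in this thread. The name `has_detection` and the sample responses are hypothetical.

```python
import re

# Assumption: detection output is marked by <ref>, <box>, or <quad> tags.
_TAG = re.compile(r'<(ref|box|quad)>')


def has_detection(response: str) -> bool:
    """Return True if the response contains any grounding tag."""
    return _TAG.search(response) is not None


# Hypothetical sample responses, for illustration only.
responses = [
    'A man sits at a desk with a laptop.',
    '<ref> A man</ref><box>(10,20),(300,400)</box> sits at a desk.',
]
flagged = sum(has_detection(r) for r in responses)
print(f'{flagged} of {len(responses)} responses contain detection tags')
# prints: 1 of 2 responses contain detection tags
```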
Drawbacks
-
Unresolved questions
-