IDEA-Research / T-Rex

API for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/home

Why Does OVP Perform Better Than IVP? #72

Closed thfylsty closed 17 hours ago

thfylsty commented 1 week ago

I have tried the OVP feature on the demo website, and the results are much better than with IVP. I am very curious about how this is achieved, and whether any fine-tuning is involved. My guess: after a first inference pass on the image produces both correct and incorrect bounding boxes, the incorrect boxes are fed back as multi-class negative prompts, and this is repeated iteratively until all objects are detected correctly. But how are multiple positive samples handled? If `prompts.mean()` is used, wouldn't that lose a lot of features? Perhaps many prompts are aggregated into several prompts; after inference, several prompts are merged into one class for display, while the remaining prompts serve as negative samples and are hidden. But... hmm, I think this may not be right. A rough sketch of my guess is below. Thank you for your help, and I look forward to your reply.
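
To make my guess concrete, here is a minimal sketch of the loop I have in mind. Every name in it (`model.detect`, `split_by_user_feedback`, `box_to_prompt`) is hypothetical; this is only my guess, not T-Rex2's actual mechanism:

```python
def iterative_prompt_refinement(model, image, initial_prompts, max_iters=5):
    """Hypothetical loop: wrong detections are fed back as negative prompts."""
    positive_prompts = list(initial_prompts)
    negative_prompts = []
    detections = []
    for _ in range(max_iters):
        # Detect with the current positive prompts plus any accumulated negatives.
        detections = model.detect(image,
                                  positives=positive_prompts,
                                  negatives=negative_prompts)
        correct, incorrect = split_by_user_feedback(detections)  # hypothetical helper
        if not incorrect:
            break
        # Wrong boxes become extra negative-class prompts for the next pass.
        negative_prompts.extend(box_to_prompt(box) for box in incorrect)
    # At display time only the positive class would be shown; negatives stay hidden.
    return detections
```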

Mountchicken commented 1 week ago

Hi @thfylsty

OVP essentially involves the user providing fully annotated images, and then we fine-tune T-Rex2 on these images to overfit to the category provided by the user.

thfylsty commented 1 week ago

Got it, thank you for the reply. But this is inconsistent with my understanding. Did you perhaps misunderstand my question? I was asking about OVP, not something else. Checking the official website, I found the introduction to OVP, which says there is no fine-tuning and the model parameters remain unchanged.

Here is the introduction from https://deepdataspace.com/playground/ovp: "Optimized Visual Prompt: generic detection based on visual prompts rather than model fine-tuning." "Optimized Visual Prompt, a research result led by IDEA-CVR, extends the concept of text prompts to visual prompts, addressing the pain point that text struggles to precisely describe visual features. By submitting just a small number (1-5) of images annotated with the target objects, the best-matching visual prompt is automatically generated within 5 minutes, achieving cross-image, high-precision detection while the model parameters remain unchanged throughout."

Mountchicken commented 6 days ago

Sorry for the misunderstanding. We do not fine-tune the model. Instead, we only initialize an embedding, train that embedding, and then replace the visual prompt embedding with it.
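
In other words, the detector weights stay frozen and only a new prompt embedding is optimized on the user's annotated images, then swapped in at inference. A minimal PyTorch sketch of that idea (the model call signature, `compute_detection_loss`, and the embedding size are illustrative assumptions, not the actual T-Rex2 API):

```python
import torch

EMBED_DIM = 256  # assumed visual-prompt embedding size

def optimize_visual_prompt(model, annotated_images, steps=500, lr=1e-2):
    # Freeze every model parameter; only the new embedding is trainable.
    for p in model.parameters():
        p.requires_grad_(False)

    # Initialize a single learnable visual-prompt embedding.
    prompt_embedding = torch.nn.Parameter(torch.randn(1, EMBED_DIM) * 0.02)
    optimizer = torch.optim.AdamW([prompt_embedding], lr=lr)

    for _ in range(steps):
        for image, gt_boxes in annotated_images:
            # Run detection conditioned on the learnable embedding
            # instead of an image-derived visual prompt (assumed interface).
            pred = model(image, visual_prompt=prompt_embedding)
            loss = compute_detection_loss(pred, gt_boxes)  # assumed helper
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # At inference, the optimized embedding simply replaces the
    # visual-prompt embedding; the detector weights are untouched.
    return prompt_embedding.detach()
```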

thfylsty commented 17 hours ago

Got it, thank you.