FoundationVision / GLEE

[CVPR2024 Highlight]GLEE: General Object Foundation Model for Images and Videos at Scale
https://glee-vision.github.io/
MIT License

About visual prompts #7

Open KKKSQJ opened 8 months ago

KKKSQJ commented 8 months ago

Hello, GLEE is a great piece of work. I have a few questions about some details of the algorithm that I'd like to ask you about. If you have time, I'd appreciate a reply. Thanks!

  1. When using points as visual prompts, does GLEE support negative clicks? Can it refine the segmentation of a target with multiple clicks, the way SAM does?
  2. Looking at your code, it seems a point is converted into a box before being used as a prompt. Why is this done? I couldn't find an explanation in your paper.
  3. Can the topk_instance returned in visual prompt mode only be 1? Can it segment multiple parts of an occluded target? Thanks!
wjf5203 commented 5 months ago

Thank you for your interest in GLEE!

  1. Unfortunately, GLEE has not been trained for multi-turn interaction with visual prompts, so it cannot refine its segmentation results based on multiple clicks. However, you can obtain different segmentation results by drawing different shapes in the demo app.
  2. We need to sample some features on the image based on the visual prompt for self-attention. To extract more information, we expand a point into a small box to standardize this process.
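As a minimal sketch of the point-to-box expansion described above: the idea is that a single click gives too small a region to sample features from, so the point is padded into a small box (clamped to the image bounds) before feature sampling. The function name and the `half_size` padding value below are illustrative assumptions, not GLEE's actual implementation.

```python
def point_to_box(x, y, img_w, img_h, half_size=8):
    """Expand a click point (x, y) into a small square box clamped to the
    image bounds. Returns (x0, y0, x1, y1) in pixel coordinates.
    `half_size` is a hypothetical padding value for illustration."""
    x0 = max(0, x - half_size)
    y0 = max(0, y - half_size)
    x1 = min(img_w, x + half_size)
    y1 = min(img_h, y + half_size)
    return (x0, y0, x1, y1)

# A click well inside a 640x480 image yields a 16x16 box around it;
# a click near the corner is clipped to stay within the image.
print(point_to_box(100, 50, 640, 480))  # (92, 42, 108, 58)
print(point_to_box(2, 3, 640, 480))     # (0, 0, 10, 11)
```

The resulting box can then be treated exactly like a box-type visual prompt, so points and boxes share one feature-sampling code path.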
  3. In visual prompt mode, the top-k can only be 1. If an occluded object is split into two visible parts, both parts are still represented by a single object query; otherwise, they would be predicted as two separate objects.