UX-Decoder / DINOv

[CVPR 2024] Official implementation of the paper "Visual In-Context Prompting"

'evaluate_demo_content_openset_multi_with_content_features' and 'evaluate_visual_prompt_refer_multi_with_content_features' #29

Open Sun-Jing-Kang opened 4 months ago

Sun-Jing-Kang commented 4 months ago

Thank you for your great work!

I have a few questions I'd like to ask you:

Recently, while reproducing your work on my own dataset, I noticed two functions for mask inference named 'evaluate_demo_content_openset_multi_with_content_features' and 'evaluate_visual_prompt_refer_multi_with_content_features'.

In the provided demo, the default function is 'evaluate_demo_content_openset_multi_with_content_features'; when I change it to 'evaluate_visual_prompt_refer_multi_with_content_features', the results are poor.
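For reference, I am switching between the two paths roughly like this (a hypothetical toggle, not the demo's actual code; the real argument lists in DINOv may differ, and `batched_inputs` is just a placeholder for whatever inputs the demo prepares):

```python
# Hypothetical toggle between the two inference paths; the real signatures in
# DINOv may differ, and `batched_inputs` stands in for the demo's prepared inputs.
USE_REFER_PATH = False  # set True to try the visual-prompt "refer" variant

def run_inference(model, batched_inputs, use_refer=USE_REFER_PATH):
    if use_refer:
        return model.evaluate_visual_prompt_refer_multi_with_content_features(batched_inputs)
    return model.evaluate_demo_content_openset_multi_with_content_features(batched_inputs)
```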

I found that in 'evaluate_demo_content_openset_multi_with_content_features', tgt comes from pretrained weights such as 'self.query_feat.weight' and 'self.query_embed.weight', while in 'evaluate_visual_prompt_refer_multi_with_content_features', the queries come from the prompt positions, similar to SAM. Is my understanding correct?
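To make my question concrete, here is a minimal sketch of the two query-initialization paths as I understand them (illustrative only, not the repository's actual code; `QueryInitSketch`, `num_queries`, `hidden_dim`, `prompt_features`, and `prompt_pos` are names I made up, assuming a Mask2Former/DETR-style decoder):

```python
import torch.nn as nn

class QueryInitSketch(nn.Module):
    def __init__(self, num_queries=100, hidden_dim=256):
        super().__init__()
        # Learned, dataset-trained content/positional query embeddings,
        # analogous to self.query_feat / self.query_embed mentioned above.
        self.query_feat = nn.Embedding(num_queries, hidden_dim)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)

    def openset_queries(self, batch_size):
        # "openset" path: tgt starts from the pretrained learned embeddings,
        # broadcast to shape (num_queries, batch_size, hidden_dim).
        tgt = self.query_feat.weight.unsqueeze(1).repeat(1, batch_size, 1)
        query_pos = self.query_embed.weight.unsqueeze(1).repeat(1, batch_size, 1)
        return tgt, query_pos

    def visual_prompt_queries(self, prompt_features, prompt_pos):
        # "refer" path: tgt is derived from the visual prompt itself
        # (SAM-like), so it depends on the prompt rather than learned weights.
        return prompt_features, prompt_pos
```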

What is the difference between these two methods, how should I choose the appropriate one for mask retrieval, and do the provided pretrained weights tend to give better results only on objects that already appear in the training dataset?

Finally, how can I make the algorithm perform well on new objects without retraining the model?

Thank you for your patience; I look forward to your reply!

zzz123123123123 commented 4 weeks ago

Hello, I have the same issue. Is the 'evaluate_visual_prompt_refer_multi_with_content_features' function the referring segmentation method in the paper, while the default function is the general segmentation method that uses the pre-trained weights? Is my understanding correct?