baaivision / Painter

Painter & SegGPT Series: Vision Foundation Models from BAAI

Question about the learnable image tensor of in-context tuning in SegGPT #18

Open YangHan-Morningstar opened 1 year ago

YangHan-Morningstar commented 1 year ago

Hi there, thanks for your amazing work. After reading the SegGPT paper, I'm a little confused about in-context tuning. The paper says that during in-context tuning, SegGPT treats a learnable image tensor as the prompt. But in normal training, the input is a pair of in-context examples, each with its own mask, e.g. image1-mask1 and image2-mask2. So is the learnable image tensor a random image-mask pair? And with image3-mask3 from the dataset, is the whole input the learnable image-mask (prompt) plus image3-mask3? Since the mask of that random image-mask pair is itself random, it has no label for loss calculation and gradient backpropagation, so how is it trained? Please tell me more and help me understand this. Thanks!

SteveImmanuel commented 7 months ago

This is my implementation based on my understanding.

I think that once you use the learnable prompt, you simply replace image1-mask1 with the image tensor that you optimize.
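
A minimal sketch of that idea, assuming a frozen SegGPT-style model whose forward pass takes a prompt image, a prompt mask, and a target image. The `DummySegGPT` stand-in, its forward signature, the 448x448 resolution, the smooth-L1 loss, and the AdamW settings are all illustrative assumptions, not the repo's actual API. The key point is that the loss is computed against the target mask (e.g. mask3), which comes from the dataset and therefore does have a label, while gradients flow only into the prompt tensors:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H, W = 448, 448  # assumed input resolution

# Stand-in for the real pretrained SegGPT model (hypothetical; the actual
# model stitches the prompt pair and target pair into one input canvas).
class DummySegGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(9, 3, kernel_size=3, padding=1)

    def forward(self, prompt_image, prompt_mask, target_image):
        x = torch.cat([prompt_image, prompt_mask, target_image], dim=1)
        return self.proj(x)  # predicted mask for the target image

model = DummySegGPT()
for p in model.parameters():          # freeze all model weights
    p.requires_grad_(False)

# Learnable prompt pair that takes the place of image1-mask1.
prompt_image = nn.Parameter(torch.randn(1, 3, H, W))
prompt_mask = nn.Parameter(torch.randn(1, 3, H, W))
optimizer = torch.optim.AdamW([prompt_image, prompt_mask], lr=1e-3)

# Stand-in data: real tuning would iterate over image3-mask3 pairs
# drawn from the downstream dataset.
dataloader = [(torch.randn(1, 3, H, W), torch.randn(1, 3, H, W)) for _ in range(4)]

for target_image, target_mask in dataloader:
    # The prompt pair plays the role of image1-mask1; the target pair
    # supplies the ground-truth mask, so the loss is well defined.
    pred_mask = model(prompt_image, prompt_mask, target_image)
    loss = F.smooth_l1_loss(pred_mask, target_mask)

    optimizer.zero_grad()
    loss.backward()                   # gradients flow only into the prompt tensors
    optimizer.step()
```

Because the model weights are frozen and only the two prompt tensors are handed to the optimizer, the random initialization of the prompt is irrelevant: it simply gets updated until it serves as a good in-context example for the downstream task.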