Hi there, thanks for your amazing work! After reading the SegGPT paper, I'm a little confused about in-context tuning. The paper says that during in-context tuning, SegGPT treats a learnable image tensor as a learnable prompt. But in normal training, the input is a pair of in-context examples, each an image with its mask, e.g. image1-mask1 and image2-mask2. So is the learnable image tensor essentially a random image-mask pair? And with image3-mask3 from the dataset, is the whole input then the learnable image-mask (prompt) plus image3-mask3? Since the mask half of that random image-mask tensor is random, there seems to be no label for computing the loss and backpropagating gradients — so how is it actually trained? Could you explain this in more detail and help me understand? Thanks!
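To make my question concrete, here is a minimal sketch of the setup as I currently understand it. Everything here is my assumption, not the actual SegGPT implementation: the shapes, the dummy `model_forward`, and especially the guess that the loss is computed against the query's own mask3 while the learnable prompt has no label of its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes just to keep the sketch readable (the real model uses
# full-size images; these numbers are my assumption).
H, W, C = 8, 8, 3

# My understanding: in in-context tuning, the prompt image-mask pair
# is replaced by a single learnable tensor, initialized randomly.
learnable_prompt = rng.standard_normal((2 * H, W, C))  # "image + mask" slot

# The query pair (image3, mask3) still comes from the dataset.
image3 = rng.standard_normal((H, W, C))
mask3 = (rng.random((H, W, C)) > 0.5).astype(float)  # ground-truth mask

def model_forward(prompt, query_image):
    """Stand-in for the frozen SegGPT model: returns a predicted mask
    for the query. A dummy function of both inputs so the sketch runs."""
    return np.tanh(query_image + prompt[:H].mean())

pred_mask = model_forward(learnable_prompt, image3)

# Is this the resolution of my confusion? The label would be mask3
# (the query's own ground truth), so the loss is well-defined even
# though the prompt's mask half is random, and gradients would flow
# back into learnable_prompt while the model weights stay frozen.
loss = np.mean((pred_mask - mask3) ** 2)
print(float(loss))
```

If this sketch matches the intended setup, then my remaining question is only whether the loss on the query mask is indeed the sole training signal for the learnable prompt.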