Hi there, thanks for your amazing work! After reading the SegGPT paper, I'm a little confused about in-context tuning. The paper says that during in-context tuning, SegGPT treats a learnable image tensor as a learnable prompt. But in normal training, the input is a pair of in-context examples, each an image with its mask, e.g. image1-mask1 and image2-mask2. So is the learnable image tensor essentially a random image-mask pair? And with image3-mask3 from the dataset, is the whole input then the learnable image-mask (prompt) plus image3-mask3? Since the mask half of that random image-mask tensor is random, there seems to be no label for computing the loss and backpropagating gradients — so how is it actually trained? Could you explain this in more detail and help me understand? Thanks!
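To make my question concrete, here is a minimal sketch of the setup as I currently understand it. Everything here is my assumption, not the actual SegGPT implementation: the shapes, the dummy `model_forward`, and especially the guess that the loss is computed against the query's own mask3 while the learnable prompt has no label of its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes just to keep the sketch readable (the real model uses
# full-size images; these numbers are my assumption).
H, W, C = 8, 8, 3

# My understanding: in in-context tuning, the prompt image-mask pair
# is replaced by a single learnable tensor, initialized randomly.
learnable_prompt = rng.standard_normal((2 * H, W, C))  # "image + mask" slot

# The query pair (image3, mask3) still comes from the dataset.
image3 = rng.standard_normal((H, W, C))
mask3 = (rng.random((H, W, C)) > 0.5).astype(float)  # ground-truth mask

def model_forward(prompt, query_image):
    """Stand-in for the frozen SegGPT model: returns a predicted mask
    for the query. A dummy function of both inputs so the sketch runs."""
    return np.tanh(query_image + prompt[:H].mean())

pred_mask = model_forward(learnable_prompt, image3)

# Is this the resolution of my confusion? The label would be mask3
# (the query's own ground truth), so the loss is well-defined even
# though the prompt's mask half is random, and gradients would flow
# back into learnable_prompt while the model weights stay frozen.
loss = np.mean((pred_mask - mask3) ** 2)
print(float(loss))
```

If this sketch matches the intended setup, then my remaining question is only whether the loss on the query mask is indeed the sole training signal for the learnable prompt.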