Open Nastu-Ho opened 9 months ago
I'm also wondering what the authors mean by this. Is it for the mask generation for SA-1B, or is it used during the training of SAM itself? That is, if someone is trying to train a SAM model (with a different architecture), do they have to implement this as well?
During training, each batch of image embeddings is fed to the mask decoder iteratively 11 times, so the mask decoder outputs mask logits 11 times. Should the mask logits from every iteration be used to calculate the loss, or only the mask logits from the last prediction?
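To make the two options concrete, here is a minimal pure-Python sketch. It does not say which variant SAM actually uses. `toy_mask_decoder`, `run_iterations`, the 4-pixel "embedding", and the choice of binary cross-entropy are all hypothetical stand-ins for illustration, not code from the SAM repository.

```python
import math

NUM_ITERATIONS = 11  # number of decoder passes described in the question


def toy_mask_decoder(embedding, prev_logits):
    """Hypothetical stand-in for a mask decoder: moves the previous
    logits halfway toward the embedding values on each call."""
    return [0.5 * p + 0.5 * e for p, e in zip(prev_logits, embedding)]


def bce_with_logits(logits, target):
    """Mean binary cross-entropy over pixels, computed from raw logits
    using the numerically stable form max(z, 0) - z*t + log(1 + exp(-|z|))."""
    total = 0.0
    for z, t in zip(logits, target):
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def run_iterations(embedding, target):
    """Run the decoder NUM_ITERATIONS times, feeding each output back in,
    and record the loss of every intermediate prediction."""
    logits = [0.0] * len(embedding)
    losses = []
    for _ in range(NUM_ITERATIONS):
        logits = toy_mask_decoder(embedding, logits)
        losses.append(bce_with_logits(logits, target))
    return losses


# Toy data (assumed): 4 "pixels", ground-truth mask is on for pixels 0 and 2.
embedding = [4.0, -4.0, 4.0, -4.0]
target = [1.0, 0.0, 1.0, 0.0]

per_iteration_losses = run_iterations(embedding, target)

# Option A: supervise every iteration (here, the mean of all 11 losses).
loss_all_iterations = sum(per_iteration_losses) / len(per_iteration_losses)

# Option B: supervise only the final prediction.
loss_last_only = per_iteration_losses[-1]
```

In a real training loop the difference is which of these scalars gets backpropagated: Option A sends gradient through every intermediate prediction, while Option B only supervises the final one.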