I have trained a Mask2Former model, and its instance segmentation results have much more precise object boundaries than the results I get from Mask R-CNN. My speculation is that this is because, when the feature maps from the pixel decoder are fed to the transformer decoder, positional embeddings are added as well, which helps the model relate the same object across feature resolutions from low to high. Is that the actual reason? I'm not sure.
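To make the mechanism I'm asking about concrete, here is a minimal PyTorch sketch of how I understand it: flattened pixel-decoder features at several scales get a 2-D sinusoidal positional embedding (plus a per-scale level embedding) added to the keys before the object queries cross-attend to them. The function name `sine_pos_embed`, the shapes, and the single attention layer are hypothetical simplifications of mine, not the actual Mask2Former code:

```python
import torch
import torch.nn as nn


def sine_pos_embed(h, w, dim=256, temperature=10000):
    """2-D sinusoidal positional embedding (DETR-style); a simplified variant."""
    y, x = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    omega = temperature ** (torch.arange(dim // 4) / (dim // 4))
    pe = torch.cat(
        [
            (x[..., None] / omega).sin(), (x[..., None] / omega).cos(),
            (y[..., None] / omega).sin(), (y[..., None] / omega).cos(),
        ],
        dim=-1,
    )                      # (h, w, dim)
    return pe.flatten(0, 1)  # (h*w, dim)


# Hypothetical setup: three pixel-decoder feature maps at different scales,
# each already projected to a common embedding dimension of 256.
dim, num_queries = 256, 100
feats = [torch.randn(1, dim, s, s) for s in (8, 16, 32)]
level_embed = nn.Embedding(len(feats), dim)  # learnable per-scale embedding
queries = torch.randn(1, num_queries, dim)
cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

# Round-robin over scales: the keys get a sinusoidal positional embedding
# (plus a level embedding), so attention stays location- and scale-aware even
# though the flattened tokens themselves carry no spatial order.
for lvl, f in enumerate(feats):
    b, c, h, w = f.shape
    tokens = f.flatten(2).transpose(1, 2)          # (b, h*w, c)
    pos = sine_pos_embed(h, w, dim).unsqueeze(0)   # (1, h*w, c)
    keys = tokens + pos + level_embed.weight[lvl]  # positionally encoded keys
    queries, _ = cross_attn(queries, keys, tokens)  # queries attend to this scale
```

If this is roughly what happens inside the model, is the positional information on the keys really what explains the sharper mask boundaries, or does the improvement come from something else (e.g. masked attention or the high-resolution pixel-decoder features themselves)?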