Closed cgebbe closed 1 month ago
Hi,
thank you for the great question.
Regarding 1: We have not tested the Mask2Former approach. I am also curious about how it would perform, but I do not have the capacity to test it right now.
Regarding 2: I am not sure you can state this so definitively, as the ViT encoder used by PLUTO is a FlexiViT network, as explained here: https://arxiv.org/pdf/2212.08013. I am also not sure about the multiscale input images. I would say that their model is a more general one, covering a broader range of tasks that can be solved with it.
First of all, thank you for your great paper and code!
The recent PLUTO paper (https://arxiv.org/pdf/2405.07905) also benchmarked your work. It uses the following decoders:
Questions: