bhack closed this 1 week ago
Yes, we only do interpolation, and I think that is the common approach for handling different resolutions.
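For readers following along, here is a minimal sketch of what "only interpolation" usually means in PyTorch. It is an illustration under assumptions, not the code from this repository: the tensor name `low_res_logits`, the single-channel shape, and the choice of bilinear mode are all made up for the example.

```python
import torch
import torch.nn.functional as F

# Assumed stand-in for the decoder output: one 256x256 map of mask logits.
low_res_logits = torch.randn(1, 1, 256, 256)

# Fixed (non-learnable) 4x upscaling to the 1024x1024 input resolution.
upscaled = F.interpolate(
    low_res_logits, size=(1024, 1024), mode="bilinear", align_corners=False
)

# Threshold the logits to get a binary mask at full resolution.
binary_mask = upscaled > 0.0
print(upscaled.shape, binary_mask.dtype)
```

Each upscaled value is a weighted average of neighbouring low-resolution values, so this step has no parameters and cannot add detail that is absent from the 256x256 map.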
Have you tried learning an additional upsampling layer to reduce the artifacts?
Unfortunately, we didn't try an upsampling layer. If you are interested, you can run some experiments, and we would welcome merging them into our code base if they demonstrate better performance.
I have tried adding a few more layers to progressively upscale up to 1024. Of course, these extra layers are initialized from scratch, but it seems hard for them to learn. This is why I asked whether you have experimented with this.
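To make the idea concrete, here is a rough sketch of such a progressive upscaling head in PyTorch. Everything in it is hypothetical: the class name `ProgressiveUpscaler`, the channel width, and the layer choices are not taken from this repository or from my actual experiments. It also adds one assumption of its own, a zero-initialized residual on top of bilinear interpolation, so that the freshly initialized layers start out reproducing the interpolated baseline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveUpscaler(nn.Module):
    """Hypothetical head: 256 -> 512 -> 1024 via two learnable 2x stages."""

    def __init__(self, channels: int = 8):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        # Each transposed conv with kernel 2 and stride 2 doubles H and W.
        self.up1 = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        self.up2 = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        self.head = nn.Conv2d(channels, 1, kernel_size=1)
        # Assumption: zero-init the last layer so the residual starts at 0.
        nn.init.zeros_(self.head.weight)
        nn.init.zeros_(self.head.bias)

    def forward(self, low_res: torch.Tensor) -> torch.Tensor:
        # Non-learnable baseline, same as plain 4x bilinear interpolation.
        base = F.interpolate(
            low_res, scale_factor=4, mode="bilinear", align_corners=False
        )
        x = F.gelu(self.stem(low_res))
        x = F.gelu(self.up1(x))
        x = F.gelu(self.up2(x))
        # Learned correction added on top of the interpolated logits.
        return base + self.head(x)


model = ProgressiveUpscaler()
with torch.no_grad():
    out = model(torch.randn(1, 1, 256, 256))
print(out.shape)
```

With this initialization the head outputs exactly the bilinear result before any training, so it can only be judged by whether training moves it away from that baseline in a useful direction.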
With a non-learnable interpolation from 256x256 to 1024x1024, you are going to lose a lot of detail.
Does the current mask decoder have a learnable output resolution of only 256x256, with the result then simply interpolated? Won't you see a lot of interpolation artifacts going from 256x256 to 1024x1024 without any intermediate learnable step/layer?