Hello, thanks for making the code available. I have a question. Is it possible to obtain per-pixel features (e.g., 512-D or 768-D) instead of N_mask x W x H and N_mask x feat_dim that the encoder provides as output?
On a similar note, Is it a correct understanding that the mask proposal generation and then subsequent classification architectures does not have an intermediate per-pixel feature representation?
Hello, thanks for making the code available. I have a question. Is it possible to obtain per-pixel features (e.g., 512-D or 768-D) instead of N_mask x W x H and N_mask x feat_dim that the encoder provides as output?
On a similar note, Is it a correct understanding that the mask proposal generation and then subsequent classification architectures does not have an intermediate per-pixel feature representation?