Sorry for the late response. For the first question, the multi-scale features refer to features extracted from multiple levels of the ViT. In addition, since our method does not constrain the architecture of the encoder, the extracted features are naturally multi-scale when Swin Transformer or ResNet is used as the encoder. For the second question, we implement the global relationship construction through parameter sharing. More specifically, as stated in Sec. 3.3 (Task-sharing generic path): "all task features will go through this generic convolution, it will be optimized by the gradients of different tasks simultaneously, which can help extract common features among all tasks". The common features here can be considered a kind of global relationship. I hope this helps.
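For readers skimming this thread, here is a minimal PyTorch sketch of the two points above, assuming a plain ViT backbone; the class and argument names (`MultiLevelViTFeatures`, `TaskSharingGenericConv`, `tap_indices`, `generic_conv`) are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn as nn


class MultiLevelViTFeatures(nn.Module):
    """Collect intermediate token embeddings from several ViT blocks.

    "Multi-scale" here means features tapped from different depths of the
    encoder; each tapped block still outputs tokens of shape [B, L, C].
    """

    def __init__(self, embed_dim=768, depth=12, tap_indices=(2, 5, 8, 11)):
        super().__init__()
        self.vit_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
            for _ in range(depth)
        )
        self.tap_indices = set(tap_indices)

    def forward(self, tokens):
        feats = []
        for i, block in enumerate(self.vit_blocks):
            tokens = block(tokens)
            if i in self.tap_indices:
                feats.append(tokens)  # one X_l = F_l(I) per tapped layer
        return feats


class TaskSharingGenericConv(nn.Module):
    """One convolution shared by every task (parameter sharing).

    All task features pass through the same generic_conv, so its weights
    receive gradients from every task loss and learn task-common features.
    """

    def __init__(self, channels=768):
        super().__init__()
        self.generic_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, task_feats):
        # task_feats: dict mapping task name -> [B, C, H, W] feature map
        # (token features would be reshaped to spatial maps beforehand)
        return {name: self.generic_conv(f) for name, f in task_feats.items()}
```

In this sketch, calling `TaskSharingGenericConv` on the features of all tasks and backpropagating each task loss through it is what "optimized by the gradients of different tasks simultaneously" refers to.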
I appreciate your answer.
Thank you!
Hi! Thank you for your great work!
I have a basic question.
In Section 3.1 of your paper, there is the following sentence: "We utilize an off-the-shelf vision transformer (ViT) as the encoder and collect multiscale features from different layers."
To my knowledge, every layer of a ViT encoder outputs embeddings at the same scale, with [L x C] dimensions. The paper also defines the multi-scale feature set { X_l = F_l(I) }. In short, I cannot clearly understand what "multi-scale" means here.
Also, I would like to know the reasoning, or a reference, for how a simple generic convolution can achieve the global relationship. In my view, the paper does not explain why the generic convolution (GC) provides a global relationship. (I'm sorry if this comes across badly; that is not my intent.)
Could you please clarify the questions above?
Thank you in advance!!!!