Closed by JunzheJosephZhu 1 year ago
Hi @JunzheJosephZhu, thanks for your interest in our work.
The queries are initialized from the mapping of the tokenized "the task is {task}" input (the task token), so yes, the queries have only 3 possible initial values. However, the queries are then updated twice: first inside a transformer right after initialization (with the task token), and then inside the transformer decoder after concatenation (with the task token). As a result, the final queries depend on the feature representations from the pixel decoder, not only on the task.
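To make the distinction between initial and updated queries concrete, here is a minimal pure-Python sketch. It is not the repository's implementation: the 4-dimensional task embeddings, the helper names `init_queries` and `cross_attend`, and the toy image features are all hypothetical, and the learned projections, multi-head attention, and concatenation step are omitted. It only illustrates that identical initial queries diverge once they attend to different image features.

```python
import math

# Hypothetical 4-dim task-token embeddings (assumption: the real ones are learned).
TASK_TOKENS = {
    "panoptic": [1.0, 0.0, 0.0, 0.0],
    "instance": [0.0, 1.0, 0.0, 0.0],
    "semantic": [0.0, 0.0, 1.0, 0.0],
}

def init_queries(task, num_queries):
    """Every query starts as a copy of the task token: only 3 initial states exist."""
    return [list(TASK_TOKENS[task]) for _ in range(num_queries)]

def cross_attend(queries, features):
    """One residual cross-attention step of the queries over image features."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        # Scaled dot-product scores between this query and each feature vector.
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        context = [
            sum(w * f[i] for w, f in zip(weights, features)) / total
            for i in range(dim)
        ]
        # Residual connection: the updated query mixes task and image information.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated

# Toy stand-ins for pixel-decoder features of two different images.
image_a = [[0.5, 0.1, 0.0, 0.9], [0.2, 0.7, 0.3, 0.1]]
image_b = [[0.9, 0.9, 0.1, 0.0], [0.0, 0.2, 0.8, 0.4]]

q0 = init_queries("panoptic", 3)
qa = cross_attend(q0, image_a)
qb = cross_attend(q0, image_b)
print(qa != qb)  # same task, different images, different queries
```

In this sketch the initial queries for a given task are always the same, but after a single attention step they already differ between the two images.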
Feel free to re-open if you face any issues.
According to the paper, the queries Q are only conditioned on "the task is {task}", but {task} only has 3 possible values. So do the queries only have 3 possible values?