MendelXu / SAN

Open-vocabulary Semantic Segmentation
https://mendelxu.github.io/SAN/
MIT License

question about the difference between the paper and code implementation #10

Closed zhengyuan-xie closed 1 year ago

zhengyuan-xie commented 1 year ago

Thanks for your great work! I'm having some trouble understanding the code. In visual.py, the sos tokens (I also wonder why they are called sos tokens instead of sls tokens) are computed as follows:

sos_token = cross_attn_layer(
    resblock,
    sos_token,
    x[1:,],
    attn_biases[i],
)

and cross_attn_layer is:

def cross_attn_layer(self: ResidualAttentionBlock, x, mem, attn_bias):
    # x: [K,N,C]
    # mem: [L,N,C]
    # attn_bias: [N*num_head,K,L]
    # return: [K,N,C]
    q_x = self.ln_1(x)
    k_x = v_x = self.ln_1(mem)
    x = x + self.ls_1(
        cross_attn_with_self_bias(self.attn, q_x, k_x, v_x, attn_mask=attn_bias)[0]
    )
    x = x + self.ls_2(self.mlp(self.ln_2(x)))
    return x

It uses the sos_token to obtain q and the visual tokens to obtain k and v, but formula (3) in Section 3 of the paper seems to use the sls tokens to obtain v. I'd like to know why there is this discrepancy between the paper and the code, or whether there is a problem with my understanding. Thanks!
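To make sure I'm describing the two readings correctly, here is a minimal single-head sketch of how I understand them (illustrative only, not the repo code; no projections, multi-head attention, or attention bias):

import torch

# Shapes follow the comments above: K sos/[SLS] tokens, L visual tokens, N batch, C channels.
K, L, N, C = 4, 196, 2, 64
sos_token = torch.randn(K, N, C)  # queries
visual = torch.randn(L, N, C)     # visual tokens

q = sos_token.transpose(0, 1)     # [N,K,C]

# Reading 1 (the code as I understand it): k and v come from the visual tokens only.
k = v = visual.transpose(0, 1)                                    # [N,L,C]
attn = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)  # [N,K,L]
out_code = attn @ v                                               # [N,K,C]

# Reading 2 (formula (3) as I read it): the sls tokens also appear in k and v.
k2 = v2 = torch.cat([sos_token, visual], dim=0).transpose(0, 1)    # [N,K+L,C]
attn2 = torch.softmax(q @ k2.transpose(-2, -1) / C ** 0.5, dim=-1) # [N,K,K+L]
out_paper = attn2 @ v2                                             # [N,K,C]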

zhengyuan-xie commented 1 year ago

And there is another silly question: in Fig. 4, why does the top-left corner of the diagram have a white diagonal (i.e., the query of an sls token can be updated by the key of that same sls token, but not by the keys of the other sls tokens)? I didn't find a description of this in the paper. I hope you can help me!

MendelXu commented 1 year ago

> It uses the sos_token to obtain q and the visual tokens to obtain k and v, but formula (3) in Section 3 of the paper seems to use the sls tokens to obtain v. I'd like to know why there is this discrepancy between the paper and the code, or whether there is a problem with my understanding.

MendelXu commented 1 year ago

> And there is another silly question: in Fig. 4, why does the top-left corner of the diagram have a white diagonal (i.e., the query of an sls token can be updated by the key of that same sls token, but not by the keys of the other sls tokens)? I didn't find a description of this in the paper. I hope you can help me!

It does look a little weird. Initially, all "sls_token" values are identical, so letting them attend to each other would be much like attending more strongly to oneself. We did try letting the sls tokens attend to each other, but found that performance dropped slightly.
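Roughly, the pattern in that corner of Fig. 4 looks like the sketch below (a simplified illustration, not the exact attn_bias construction in the repo; the shapes are made up): each [SLS] query may attend to itself and to the visual tokens, while attention to the other [SLS] tokens is blocked, which is the white diagonal.

import torch

# Simplified illustration of the top-left block of the attention bias:
# K [SLS] tokens followed by L visual tokens; 0 = attention allowed, -inf = blocked.
K, L = 4, 16
bias = torch.zeros(K + L, K + L)

# Each [SLS] token may attend to itself but not to the other [SLS] tokens,
# which produces the white diagonal in the top-left corner of Fig. 4.
sls_block = torch.full((K, K), float("-inf"))
sls_block.fill_diagonal_(0.0)
bias[:K, :K] = sls_block

print(bias[:K, :K + 4])  # inspect the [SLS] rows: own key allowed, other [SLS] keys blocked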

zhengyuan-xie commented 1 year ago

> It does look a little weird. Initially, all "sls_token" values are identical, so letting them attend to each other would be much like attending more strongly to oneself. We did try letting the sls tokens attend to each other, but found that performance dropped slightly.

I understand. Thanks again for your work!