MendelXu / SAN

Open-vocabulary Semantic Segmentation
https://mendelxu.github.io/SAN/
MIT License

question about the difference between the paper and code implementation #10

Closed zhengyuan-xie closed 1 year ago

zhengyuan-xie commented 1 year ago

Thanks for your great work! I'm having some trouble understanding the code. In visual.py, the sos tokens (I also wonder why they are called sos tokens instead of sls tokens) are computed as follows:

sos_token = cross_attn_layer(
    resblock,
    sos_token,
    x[1:,],
    attn_biases[i],
)

and cross_attn_layer is:

def cross_attn_layer(self: ResidualAttentionBlock, x, mem, attn_bias):
    # x: [K,N,C]
    # mem: [L,N,C]
    # attn_bias: [N*num_head,K,L]
    # return: [K,N,C]
    q_x = self.ln_1(x)
    k_x = v_x = self.ln_1(mem)
    x = x + self.ls_1(
        cross_attn_with_self_bias(self.attn, q_x, k_x, v_x, attn_mask=attn_bias)[0]
    )
    x = x + self.ls_2(self.mlp(self.ln_2(x)))
    return x

It uses the sos_token to obtain q and the visual tokens to obtain k and v, but formula (3) in Section 3 of the paper seems to use the sls tokens to obtain v. I'd like to know why there is this discrepancy between the paper and the code, or whether there is a problem with my understanding. Thanks!
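To make sure I'm describing the two readings correctly, here is a minimal single-head sketch of how I understand them (illustrative only, not the repo code; no projections, multi-head attention, or attention bias):

import torch

# Shapes follow the comments above: K sos/[SLS] tokens, L visual tokens, N batch, C channels.
K, L, N, C = 4, 196, 2, 64
sos_token = torch.randn(K, N, C)  # queries
visual = torch.randn(L, N, C)     # visual tokens

q = sos_token.transpose(0, 1)     # [N,K,C]

# Reading 1 (the code as I understand it): k and v come from the visual tokens only.
k = v = visual.transpose(0, 1)                                    # [N,L,C]
attn = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)  # [N,K,L]
out_code = attn @ v                                               # [N,K,C]

# Reading 2 (formula (3) as I read it): the sls tokens also appear in k and v.
k2 = v2 = torch.cat([sos_token, visual], dim=0).transpose(0, 1)    # [N,K+L,C]
attn2 = torch.softmax(q @ k2.transpose(-2, -1) / C ** 0.5, dim=-1) # [N,K,K+L]
out_paper = attn2 @ v2                                             # [N,K,C]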

zhengyuan-xie commented 1 year ago

And there is another silly question: in Fig. 4, why does the top-left corner of the diagram have a white diagonal (i.e., the query of an sls token can be updated by the key of that same sls token, but not by the keys of the other sls tokens)? I didn't find a description of this in the paper. I hope you can help me!

MendelXu commented 1 year ago

> It uses the sos_token to obtain q and the visual tokens to obtain k and v, but formula (3) in Section 3 of the paper seems to use the sls tokens to obtain v. I'd like to know why there is this discrepancy between the paper and the code, or whether there is a problem with my understanding.

MendelXu commented 1 year ago

> And there is another silly question: in Fig. 4, why does the top-left corner of the diagram have a white diagonal (i.e., the query of an sls token can be updated by the key of that same sls token, but not by the keys of the other sls tokens)? I didn't find a description of this in the paper. I hope you can help me!

It does look a little weird. Initially, all "sls_token" values are identical, so letting them attend to each other would be much like attending more strongly to oneself. We did try letting the sls tokens attend to each other, but found that performance dropped slightly.
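Roughly, the pattern in that corner of Fig. 4 looks like the sketch below (a simplified illustration, not the exact attn_bias construction in the repo; the shapes are made up): each [SLS] query may attend to itself and to the visual tokens, while attention to the other [SLS] tokens is blocked, which is the white diagonal.

import torch

# Simplified illustration of the top-left block of the attention bias:
# K [SLS] tokens followed by L visual tokens; 0 = attention allowed, -inf = blocked.
K, L = 4, 16
bias = torch.zeros(K + L, K + L)

# Each [SLS] token may attend to itself but not to the other [SLS] tokens,
# which produces the white diagonal in the top-left corner of Fig. 4.
sls_block = torch.full((K, K), float("-inf"))
sls_block.fill_diagonal_(0.0)
bias[:K, :K] = sls_block

print(bias[:K, :K + 4])  # inspect the [SLS] rows: own key allowed, other [SLS] keys blocked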

zhengyuan-xie commented 1 year ago

> It does look a little weird. Initially, all "sls_token" values are identical, so letting them attend to each other would be much like attending more strongly to oneself. We did try letting the sls tokens attend to each other, but found that performance dropped slightly.

I understand. Thanks again for your work!