Little-Podi / GRM

[CVPR'23] The official PyTorch implementation of our CVPR 2023 paper: "Generalized Relation Modeling for Transformer Tracking".
MIT License

Why doesn't your model directly apply the mask with -infinity and then apply the softmax function? Could you please explain the following code? #7

Closed JananiKugaa closed 1 year ago

JananiKugaa commented 1 year ago

For stable training

    # Subtract the per-row max so exp() does not overflow (numerical stability)
    max_att, _ = torch.max(attn, dim=-1, keepdim=True)
    attn = attn - max_att
    # Exponentiate, then suppress masked positions by multiplying with the policy
    attn = attn.to(torch.float32).exp_() * attn_policy.to(torch.float32)
    # Normalize; the eps terms keep the division well-defined even if a row is fully masked
    attn = (attn + eps / N) / (attn.sum(dim=-1, keepdim=True) + eps)
Little-Podi commented 1 year ago

The direct replace operation (overwriting masked logits with -infinity) is not differentiable with respect to the mask, which means the attn_policy produced by our prediction modules would not receive any gradients during training. That makes it unsuitable, since the prediction modules for token division would not learn at all. During inference, however, you can use any operation with identical functionality if you think it is better.
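
To illustrate the point, here is a minimal, self-contained sketch of a multiplicative masked softmax in the style of the snippet above. The function name `masked_softmax` and the toy shapes are my own for illustration; the key property shown is that gradients flow back into the policy tensor, which would not happen if masked logits were simply overwritten with -infinity.

```python
import torch

def masked_softmax(attn, attn_policy, eps=1e-6):
    # Differentiable masked softmax: instead of replacing masked logits
    # with -inf (a non-differentiable overwrite), multiply the
    # exponentiated scores by the (soft) policy so the policy itself
    # receives gradients.
    N = attn_policy.shape[-1]
    # Subtract the per-row max for numerical stability
    max_att, _ = torch.max(attn, dim=-1, keepdim=True)
    attn = attn - max_att
    # Exponentiate and weight by the policy mask
    attn = attn.to(torch.float32).exp_() * attn_policy.to(torch.float32)
    # Normalize; eps keeps the result well-defined for fully masked rows
    return (attn + eps / N) / (attn.sum(dim=-1, keepdim=True) + eps)

# A policy produced by a learnable module receives gradients:
logits = torch.randn(1, 4, 4)
policy = torch.rand(1, 4, 4, requires_grad=True)
out = masked_softmax(logits, policy)
out.sum().backward()
print(policy.grad is not None)  # True: the mask is trainable
```

Each output row still sums to 1 by construction, so downstream attention behaves as usual while the token-division modules stay trainable.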