CrossmodalGroup / LAPS

Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment, CVPR, 2024

Hello, and thank you very much for your excellent work! How is Section 3.2.2 of the paper, Differentiable Decision Matrix, implemented in the code? #4

Open LiangYuHeng66 opened 2 months ago

gaopenghkbu commented 2 months ago

Same question. After the score computation, the per-patch scores do not appear to be binary.

darkpromise98 commented 2 months ago

Following previous work [1], we construct the Differentiable Decision Matrix with Gumbel-softmax. In extensive experiments, we found that the score network does not bring significant benefits and causes training instability, so we did not integrate it into our public code. This does not affect the reproducibility of the experimental results.

[1] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, NeurIPS 2021 (https://github.com/raoyongming/DynamicViT/blob/master/models/dylvvit.py#L512)
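For readers wondering why the decision ends up binary even though the scores are not, here is a minimal sketch of the forward pass of a Gumbel-softmax sample for a single patch, written in plain Python so it runs without PyTorch. The function name `gumbel_softmax_sample` and its arguments are illustrative assumptions and do not come from the LAPS or DynamicViT code. In PyTorch, `F.gumbel_softmax(..., hard=True)` returns the one-hot value in the forward pass while gradients flow through the soft sample; this sketch only reproduces the sampling, not the gradient.

```python
import math
import random


def gumbel_softmax_sample(log_probs, tau=1.0, rng=random):
    """Return (hard one-hot list, soft probability list) for one patch."""
    noisy = []
    for lp in log_probs:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))      # standard Gumbel(0, 1) noise
        noisy.append((lp + gumbel) / tau)
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Hard decision: one-hot at the arg-max of the soft sample.
    best = soft.index(max(soft))
    hard = [1.0 if i == best else 0.0 for i in range(len(soft))]
    return hard, soft


rng = random.Random(0)
log_probs = [math.log(0.7), math.log(0.3)]   # [log P(keep), log P(drop)]
hard, soft = gumbel_softmax_sample(log_probs, tau=1.0, rng=rng)
keep = hard[0]                                # 1.0 = keep the patch, 0.0 = drop it
```

Index 0 plays the role of the "keep" class, mirroring the `[..., 0]` slice in the code quoted in this thread. Over many draws the patch is kept with a frequency close to its keep probability, while each individual decision is exactly 0 or 1.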

We also provide our previous code:

import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenSparse(nn.Module):
    def __init__(self, embed_dim=512, sparse_ratio=0.6, attention_weight=0.8):
        super().__init__()

        self.sparse_ratio = sparse_ratio  
        self.attention_weight = attention_weight

        # score network
        self.score_net = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, embed_dim // 4),
            nn.GELU(),
            nn.Linear(embed_dim // 4, 2),
            nn.LogSoftmax(dim=-1),
            )

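    def forward(self, x, attention_x=None, attention_y=None, gumbel=False, keepdim=False):
        # x: (B_v, L_v, C) patch features
        # attention_x, attention_y: (B_v, L_v) attention scores, both required
        if attention_x is None or attention_y is None:
            raise ValueError("attention_x and attention_y must both be provided")

        B_v, L_v, C = x.size()

        # log-probabilities predicted by the score network, (B_v, L_v, 2)
        score = self.score_net(x)

        # average the two attention sources
        attention = (attention_x + attention_y) * 0.5

        # map values from [-1, 1] to [0, 1] so they can be read as probabilities
        attention = (1 + attention) / 2
        attention_prob = torch.stack([attention, 1 - attention], dim=2)

        # mix network probabilities with attention probabilities, (B_v, L_v, 2)
        score = (1 - self.attention_weight) * score.exp() + self.attention_weight * attention_prob + 1e-8

        # convert back to log-probabilities
        score = torch.log(score)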

        # Gumbel-softmax trick
        if gumbel:
            # F.gumbel_softmax expects (unnormalized) log-probabilities, which is
            # why score was converted back with torch.log above.
            # This step is numerically fragile and can produce NaN.
            if keepdim:
                # (B_v, L_v, 1)
                score_hard = F.gumbel_softmax(score, hard=True)[..., 0:1]
            else:
                # (B_v, L_v)
                score_hard = F.gumbel_softmax(score, hard=True)[..., 0]

            # hard 0/1 keep decision and the soft keep probability
            return score_hard, score[..., 0].exp()

        # otherwise return the mixed log-probabilities directly
        # (B_v, L_v, 2)
        return score

LiangYuHeng66 commented 2 months ago

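OK, thank you for your reply! In the Score Estimation part of your public code, you use score = attention_x + attention_y rather than score = (1 - self.attention_weight) * score.exp() + self.attention_weight * attention_prob + 1e-8. Is that because the latter brought no significant benefit? Would this choice affect the experimental results? I look forward to your reply, thank you.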

darkpromise98 commented 2 months ago

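Yes, you can think of it that way. Using the latter brought no significant performance gain (see the ablation study in the paper), and it also caused NaN values during training. Using the former therefore has little effect on the experimental results. We also provide the training logs, model weights, and hyperparameters, which reach the performance reported in the paper. https://github.com/CrossmodalGroup/LAPS?tab=readme-ov-file#performances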
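To make the two scoring rules in this exchange concrete, here is a small sketch for a single patch in plain Python. It mirrors the arithmetic of the formulas quoted in the thread; the function names `mixed_log_probs` and `attention_only_score` are illustrative assumptions, and `attention` stands for the already averaged attention value, assumed to lie in [-1, 1].

```python
import math


def mixed_log_probs(net_log_probs, attention, attention_weight=0.8, eps=1e-8):
    """Blend network probabilities with attention for one patch.

    net_log_probs: [log P(keep), log P(drop)] from a LogSoftmax head.
    attention: scalar assumed to lie in [-1, 1].
    """
    att_keep = (1.0 + attention) / 2.0            # map [-1, 1] -> [0, 1]
    att_probs = [att_keep, 1.0 - att_keep]
    mixed = [
        (1.0 - attention_weight) * math.exp(lp) + attention_weight * ap + eps
        for lp, ap in zip(net_log_probs, att_probs)
    ]
    return [math.log(p) for p in mixed]


def attention_only_score(attention_x, attention_y):
    """The simpler rule discussed in the thread: sum the two attention scores."""
    return attention_x + attention_y


net = [math.log(0.4), math.log(0.6)]
log_p = mixed_log_probs(net, attention=0.5)
keep_prob = math.exp(log_p[0])   # 0.2 * 0.4 + 0.8 * 0.75 + 1e-8
```

Because both inputs are probability pairs that sum to 1, the blended pair also sums to 1 up to the small `eps` terms, so it can be passed on as log-probabilities. The attention-only rule skips the score network entirely and needs no logarithm.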

darkpromise98 commented 2 months ago

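You can also try adding the Differentiable Decision Matrix and Gumbel-softmax strategy to your own research. Compared with using the attention-weight scores directly for the decision, however, it may introduce some instability.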
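As an illustration of what deciding directly from attention scores can look like, the sketch below keeps a fixed fraction of patches by ranking their scores. This is an assumption made for illustration only: the thread does not show the selection step of the public code, and the names `select_top_patches` and `keep_ratio` are hypothetical.

```python
import math


def select_top_patches(scores, keep_ratio):
    """Return the sorted indices of the highest-scoring patches."""
    num_keep = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


scores = [0.9, 0.1, 0.5, 0.7]            # e.g. attention_x + attention_y per patch
kept = select_top_patches(scores, keep_ratio=0.5)   # keeps 2 of 4 patches
```

A ranking like this is deterministic and always returns the same number of patches for a given input length, which is one reason it may be more stable than a sampled decision.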

LiangYuHeng66 commented 2 months ago

OK, thank you very much!

lhqxcc commented 1 month ago

Hello, if gumbel_softmax() is used, how do you ensure that a predefined number of important patches is selected?

darkpromise98 commented 1 month ago

[1] https://github.com/raoyongming/DynamicViT/blob/master/losses.py#L48
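The linked file is not reproduced here. As a rough sketch, assuming the reference points to a penalty that pulls the fraction of kept patches toward a target ratio, such a term could look like the following in plain Python. The name `keep_ratio_loss` and its arguments are hypothetical and are not taken from the DynamicViT code.

```python
def keep_ratio_loss(decisions, target_ratio):
    """Mean squared gap between each sample's kept fraction and the target.

    decisions: list of per-sample lists with 1.0 (keep) / 0.0 (drop) entries.
    """
    gaps = []
    for sample in decisions:
        kept_fraction = sum(sample) / len(sample)
        gaps.append((kept_fraction - target_ratio) ** 2)
    return sum(gaps) / len(gaps)


batch = [
    [1.0, 1.0, 0.0, 0.0],   # keeps 50% of patches
    [1.0, 1.0, 1.0, 0.0],   # keeps 75% of patches
]
loss = keep_ratio_loss(batch, target_ratio=0.5)   # (0.0 + 0.0625) / 2
```

A penalty of this kind is a soft constraint: it encourages the sampled decisions to match the target ratio on average during training, but it does not by itself force an exact patch count for every sample.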

lhqxcc commented 1 month ago

OK, thank you very much for your answer!