Cogito2012 / DEAR

[ICCV 2021 Oral] Deep Evidential Action Recognition
Apache License 2.0

It seems there are no 3D operations in DebiasHead. #9

Closed. RongchangLi closed this issue 1 year ago.

RongchangLi commented 1 year ago

It seems that DebiasHead is used to implement CED, but there are no real 3D operations, even though some modules (self.f1_conv3d, self.f2_conv3d) are named with '3D': the temporal size of the convolution kernels is 1.

In that case, temporally shuffling the features cannot make any difference. In fact, there appears to be no real difference between the three branches:

1. f1_conv3d --> avg_pool --> fc1
2. temporal shuffling --> f2_conv3d --> avg_pool --> fc2
3. reshape --> f3_conv2d --> avg_pool --> fc3
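To see why, note that with a temporal kernel size of 1 the convolution processes each frame independently, and global average pooling is permutation-invariant over time. A minimal sketch with random tensors (shapes chosen to match the head's stated input) confirms that temporal shuffling changes nothing:

```python
import torch
import torch.nn as nn

# A Conv3d with temporal kernel size 1 applies the same 2D filter to each
# frame independently, so after global average pooling over (T, H, W) the
# output is invariant to any permutation of the frames.
conv = nn.Conv3d(16, 32, kernel_size=(1, 3, 3), stride=(1, 2, 2),
                 padding=(0, 1, 1), bias=False)
pool = nn.AdaptiveAvgPool3d((1, 1, 1))

x = torch.randn(2, 16, 8, 14, 14)      # (B, C, T, H, W)
x_shuf = x[:, :, torch.randperm(8)]    # temporal shuffling

with torch.no_grad():
    out = pool(conv(x)).flatten(1)
    out_shuf = pool(conv(x_shuf)).flatten(1)

print(torch.allclose(out, out_shuf, atol=1e-5))  # True: shuffling is a no-op
```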

Here is the code from https://github.com/Cogito2012/DEAR/tree/master/mmaction/models/heads/debias_head.py:

```python

@HEADS.register_module()
class DebiasHead(BaseHead):
    """Debias head.

    Args:
        num_classes (int): Number of classes to be classified.
        in_channels (int): Number of channels in input feature.
        loss_cls (dict): Config for building loss.
            Default: dict(type='EvidenceLoss').
        loss_factor (float): Weight applied to all losses of this head.
            Default: 0.1.
        hsic_factor (float): Weight of the HSIC terms; only used when
            alternative=True. Default: 0.5.
        alternative (bool): Whether to alternate between the two HSIC terms
            across training iterations. Default: False.
        bias_input (bool): Whether to use the temporally shuffled branch
            (f2_conv3d). Default: True.
        bias_network (bool): Whether to use the 2D convolution branch
            (f3_conv2d). Default: True.
        spatial_type (str): Pooling type in spatial dimension. Default: 'avg'.
        dropout_ratio (float): Probability of dropout layer. Default: 0.5.
        init_std (float): Std value for weight initialization. Default: 0.01.
        kwargs (dict, optional): Any keyword argument to be used to initialize
            the head.
    """

    def __init__(self,
                 num_classes,
                 in_channels,
                 loss_cls=dict(type='EvidenceLoss'),
                 loss_factor=0.1,
                 hsic_factor=0.5,  # useful when alternative=True
                 alternative=False,
                 bias_input=True,
                 bias_network=True,
                 dropout_ratio=0.5,
                 init_std=0.01,
                 **kwargs):
        super().__init__(num_classes, in_channels, loss_cls, **kwargs)
        self.bias_input = bias_input
        self.bias_network = bias_network
        assert bias_input or bias_network, "At least one of the choices (bias_input, bias_network) should be True!"
        self.loss_factor = loss_factor
        self.hsic_factor = hsic_factor
        self.alternative = alternative
        self.f1_conv3d = ConvModule(
            in_channels,
            in_channels * 2, (1, 3, 3),
            stride=(1, 2, 2),
            padding=(0, 1, 1),
            bias=False,
            conv_cfg=dict(type='Conv3d'),
            norm_cfg=dict(type='BN3d', requires_grad=True))
        if bias_input:
            self.f2_conv3d = ConvModule(
                in_channels,
                in_channels * 2, (1, 3, 3),
                stride=(1, 2, 2),
                padding=(0, 1, 1),
                bias=False,
                conv_cfg=dict(type='Conv3d'),
                norm_cfg=dict(type='BN3d', requires_grad=True))
        if bias_network:
            self.f3_conv2d = ConvModule(
                in_channels,
                in_channels * 2, (3, 3),
                stride=(2, 2),
                padding=(1, 1),
                bias=False,
                conv_cfg=dict(type='Conv2d'),
                norm_cfg=dict(type='BN', requires_grad=True))
        self.dropout_ratio = dropout_ratio
        self.init_std = init_std
        if self.dropout_ratio != 0:
            self.dropout = nn.Dropout(p=self.dropout_ratio)
        else:
            self.dropout = None
        self.f1_fc = nn.Linear(self.in_channels * 2, self.num_classes)
        self.f2_fc = nn.Linear(self.in_channels * 2, self.num_classes)
        self.f3_fc = nn.Linear(self.in_channels * 2, self.num_classes)
        self.avg_pool = nn.AdaptiveAvgPool3d((1, 1, 1))

    # ... (other methods omitted)

    def forward(self, x, num_segs=None, target=None, **kwargs):
        """Defines the computation performed at every call.

        Args:
            x (torch.Tensor): The input data. (B, 1024, 8, 14, 14)

        Returns:
            torch.Tensor: The classification scores for input samples.
        """
        feat = x.clone() if isinstance(x, torch.Tensor) else x[-2].clone()
        if len(feat.size()) == 4:  # for 2D recognizer
            assert num_segs is not None
            feat = feat.view((-1, num_segs) + feat.size()[1:]).transpose(1, 2).contiguous()
        # one-hot embedding for the target
        y = torch.eye(self.num_classes).to(feat.device)
        y = y[target]
        losses = dict()

        # f1_Conv3D(x)
        x = self.f1_conv3d(feat)  # (B, 2048, 8, 7, 7)
        feat_unbias = self.avg_pool(x).squeeze(-1).squeeze(-1).squeeze(-1)
        x = self.dropout(feat_unbias)
        x = self.f1_fc(x)
        alpha_unbias = self.exp_evidence(x) + 1
        # minimize the edl losses
        loss_cls1 = self.edl_loss(torch.log, alpha_unbias, y)
        losses.update({'loss_unbias_cls': loss_cls1})

        loss_hsic_f, loss_hsic_g = torch.zeros_like(loss_cls1), torch.zeros_like(loss_cls1)
        if self.bias_input:
            # f2_Conv3D(x)
            feat_shuffle = feat[:, :, torch.randperm(feat.size()[2])]
            x = self.f2_conv3d(feat_shuffle)  # (B, 2048, 8, 7, 7)
            feat_bias1 = self.avg_pool(x).squeeze(-1).squeeze(-1).squeeze(-1)
            x = self.dropout(feat_bias1)
            x = self.f2_fc(x)
            alpha_bias1 = self.exp_evidence(x) + 1
            # minimize the edl losses
            loss_cls2 = self.edl_loss(torch.log, alpha_bias1, y)
            losses.update({'loss_bias1_cls': loss_cls2})
            if self.alternative:
                # minimize HSIC w.r.t. feat_unbias, and maximize HSIC w.r.t. feat_bias1
                loss_hsic_f += self.hsic_factor * self.hsic_loss(feat_unbias, feat_bias1.detach(), unbiased=True) 
                loss_hsic_g += - self.hsic_factor * self.hsic_loss(feat_unbias.detach(), feat_bias1, unbiased=True)
            else:
                # maximize HSIC 
                loss_hsic1 = -1.0 * self.hsic_loss(alpha_unbias, alpha_bias1)
                losses.update({"loss_bias1_hsic": loss_hsic1})

        if self.bias_network:
            # f3_Conv2D(x)
            B, C, T, H, W = feat.size()
            feat_reshape = feat.permute(0, 2, 1, 3, 4).contiguous().view(-1, C, H, W)  # (B*T, C, H, W)
            x = self.f3_conv2d(feat_reshape)  # (B*T, 2048, 7, 7)
            x = x.view(B, T, x.size(-3), x.size(-2), x.size(-1)).permute(0, 2, 1, 3, 4)  # (B, 2048, 8, 7, 7)
            feat_bias2 = self.avg_pool(x).squeeze(-1).squeeze(-1).squeeze(-1)
            x = self.dropout(feat_bias2)
            x = self.f3_fc(x)
            alpha_bias2 = self.exp_evidence(x) + 1
            # minimize the edl losses
            loss_cls3 = self.edl_loss(torch.log, alpha_bias2, y)
            losses.update({'loss_bias2_cls': loss_cls3})
            if self.alternative:
                # minimize HSIC w.r.t. feat_unbias, and maximize HSIC w.r.t. feat_bias2
                loss_hsic_f += self.hsic_factor * self.hsic_loss(feat_unbias, feat_bias2.detach(), unbiased=True)
                loss_hsic_g += - self.hsic_factor * self.hsic_loss(feat_unbias.detach(), feat_bias2, unbiased=True)
            else:
                # maximize HSIC 
                loss_hsic2 = -1.0 * self.hsic_loss(alpha_unbias, alpha_bias2)
                losses.update({"loss_bias2_hsic": loss_hsic2})

        if self.alternative:
            # Here, we use odd iterations for minimizing hsic_f, and use even iterations for maximizing hsic_g
            assert 'iter' in kwargs, "iter number is missing!"
            loss_mask = kwargs['iter'] % 2
            loss_hsic = loss_mask * loss_hsic_f + (1 - loss_mask) * loss_hsic_g
            losses.update({'loss_hsic': loss_hsic})

        for k, v in losses.items():
            losses.update({k: v * self.loss_factor})
        return losses
```
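As an aside, when alternative=True the head expects the training loop to pass the current iteration index through kwargs. A hypothetical call site (the construction below is illustrative; the actual wiring inside the recognizer may differ):

```python
import torch

# Hypothetical usage of the alternating HSIC scheme: the head reads
# kwargs['iter'] and uses its parity to pick the active HSIC term
# (odd iterations apply loss_hsic_f, even iterations loss_hsic_g).
head = DebiasHead(num_classes=101, in_channels=1024, alternative=True)
feat = torch.randn(2, 1024, 8, 14, 14)   # (B, C, T, H, W)
labels = torch.tensor([3, 7])
for it in range(2):
    losses = head(feat, target=labels, iter=it)
```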
Cogito2012 commented 1 year ago

@RongchangLi Thanks for pointing out this issue. Yes, you're right, this was a mistake in the implementation of the module. The kernel sizes of self.f1_conv3d, self.f2_conv3d, and self.f3_conv2d can all be set to 3. The code lines you pointed out have been updated.
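For f1_conv3d (and likewise f2_conv3d), the update presumably amounts to something like the sketch below, with the temporal padding extended so T is preserved; this is an assumed form of the change, not necessarily the exact committed code:

```python
from mmcv.cnn import ConvModule

in_channels = 1024  # matches the head's input (B, 1024, 8, 14, 14)

# Assumed form of the fix inside DebiasHead.__init__: temporal kernel size 3
# so the branch actually mixes information across frames.
f1_conv3d = ConvModule(
    in_channels,
    in_channels * 2, (3, 3, 3),   # was (1, 3, 3)
    stride=(1, 2, 2),
    padding=(1, 1, 1),            # was (0, 1, 1)
    bias=False,
    conv_cfg=dict(type='Conv3d'),
    norm_cfg=dict(type='BN3d', requires_grad=True))
```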

After making this change, the results on HMDB-51 show around a 0.7% drop in Open maF1 and a 0.2% drop in Open Set AUC compared to the results reported in the main paper.

| Open maF1 | Open Set AUC | Closed Set Acc |
|---|---|---|
| 76.52 (0.17) | 76.81 | 94.03 |
RongchangLi commented 1 year ago

> @RongchangLi Thanks for pointing out this issue.

But 3D operations seem essential for the debiasing to work. When the three branches are the same, the head doesn't seem able to remove the static bias described in the paper. I am confused: why does it still work, and why does the performance drop after the fix?

Cogito2012 commented 1 year ago

If the three branches all use Conv2D, the loss functions will still remove appearance bias. For example, you may say that, in an implicit way, the middle branch learns foreground appearance features while the other two branches learn background appearance features.

As for the performance drop, it could be explained by the Conv3D features not being optimally learned, since I did not change any of the hyperparameters, which were tuned for the previous implementation.
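For context, HSIC (the Hilbert-Schmidt Independence Criterion) measures the statistical dependence between two batches of features; the head maximizes or minimizes it to control how related the branches' representations are. self.hsic_loss is not shown in the snippet above, but a generic biased estimator with Gaussian kernels looks roughly like the following (the kernel choice and normalization in the actual code may differ):

```python
import torch

def hsic_loss(x, y, sigma=1.0):
    """Biased HSIC estimator with Gaussian kernels (generic sketch; the
    repository's self.hsic_loss may differ in kernel and normalization)."""
    def gram(z):
        d2 = torch.cdist(z, z) ** 2               # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))  # Gaussian kernel matrix
    n = x.size(0)
    K, L = gram(x), gram(y)
    H = torch.eye(n, device=x.device) - 1.0 / n   # centering matrix
    # HSIC = trace(KHLH) / (n - 1)^2; larger => x and y are more dependent
    return torch.trace(K @ H @ L @ H) / ((n - 1) ** 2)
```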

RongchangLi commented 1 year ago

> If the three branches all use Conv2D, the loss functions will still remove appearance bias. For example, you may say that, in an implicit way, the middle branch learns foreground appearance features while the other two branches learn background appearance features.
>
> As for the performance drop, it could be explained by the Conv3D features not being optimally learned, since I did not change any of the hyperparameters, which were tuned for the previous implementation.

This discussion is insightful, so I would like to go into more detail. Is it too subjective to say that an implicit elimination of appearance bias has occurred? When all three branches are 2D, their inputs are high-level features extracted from the same spatio-temporal network, and the three branches have an identical, simple structure (a single convolution, pooling, and a fully-connected layer). The only difference is that the loss function pushes the output of one branch away from the outputs of the other two. Why can we say that this process eliminates appearance bias?

Cogito2012 commented 1 year ago

> Is it too subjective to say that an implicit elimination of appearance bias has occurred? When all three branches are 2D, their inputs are high-level features extracted from the same spatio-temporal network, and the three branches have an identical, simple structure (a single convolution, pooling, and a fully-connected layer). The only difference is that the loss function pushes the output of one branch away from the outputs of the other two. Why can we say that this process eliminates appearance bias?

I see your point. In this case, the middle branch and the other two will learn two sets of features that are independent of each other but both discriminative for classification. Without an explicit inductive bias in the design (such as Conv2D/shuffle vs. Conv3D), we indeed cannot claim a debiasing effect.

RongchangLi commented 1 year ago

Thank you very much for patiently explaining; it helps a lot.