Application of the method on regression problems

Dear authors,

Thank you for your wonderful and interesting work! I have one question about adapting your method to regression problems. When the label space is continuous, as in monocular depth estimation, could you please provide some insight into how the current version should be modified?

Thanks in advance!

I have the same question.

I'm trying to apply this approach to a binary classification problem in which a single final FC layer outputs a probability between 0 and 1.

I'm afraid that RSC is not easily applicable when we have a continuous output. However, I hope that the authors @Justinhzy @dghuangGH will give us some insights.

Hi,

Just as the classification version backpropagates from the logits, on regression problems you can backpropagate from the prediction, before the loss is computed. Switching from classification to regression may require some small modifications to the code, but the idea is the same.
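A minimal sketch of what backpropagating from the prediction could look like with a single continuous output. The tensor shapes, the linear head, and the one-third drop ratio are illustrative assumptions and are not taken from the RSC code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: a batch of flattened features and a one-output head.
feats = torch.randn(4, 8, requires_grad=True)   # (batch, feature_dim)
head = nn.Linear(8, 1)

pred = head(feats)                               # (batch, 1) continuous predictions

# Classification RSC differentiates sum(logits * one_hot). With one continuous
# output there is no one-hot vector, so the prediction itself is differentiated.
grads = torch.autograd.grad(pred.sum(), feats)[0]  # same shape as feats

# Mute the features with the largest gradients, per sample (assumed ratio: 1/3).
drop_ratio = 1 / 3
threshold = torch.quantile(grads, 1 - drop_ratio, dim=1, keepdim=True)
mask = (grads < threshold).float()

# Prediction after muting; this is what would be fed to the regression loss.
muted_pred = head(feats * mask)
```

Whether "largest gradient of the prediction" is the right notion of importance for a regression target is exactly the open question in this thread, so this is a starting point rather than a validated recipe.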

If you don't mind, please share the regression part of your implementation with me (zeyih(at)andrew(dot)cmu(dot)edu). I can take a quick look.


-- Warm Regards ---Zeyi Huang

@Justinhzy Thank you for your quick reply! Yes, of course I can share the code.

This is my implementation for a regression problem. I started from your implementation and commented out some parts. It compiles, but I'm not sure that it is conceptually correct. I marked the lines that I removed with #### REMOVED #### and the lines that I modified to fit the regression problem with #### CHANGED ####.

import math
import random

import torch
import torch.nn as nn
import torchvision


class ResNet50(nn.Module):
    def __init__(self, n_classes=1, pretrained=True, hidden_size=2048, dropout=0.5):
        super().__init__()
        self.resnet = torchvision.models.resnet50(pretrained=pretrained)
        self.resnet.fc = nn.Linear(2048, hidden_size)
        self.fc = nn.Linear(hidden_size, n_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)

        self.pecent = 1/3

    def require_all_grads(self):
        for param in self.parameters():
            param.requires_grad = True

    def forward(self, x,gt=None,flag=None,epoch=None):

        x = self.resnet.conv1(x)
        x = self.resnet.bn1(x)
        x = self.resnet.relu(x)
        x = self.resnet.maxpool(x)

        x = self.resnet.layer1(x)
        x = self.resnet.layer2(x)
        x = self.resnet.layer3(x)
        x = self.resnet.layer4(x)

        if flag:
            interval = 10
            if epoch % interval == 0:
                self.pecent = 3.0 / 10 + (epoch / interval) * 2.0 / 10

            self.eval()
            x_new = x.clone().detach()
            x_new = torch.tensor(x_new.data, requires_grad=True)
            x_new_view = self.resnet.avgpool(x_new)
            x_new_view = x_new_view.view(x_new_view.size(0), -1)
            output = self.fc(self.dropout(self.relu(x_new_view)))
            class_num = output.shape[1]
            index = gt
            num_rois = x_new.shape[0]
            num_channel = x_new.shape[1]
            H = x_new.shape[2]
            HW = x_new.shape[2] * x_new.shape[3]
            one_hot = torch.zeros((1), dtype=torch.float32).cuda()
            one_hot = torch.tensor(one_hot, requires_grad=False)
            sp_i = torch.ones([2, num_rois]).long()
            sp_i[0, :] = torch.arange(num_rois)
            sp_i[1, :] = index
            sp_v = torch.ones([num_rois])
            #### REMOVED #### one_hot_sparse = torch.sparse.FloatTensor(sp_i, sp_v, torch.Size([num_rois, class_num])).to_dense().cuda()
            #### REMOVED #### one_hot_sparse = torch.tensor(one_hot_sparse, requires_grad=False)
            #### CHANGED #### one_hot = torch.sum(output*one_hot_sparse)
            one_hot = torch.sum(output)
            self.zero_grad()
            one_hot.backward()
            grads_val = x_new.grad.clone().detach()
            grad_channel_mean = torch.mean(grads_val.view(num_rois, num_channel, -1), dim=2)
            channel_mean = grad_channel_mean
            grad_channel_mean = grad_channel_mean.view(num_rois, num_channel, 1, 1)
            spatial_mean = torch.sum(x_new * grad_channel_mean, 1)
            spatial_mean = spatial_mean.view(num_rois, HW)
            self.zero_grad()

            choose_one = random.randint(0, 9)
            if choose_one <= 4:
                # ---------------------------- spatial -----------------------
                spatial_drop_num = math.ceil(HW * 1 / 3.0)
                th18_mask_value = torch.sort(spatial_mean, dim=1, descending=True)[0][:, spatial_drop_num]
                #### CHANGED ####  th18_mask_value = th18_mask_value.view(num_rois, 1).expand(num_rois, 49)
                # HW replaces a hard-coded 16, which only matched a 4x4 feature map
                th18_mask_value = th18_mask_value.view(num_rois, 1).expand(num_rois, HW)
                mask_all_cuda = torch.where(spatial_mean > th18_mask_value, torch.zeros(spatial_mean.shape).cuda(),
                                            torch.ones(spatial_mean.shape).cuda())
                mask_all = mask_all_cuda.reshape(num_rois, H, H).view(num_rois, 1, H, H)
            else:
                # -------------------------- channel ----------------------------
                vector_thresh_percent = math.ceil(num_channel * 1 / 3.2)
                vector_thresh_value = torch.sort(channel_mean, dim=1, descending=True)[0][:, vector_thresh_percent]
                vector_thresh_value = vector_thresh_value.view(num_rois, 1).expand(num_rois, num_channel)
                vector = torch.where(channel_mean > vector_thresh_value,
                                     torch.zeros(channel_mean.shape).cuda(),
                                     torch.ones(channel_mean.shape).cuda())
                mask_all = vector.view(num_rois, num_channel, 1, 1)

            # ----------------------------------- batch ----------------------------------------
            #### CHANGED #### cls_prob_before = F.softmax(output, dim=1)
            cls_prob_before = torch.sigmoid(output).squeeze()
            x_new_view_after = x_new * mask_all
            x_new_view_after = self.resnet.avgpool(x_new_view_after)
            x_new_view_after = x_new_view_after.view(x_new_view_after.size(0), -1)
            x_new_view_after = self.fc(self.dropout(self.relu(x_new_view_after)))
            #### CHANGED #### cls_prob_after = F.softmax(x_new_view_after, dim=1)
            cls_prob_after = torch.sigmoid(x_new_view_after).squeeze()

            sp_i = torch.ones([2, num_rois]).long()
            sp_i[0, :] = torch.arange(num_rois)
            sp_i[1, :] = index
            sp_v = torch.ones([num_rois])
            #### REMOVED #### one_hot_sparse = torch.sparse.FloatTensor(sp_i, sp_v, torch.Size([num_rois, class_num])).to_dense().cuda()
            #### CHANGED #### before_vector = torch.sum(one_hot_sparse * cls_prob_before, dim=1)
            before_vector = cls_prob_before
            #### CHANGED #### after_vector = torch.sum(one_hot_sparse * cls_prob_after, dim=1)
            after_vector = cls_prob_after
            change_vector = before_vector - after_vector - 0.0001

            change_vector = torch.where(change_vector > 0, change_vector, torch.zeros(change_vector.shape).cuda())
            th_fg_value = torch.sort(change_vector, dim=0, descending=True)[0][int(round(float(num_rois) * self.pecent))]
            drop_index_fg = change_vector.gt(th_fg_value).long()
            ignore_index_fg = 1 - drop_index_fg
            not_01_ignore_index_fg = ignore_index_fg.nonzero()[:, 0]
            mask_all[not_01_ignore_index_fg.long(), :] = 1

            self.train()
            mask_all = torch.tensor(mask_all, requires_grad=True)
            x = x * mask_all

        features = self.resnet.avgpool(x)
        features = features.view(features.size(0), -1)
        outputs = self.fc(self.dropout(self.relu(features)))

        return outputs, features

I hope it is clear. Thank you in advance for your help.

Silvia

Looks good to me. Again, I haven't fully explored RSC on regression. So I list some suggestions and minor issues below.

1) It's better to start with vanilla RSC, which fixes the hyperparameters and uses only the spatial dimension. I suggest first removing the lines below:

    interval = 10
    if epoch % interval == 0:
        self.pecent = 3.0 / 10 + (epoch / interval) * 2.0 / 10

and changing "if choose_one <= 4:" to "if choose_one <= 9:". If that works fine, you can change this part back.

2) If I remember correctly, the official ResNet doesn't use a dropout layer. Adding a high-ratio dropout layer to ResNet will degrade RSC performance, so I suggest not using self.dropout with our implementation.

Btw, if you are curious about a fairer comparison, feel free to take a look at the latest paper: https://arxiv.org/abs/2106.03721.
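To see why fixing the hyperparameters first is a safe starting point, the schedule from suggestion 1) can be evaluated in isolation. This is a small sketch in plain Python; the helper names and the batch size of 32 are assumptions made for illustration, while the two formulas are copied from the posted forward():

```python
def scheduled_percent(epoch, interval=10):
    """Batch percentage as computed by the schedule in the posted code."""
    return 3.0 / 10 + (epoch / interval) * 2.0 / 10


def threshold_index(num_rois, percent):
    """Index into the sorted change vector, as in the posted forward()."""
    return int(round(float(num_rois) * percent))


batch_size = 32  # assumed batch size, for illustration only
epochs = [0, 10, 20, 30, 40]
indices = [threshold_index(batch_size, scheduled_percent(e)) for e in epochs]

for epoch, idx in zip(epochs, indices):
    status = "ok" if idx < batch_size else "out of range"
    print(f"epoch {epoch:2d}: index {idx} ({status})")
```

With these assumed numbers the percentage grows past 1.0 at epoch 40, so the index (35) falls beyond the end of a 32-element batch. A longer training run would need a cap on the schedule, which is one more reason to debug with the fixed value first.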


-- Warm Regards ---Zeyi Huang

Thank you for the reply @Justinhzy

But I still have some doubts, in particular about applying Formula (1) of your paper.

In the standard classification problem (so not continuous) we want (h(z; θtop) ⊙ y) to be high, because we want the maximum predicted probability for the correct class. When we compute the gradient ∂(h(z; θtop) ⊙ y)/∂z, we identify the components of z that produce the largest increase in (h(z; θtop) ⊙ y), and so the formula works.

In the regression problem, instead, we cannot do the element-wise multiplication, since h(z; θtop) is already a (1 x batch_size) vector. We also don't want to maximize it: we want a low probability for samples from class 0 and a high probability for samples from class 1, so the output should be something like [0.1, 0.9, 0.1, 0.95] for the ground truth [0, 1, 0, 1]. Computing ∂(h(z; θtop))/∂z doesn't make sense to me, because it finds the components of z that produce a larger h(z; θtop), regardless of the class.
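One possible way to address this concern, sketched under the assumption that the network outputs a single logit s = h(z; θtop) and that y ∈ {0, 1}. The symbols s, ỹ, σ and ℓ are introduced here for illustration and do not come from the paper. Replacing the one-hot product with a signed score makes "larger" mean "more correct" for both classes, and the result is related to the gradient of the binary cross-entropy loss ℓ by a sign flip and a positive per-sample factor:

```latex
\tilde{y} = 2y - 1 \in \{-1, +1\}, \qquad
g = \frac{\partial\,(\tilde{y}\, s)}{\partial z}
  = \tilde{y}\,\frac{\partial s}{\partial z}

\ell = -\bigl[\, y \log \sigma(s) + (1 - y) \log\bigl(1 - \sigma(s)\bigr) \bigr], \qquad
\frac{\partial \ell}{\partial z}
  = \bigl(\sigma(s) - y\bigr)\,\frac{\partial s}{\partial z}
  = -\,\bigl|\sigma(s) - y\bigr|\; g
```

For y = 1 the signed score is s itself, and for y = 0 it is −s, so the components of z with the largest g are the ones that most support the correct class in either case. Because the factor |σ(s) − y| is positive and constant within a sample, ranking the components of one sample by g or by −∂ℓ/∂z gives the same order.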

I don't know if I was clear. What do you think about that?

Thank you again, Silvia

Your implementation looks good to me if you intend to backprop from the prediction.

You can also try to backprop from the loss (all_g = autograd.grad(loss, all_f)[0]). I haven't tried it, but I'd guess there will be some difference between classification and regression.
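A hedged sketch of the backprop-from-loss variant for a single-logit binary head. The feature tensor, the linear head, and the targets are made-up stand-ins; only the autograd.grad(loss, all_f) pattern comes from the reply above. The sketch also checks how the loss gradient relates to the gradient of a signed logit, since the loss decreases when the correct class is favoured:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

feats = torch.randn(4, 8, requires_grad=True)   # hypothetical features (all_f)
head = torch.nn.Linear(8, 1)                    # hypothetical single-logit head
targets = torch.tensor([0.0, 1.0, 0.0, 1.0])

logits = head(feats).squeeze(1)

# Backprop from the loss; reduction="sum" keeps per-sample gradients
# independent of the batch size.
loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="sum")
grad_loss = torch.autograd.grad(loss, feats, retain_graph=True)[0]

# Backprop from a signed logit: +logit for class 1, -logit for class 0.
signed = ((2 * targets - 1) * logits).sum()
grad_signed = torch.autograd.grad(signed, feats)[0]

# The loss gradient points the opposite way, scaled by |sigmoid(logit) - y|.
scale = (torch.sigmoid(logits) - targets).abs().unsqueeze(1).detach()
same_direction = torch.allclose(-grad_loss, scale * grad_signed, atol=1e-6)

# Features that most support the correct class have the largest -grad_loss.
importance = -grad_loss
```

Under these assumptions, masking by the top percentile of grad_loss directly would select the opposite features, so the sign is worth checking before reusing the classification masking code unchanged.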

Let me know if you have any questions.

On Wed, Apr 27, 2022 at 2:41 AM silvia1993 wrote:

@Justinhzy Thank you for your quick reply! Yes, of course I can share the code.

I started from the DomainBed implementation: https://github.com/facebookresearch/DomainBed/blob/8f231f293470b46486182fbb19f3e2b05994de80/domainbed/algorithms.py#L866

This is my implementation for a regression problem. My main concern is Equation (1), where the element-wise product should be taken with the one-hot y vector.

def __init__(self, opt, modelpath=None, learning_rate=1e-4):
    self.model = ResNet50(n_classes=1, pretrained=True)
    self.drop_f = (1 - 1/3) * 100
    self.drop_b = (1 - 1/3) * 100

def do_iteration_RSC(self, loader):
    self.model.train()

    images, targets = next(loader)

    # Resnet50 model
    # all_f -> features before average pooling
    # all_p -> final output, made by just one value (regression problem)
    all_p, all_f = self.forward(images)

    # Equation (1): compute gradients with respect to representation
    # HERE MY DOUBT ->  in DomainBed (all_p * all_o) with all_o one-hot label vector
    all_g = autograd.grad((all_p).sum(), all_f)[0]

    # Equation (2): compute top-gradient-percentile mask
    percentiles = np.percentile(all_g.cpu(), self.drop_f, axis=1)
    percentiles = torch.Tensor(percentiles)
    percentiles = percentiles.unsqueeze(1).repeat(1, all_g.size(1))
    mask_f = all_g.lt(percentiles).float()

    # Equation (3): mute top-gradient-percentile activations
    all_f_muted = all_f * mask_f

    # Equation (4): compute muted predictions
    all_p_muted = self.model.classifier(all_f_muted)

    # Section 3.3: Batch Percentage
    changes = (all_p).sum(1) - (all_p_muted).sum(1)
    percentile = np.percentile(changes.detach().cpu(), self.drop_b)
    mask_b = changes.lt(percentile).float().view(-1, 1)
    mask = torch.logical_or(mask_f, mask_b).float()

    # Equations (3) and (4) again, this time muting over examples
    all_p_muted_again = self.model.classifier(all_f * mask)

    self.optimizer.zero_grad()

    lossbce = torch.nn.BCEWithLogitsLoss()
    loss = lossbce(all_p_muted_again.squeeze(), targets)
    loss.backward()
    self.optimizer.step()

    return loss.item()

-- Warm Regards ---Zeyi Huang

DeLightCMU / RSC

Application of the method on regression problems #22