Open Huage001 opened 2 years ago
I have the same doubt.
I'm trying to apply this approach to a binary classification problem in which a single final FC layer outputs a probability between 0 and 1.
I'm afraid that RSC is not easily applicable when we have a continuous output. However, I hope that the authors @Justinhzy @dghuangGH will give us some insights.
Hi,
As in the classification task, where RSC backpropagates from the logits, on regression problems you can backpropagate from the prediction before the loss is calculated. Switching from classification to regression may need some small modifications to the code, but the idea is the same.
If you don't mind, please share the relevant part of your regression implementation with me (zeyih(at)andrew(dot)cmu(dot)edu). I can take a quick look.
-- Warm Regards ---Zeyi Huang
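The idea of backpropagating from the prediction rather than from a class logit can be illustrated with a small sketch. It is only an illustration under assumptions, not the authors' implementation: `head` is a hypothetical stand-in for the layers above the feature map, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the layers above the feature map (pool + linear head).
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

# Fake feature map: batch of 4, 8 channels, 4x4 spatial grid.
features = torch.randn(4, 8, 4, 4)

# Detach from the backbone and let autograd track the copy.
z = features.detach().clone().requires_grad_(True)

pred = head(z)        # shape (4, 1): one raw prediction per sample
score = pred.sum()    # scalar, so it can be differentiated directly
(grads,) = torch.autograd.grad(score, z)

print(grads.shape)    # same shape as the feature map
```

Because each sample's prediction depends only on its own features here, summing over the batch still gives every sample the gradient of its own prediction.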
@Justinhzy Thank you for your quick reply! Yes, of course I can share the code.
This is my implementation for a regression problem. I started from your implementation and commented out some parts. It runs, but I'm not sure that it is conceptually correct. I marked with #### REMOVED #### the lines that I removed and with #### CHANGED #### the original lines that I modified to fit the regression problem.
import math
import random

import torch
import torch.nn as nn
import torchvision


class ResNet50(nn.Module):
    def __init__(self, n_classes=1, pretrained=True, hidden_size=2048, dropout=0.5):
        super().__init__()
        self.resnet = torchvision.models.resnet50(pretrained=pretrained)
        self.resnet.fc = nn.Linear(2048, hidden_size)
        self.fc = nn.Linear(hidden_size, n_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.pecent = 1/3

    def require_all_grads(self):
        for param in self.parameters():
            param.requires_grad = True

    def forward(self, x, gt=None, flag=None, epoch=None):
        x = self.resnet.conv1(x)
        x = self.resnet.bn1(x)
        x = self.resnet.relu(x)
        x = self.resnet.maxpool(x)
        x = self.resnet.layer1(x)
        x = self.resnet.layer2(x)
        x = self.resnet.layer3(x)
        x = self.resnet.layer4(x)

        if flag:
            interval = 10
            if epoch % interval == 0:
                self.pecent = 3.0 / 10 + (epoch / interval) * 2.0 / 10

            self.eval()
            x_new = x.clone().detach()
            x_new = torch.tensor(x_new.data, requires_grad=True)
            x_new_view = self.resnet.avgpool(x_new)
            x_new_view = x_new_view.view(x_new_view.size(0), -1)
            output = self.fc(self.dropout(self.relu(x_new_view)))
            class_num = output.shape[1]
            index = gt
            num_rois = x_new.shape[0]
            num_channel = x_new.shape[1]
            H = x_new.shape[2]
            HW = x_new.shape[2] * x_new.shape[3]
            one_hot = torch.zeros((1), dtype=torch.float32).cuda()
            one_hot = torch.tensor(one_hot, requires_grad=False)
            sp_i = torch.ones([2, num_rois]).long()
            sp_i[0, :] = torch.arange(num_rois)
            sp_i[1, :] = index
            sp_v = torch.ones([num_rois])
            #### REMOVED #### one_hot_sparse = torch.sparse.FloatTensor(sp_i, sp_v, torch.Size([num_rois, class_num])).to_dense().cuda()
            #### REMOVED #### one_hot_sparse = torch.tensor(one_hot_sparse, requires_grad=False)
            #### CHANGED #### one_hot = torch.sum(output*one_hot_sparse)
            one_hot = torch.sum(output)
            self.zero_grad()
            one_hot.backward()
            grads_val = x_new.grad.clone().detach()
            grad_channel_mean = torch.mean(grads_val.view(num_rois, num_channel, -1), dim=2)
            channel_mean = grad_channel_mean
            grad_channel_mean = grad_channel_mean.view(num_rois, num_channel, 1, 1)
            spatial_mean = torch.sum(x_new * grad_channel_mean, 1)
            spatial_mean = spatial_mean.view(num_rois, HW)
            self.zero_grad()

            choose_one = random.randint(0, 9)
            if choose_one <= 4:
                # ---------------------------- spatial -----------------------
                spatial_drop_num = math.ceil(HW * 1 / 3.0)
                th18_mask_value = torch.sort(spatial_mean, dim=1, descending=True)[0][:, spatial_drop_num]
                #### CHANGED #### th18_mask_value = th18_mask_value.view(num_rois, 1).expand(num_rois, 49)
                th18_mask_value = th18_mask_value.view(num_rois, 1).expand(num_rois, 16)
                mask_all_cuda = torch.where(spatial_mean > th18_mask_value, torch.zeros(spatial_mean.shape).cuda(),
                                            torch.ones(spatial_mean.shape).cuda())
                mask_all = mask_all_cuda.reshape(num_rois, H, H).view(num_rois, 1, H, H)
            else:
                # -------------------------- channel ----------------------------
                vector_thresh_percent = math.ceil(num_channel * 1 / 3.2)
                vector_thresh_value = torch.sort(channel_mean, dim=1, descending=True)[0][:, vector_thresh_percent]
                vector_thresh_value = vector_thresh_value.view(num_rois, 1).expand(num_rois, num_channel)
                vector = torch.where(channel_mean > vector_thresh_value,
                                     torch.zeros(channel_mean.shape).cuda(),
                                     torch.ones(channel_mean.shape).cuda())
                mask_all = vector.view(num_rois, num_channel, 1, 1)

            # ----------------------------------- batch ----------------------------------------
            #### CHANGED #### cls_prob_before = F.softmax(output, dim=1)
            cls_prob_before = torch.sigmoid(output).squeeze()
            x_new_view_after = x_new * mask_all
            x_new_view_after = self.resnet.avgpool(x_new_view_after)
            x_new_view_after = x_new_view_after.view(x_new_view_after.size(0), -1)
            x_new_view_after = self.fc(self.dropout(self.relu(x_new_view_after)))
            #### CHANGED #### cls_prob_before = F.softmax(output, dim=1)
            cls_prob_after = torch.sigmoid(x_new_view_after).squeeze()  # F.softmax(x_new_view_after, dim=1)

            sp_i = torch.ones([2, num_rois]).long()
            sp_i[0, :] = torch.arange(num_rois)
            sp_i[1, :] = index
            sp_v = torch.ones([num_rois])
            #### REMOVED #### one_hot_sparse = torch.sparse.FloatTensor(sp_i, sp_v, torch.Size([num_rois, class_num])).to_dense().cuda()
            #### CHANGED #### before_vector = torch.sum(one_hot_sparse * cls_prob_before, dim=1)
            before_vector = cls_prob_before
            #### CHANGED #### after_vector = torch.sum(one_hot_sparse * cls_prob_after, dim=1)
            after_vector = cls_prob_after
            change_vector = before_vector - after_vector - 0.0001
            change_vector = torch.where(change_vector > 0, change_vector, torch.zeros(change_vector.shape).cuda())
            th_fg_value = torch.sort(change_vector, dim=0, descending=True)[0][int(round(float(num_rois) * self.pecent))]
            drop_index_fg = change_vector.gt(th_fg_value).long()
            ignore_index_fg = 1 - drop_index_fg
            not_01_ignore_index_fg = ignore_index_fg.nonzero()[:, 0]
            mask_all[not_01_ignore_index_fg.long(), :] = 1

            self.train()
            mask_all = torch.tensor(mask_all, requires_grad=True)
            x = x * mask_all

        features = self.resnet.avgpool(x)
        features = features.view(features.size(0), -1)
        outputs = self.fc(self.dropout(self.relu(features)))
        return outputs, features
I hope it is clear. Thank you in advance for your help.
Silvia
Looks good to me. Again, I haven't fully explored RSC on regression, so I list some suggestions and minor issues below.
1) It's better to start with vanilla RSC, which fixes the hyperparameters and only uses the spatial dimension. I suggest first removing the lines "interval = 10", "if epoch % interval == 0:" and "self.pecent = 3.0 / 10 + (epoch / interval) * 2.0 / 10", and changing "if choose_one <= 4:" to "if choose_one <= 9:". If that works fine, you can change these parts back.
2) If I remember correctly, the official ResNet doesn't use a dropout layer. Adding a high-ratio dropout layer to ResNet will degrade RSC performance, so I suggest not using self.dropout in our implementation.
Btw, if you are curious about a fairer comparison, feel free to take a look at the latest paper: https://arxiv.org/abs/2106.03721.
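A spatial-only variant with a fixed drop fraction, as suggested in point 1, might look like the sketch below. It is an assumption-laden illustration rather than the official code: the helper name `spatial_rsc_mask` and its `drop_fraction` parameter are hypothetical, and the spatial size is read from the tensor instead of being hard-coded (49 or 16 in the posted code).

```python
import math
import torch

def spatial_rsc_mask(features, grads, drop_fraction=1.0 / 3.0):
    """Return an (N, 1, H, W) mask that zeroes the top-scoring spatial cells."""
    n, c, h, w = features.shape
    # Channel weights: spatial mean of the gradient, as in the posted code.
    channel_weight = grads.mean(dim=(2, 3), keepdim=True)        # (N, C, 1, 1)
    # Per-location score: gradient-weighted sum over channels.
    spatial_score = (features * channel_weight).sum(dim=1).view(n, h * w)
    drop_num = math.ceil(h * w * drop_fraction)
    # Threshold: the value at index drop_num of the descending sort, per sample.
    threshold = spatial_score.sort(dim=1, descending=True)[0][:, drop_num]
    keep = spatial_score <= threshold.unsqueeze(1)               # True = keep
    return keep.float().view(n, 1, h, w)

torch.manual_seed(0)
features = torch.randn(2, 8, 4, 4)   # illustrative sizes only
grads = torch.randn(2, 8, 4, 4)
mask = spatial_rsc_mask(features, grads)
dropped_per_sample = (mask == 0).view(2, -1).sum(dim=1)
```

With a 4x4 grid and a drop fraction of 1/3, `math.ceil(16 / 3)` gives 6 muted locations per sample, provided the scores have no ties at the threshold.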
Thank you for the reply @Justinhzy
But I still have some doubts, in particular about applying Formula (1) of your paper.
In the standard classification problem (so not continuous) we want (h(z; θtop) ⊙ y) to be high, because we want the highest predicted probability on the correct class. When we compute the gradient ∂(h(z; θtop) ⊙ y)/∂z, we identify the components of z that produce the largest increase of (h(z; θtop) ⊙ y), so the formula works.
In the regression problem, instead, we cannot do the element-wise multiplication, since h(z; θtop) is already a (1 x Batch_size) vector, and we don't want to maximize it: we want a low probability for samples from class 0 and a high probability for samples from class 1. So the output should be something like [0.1, 0.9, 0.1, 0.95] with ground truth [0, 1, 0, 1]. Computing ∂(h(z; θtop))/∂z doesn't make sense to me, because I would find the components of z that produce a larger h(z; θtop).
I don't know if I was clear. What do you think about that?
Thank you again, Silvia
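One possible way to make the quantity in Formula (1) label-aware for a single-output binary classifier is to flip the sign of the logit for class-0 samples. This is a sketch of an assumption, not something the authors confirm in this thread; the helper name `signed_logit` is hypothetical. It relies only on the identity that sigmoid((2y - 1) * logit) is the probability assigned to the true label y in {0, 1}.

```python
import torch

def signed_logit(logits, targets):
    """Logit of the probability assigned to the correct class (targets in {0, 1})."""
    return (2.0 * targets - 1.0) * logits

logits = torch.tensor([-2.0, 2.5, -1.5, 3.0])   # raw outputs before the sigmoid
targets = torch.tensor([0.0, 1.0, 0.0, 1.0])

p_correct = torch.sigmoid(signed_logit(logits, targets))

# Same quantity computed the long way: p for class 1, (1 - p) for class 0.
p = torch.sigmoid(logits)
p_reference = torch.where(targets == 1, p, 1 - p)
```

Under this assumption, the gradient of `signed_logit(...).sum()` with respect to the features would play the role of ∂(h(z; θtop) ⊙ y)/∂z, since increasing it raises the probability of the correct label for both classes.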
Your implementation looks good to me if you intend to backprop from the prediction.
Also, you can try to backprop from the loss (all_g = autograd.grad(loss, all_f)[0]). I haven't tried it, but I guess there should be some difference between classification and regression.
Let me know if you have any questions.
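Backpropagating from the loss can be sketched as follows. The setup is hypothetical (a single linear `classifier` on pooled features, BCE-with-logits loss, arbitrary sizes) and is meant only to show the `autograd.grad(loss, all_f)` call and the sign convention, not a tested RSC variant.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
classifier = nn.Linear(16, 1)                    # hypothetical head on pooled features
all_f = torch.randn(4, 16, requires_grad=True)   # stand-in for backbone features
targets = torch.tensor([0.0, 1.0, 0.0, 1.0])

logits = classifier(all_f).squeeze(1)
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)

# Gradient of the loss with respect to the features.
all_g = torch.autograd.grad(loss, all_f)[0]

# To first order, a positive entry of -all_g marks a feature whose increase lowers the loss.
saliency = -all_g
```

For this linear head, each row of `saliency` points in the same direction as the gradient of the label-signed logit (2y - 1) * logit, scaled by a positive per-sample factor, so the loss-based ranking already takes the label into account.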
On Wed, Apr 27, 2022 at 2:41 AM silvia1993 @.***> wrote:
@Justinhzy Thank you for your quick reply! Yes, of course I can share the code.
I started from the DomainBed implementation: https://github.com/facebookresearch/DomainBed/blob/8f231f293470b46486182fbb19f3e2b05994de80/domainbed/algorithms.py#L866
This is my implementation considering a regression problem, my main concern is about Equation (1) for the element-wise product that should be done with the one-hot y vector.
def __init__(self, opt, modelpath=None, learning_rate=1e-4):
    self.model = ResNet50(n_classes=1, pretrained=True)
    self.drop_f = (1 - 1/3) * 100
    self.drop_b = (1 - 1/3) * 100

def do_iteration_RSC(self, loader):
    self.model.train()
    images, targets = next(loader)

    # Resnet50 model
    # all_f -> features before average pooling
    # all_p -> final output, made by just one value (regression problem)
    all_p, all_f = self.forward(images)

    # Equation (1): compute gradients with respect to representation
    # HERE MY DOUBT -> in DomainBed (all_p * all_o) with all_o one-hot label vector
    all_g = autograd.grad((all_p).sum(), all_f)[0]

    # Equation (2): compute top-gradient-percentile mask
    percentiles = np.percentile(all_g.cpu(), self.drop_f, axis=1)
    percentiles = torch.Tensor(percentiles)
    percentiles = percentiles.unsqueeze(1).repeat(1, all_g.size(1))
    mask_f = all_g.lt(percentiles).float()

    # Equation (3): mute top-gradient-percentile activations
    all_f_muted = all_f * mask_f

    # Equation (4): compute muted predictions
    all_p_muted = self.model.classifier(all_f_muted)

    # Section 3.3: Batch Percentage
    changes = (all_p).sum(1) - (all_p_muted).sum(1)
    percentile = np.percentile(changes.detach().cpu(), self.drop_b)
    mask_b = changes.lt(percentile).float().view(-1, 1)
    mask = torch.logical_or(mask_f, mask_b).float()

    # Equations (3) and (4) again, this time mutting over examples
    all_p_muted_again = self.model.classifier(all_f * mask)

    self.optimizer.zero_grad()
    lossbce = torch.nn.BCEWithLogitsLoss()
    loss = lossbce(all_p_muted_again.squeeze(), targets)
    loss.backward()
    self.optimizer.step()

    return loss.item()
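The percentile mask of Equation (2) can also be written without the NumPy round trip. The sketch below assumes pooled, two-dimensional features of shape (N, D), which is what `all_g.size(1)` in the snippet above implies; the helper name `top_gradient_mask` is hypothetical. It uses `torch.quantile`, whose default linear interpolation matches the default of `np.percentile`.

```python
import torch

def top_gradient_mask(all_g, drop_percent):
    """Keep (1.0) features whose gradient lies below the per-sample percentile."""
    q = drop_percent / 100.0
    thresholds = torch.quantile(all_g, q, dim=1, keepdim=True)   # shape (N, 1)
    return all_g.lt(thresholds).float()

torch.manual_seed(0)
all_g = torch.randn(4, 12)   # hypothetical gradients for 12 pooled features
mask_f = top_gradient_mask(all_g, drop_percent=(1 - 1 / 3) * 100)
kept_per_sample = mask_f.sum(dim=1)
```

With `drop_percent` set to (1 - 1/3) * 100, roughly the top third of gradient values per sample is muted: 4 of the 12 features in this example, assuming no ties at the threshold. The mask stays on the same device as `all_g`.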
Dear authors,
Thank you for your wonderful and interesting work! I have one question about adapting your method to regression problems. When the label space is continuous, as in monocular depth estimation, could you please provide some insights on how to modify the current version?
Thanks in advance!