GuYuc / WS-DAN.PyTorch

A PyTorch implementation of WS-DAN (Weakly Supervised Data Augmentation Network) for FGVC (Fine-Grained Visual Classification)

About pytorch computation graph #17

jayxio opened this issue 4 years ago

jayxio commented 4 years ago

Hi mate,

I'm trying to reproduce the experimental results using WS-DAN/Xception, and I'm impressed by the implementation of the WS-DAN network.

However, in train-wsdan.py, when I iterate over the dataloader with for i, (X, y) in enumerate(data_loader): and the loop calls batch_loss.backward(), I get the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
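For context, here is a tiny standalone example (unrelated to WS-DAN itself, just to show what the version counter in the message refers to) that triggers the same kind of error:

import torch

w = torch.randn(8, requires_grad=True)
y = w.exp()         # exp() saves its output for the backward pass
y += 1              # the in-place add bumps y's version counter
y.sum().backward()  # RuntimeError: ... modified by an inplace operation (is at version 1; expected version 0)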

So I print out the parameters in "net" whose first dimension is 8:

for name, parameters in net.named_parameters():
    if parameters.size()[0] == 8:
        print(name, ':', parameters.size())

which shows the candidates for the "[torch.cuda.FloatTensor [8]]" variable in the error:

module.attentions.conv.weight : torch.Size([8, 2048, 1, 1])
module.attentions.bn.weight : torch.Size([8])
module.attentions.bn.bias : torch.Size([8])

So I looked at where these attention weights are built, at the very beginning of the attention map generation:

# Generate Attention Map
if self.training:
    # Randomly choose one of attention maps Ak
    attention_map = []
    for i in range(batch_size):
        # attention_weights = torch.sqrt(attention_maps[i].sum(dim=(1, 2)).detach() + EPSILON)
        attention_weights = torch.sqrt(attention_maps[i].sum(dim=(1, 2)) + EPSILON)
        attention_weights = F.normalize(attention_weights, p=1, dim=0)
        # Does this block the gradient flow??
        k_index = np.random.choice(self.M, 2, p=attention_weights.cpu().detach().numpy())
        pdb.set_trace()
        attention_map.append(attention_maps[i, k_index, ...])
    attention_map = torch.stack(attention_map)  # (B, 2, H, W) - one for cropping, the other for dropping
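To check whether attention_weights is still attached to the graph, I added a quick print right after the F.normalize call (just a sketch, using the names from the snippet above):

print(attention_weights.requires_grad, attention_weights.grad_fn)
# with the commented-out .detach() variant this prints: False None
# with the current line it prints something like: True <DivBackward0 object at 0x...>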

So my question is: since this part does the calculation in NumPy, does that mean we actually end up building two separate computation graphs?

k_index = np.random.choice(self.M, 2, p=attention_weights.cpu().detach().numpy())
pdb.set_trace()
attention_map.append(attention_maps[i, k_index, ...])
attention_map = torch.stack(attention_map) 

Or should we implement this step purely in PyTorch? The gradient computation error seems to be caused by this part.
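For example, something along these lines is what I have in mind (just a rough sketch, not tested against this repo; torch.multinomial stands in for np.random.choice, with the probabilities detached because the sampling itself needs no gradient):

attention_map = []
for i in range(batch_size):
    attention_weights = torch.sqrt(attention_maps[i].sum(dim=(1, 2)) + EPSILON)
    attention_weights = F.normalize(attention_weights, p=1, dim=0)
    # sampling the two indices needs no gradient, so detach the probabilities
    k_index = torch.multinomial(attention_weights.detach(), 2, replacement=True)
    # integer indexing keeps the selected attention maps in the graph
    attention_map.append(attention_maps[i, k_index, ...])
attention_map = torch.stack(attention_map)  # (B, 2, H, W) - one for cropping, the other for dropping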

Thanks in advance!

GuYuc commented 3 years ago

@jayxio Hi. Actually, this error did not occur in my experiments. This step just randomly chooses attention maps with a random-number package, and I believe the computation graph only needs to record the chosen indices once, so the backward pass still flows smoothly through the selected attention maps. Python's random module, NumPy, or any other package should be fine for this step.
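A tiny standalone sketch of what I mean (not the repo code; shapes are made up for illustration): the sampled indices are plain integers outside the graph, but gradients still reach the attention maps that were selected by indexing.

import numpy as np
import torch

attention_maps = torch.randn(4, 8, 7, 7, requires_grad=True)  # (B, M, H, W)
k_index = np.random.choice(8, 2)              # indices live outside the graph
selected = attention_maps[0, k_index, ...]    # differentiable advanced indexing
selected.sum().backward()
print(attention_maps.grad[0, k_index].abs().sum())  # non-zero: gradients reached the chosen maps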