GuYuc / WS-DAN.PyTorch

A PyTorch implementation of WS-DAN (Weakly Supervised Data Augmentation Network) for FGVC (Fine-Grained Visual Classification)

About pytorch computation graph #17

jayxio opened this issue 4 years ago

jayxio commented 4 years ago

Hi mate,

I'm trying to reproduce the experimental results using WS-DAN/Xception, and I'm impressed by the implementation of the WS-DAN network.

However, in train-wsdan.py, when I iterate over the dataloader with for i, (X, y) in enumerate(data_loader): and the loop calls batch_loss.backward(), I get the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
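For context, here is a tiny standalone example (unrelated to WS-DAN itself, just to show what the version counter in the message refers to) that triggers the same kind of error:

import torch

w = torch.randn(8, requires_grad=True)
y = w.exp()         # exp() saves its output for the backward pass
y += 1              # the in-place add bumps y's version counter
y.sum().backward()  # RuntimeError: ... modified by an inplace operation (is at version 1; expected version 0)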

So I print out the parameters in "net" whose first dimension is 8:

for name, parameters in net.named_parameters():
    if parameters.size()[0] == 8:
        print(name, ':', parameters.size())

which shows the candidates for the "[torch.cuda.FloatTensor [8]]" variable in the error:

module.attentions.conv.weight : torch.Size([8, 2048, 1, 1])
module.attentions.bn.weight : torch.Size([8])
module.attentions.bn.bias : torch.Size([8])

So I looked at where these attention weights are built, at the very beginning of the attention map generation:

# Generate Attention Map
if self.training:
    # Randomly choose one of attention maps Ak
    attention_map = []
    for i in range(batch_size):
        # attention_weights = torch.sqrt(attention_maps[i].sum(dim=(1, 2)).detach() + EPSILON)
        attention_weights = torch.sqrt(attention_maps[i].sum(dim=(1, 2)) + EPSILON)
        attention_weights = F.normalize(attention_weights, p=1, dim=0)
        # Does this block the gradient flow??
        k_index = np.random.choice(self.M, 2, p=attention_weights.cpu().detach().numpy())
        pdb.set_trace()
        attention_map.append(attention_maps[i, k_index, ...])
    attention_map = torch.stack(attention_map)  # (B, 2, H, W) - one for cropping, the other for dropping
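To check whether attention_weights is still attached to the graph, I added a quick print right after the F.normalize call (just a sketch, using the names from the snippet above):

print(attention_weights.requires_grad, attention_weights.grad_fn)
# with the commented-out .detach() variant this prints: False None
# with the current line it prints something like: True <DivBackward0 object at 0x...>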

So my question is: since this part does the calculation in NumPy, does that mean we actually end up building two separate computation graphs?

k_index = np.random.choice(self.M, 2, p=attention_weights.cpu().detach().numpy())
pdb.set_trace()
attention_map.append(attention_maps[i, k_index, ...])
attention_map = torch.stack(attention_map) 

Or should we implement this step purely in PyTorch? The gradient computation error seems to be caused by this part.
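For example, something along these lines is what I have in mind (just a rough sketch, not tested against this repo; torch.multinomial stands in for np.random.choice, with the probabilities detached because the sampling itself needs no gradient):

attention_map = []
for i in range(batch_size):
    attention_weights = torch.sqrt(attention_maps[i].sum(dim=(1, 2)) + EPSILON)
    attention_weights = F.normalize(attention_weights, p=1, dim=0)
    # sampling the two indices needs no gradient, so detach the probabilities
    k_index = torch.multinomial(attention_weights.detach(), 2, replacement=True)
    # integer indexing keeps the selected attention maps in the graph
    attention_map.append(attention_maps[i, k_index, ...])
attention_map = torch.stack(attention_map)  # (B, 2, H, W) - one for cropping, the other for dropping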

Thanks in advance!

GuYuc commented 3 years ago

@jayxio Hi. Actually, this error did not occur in my experiments. This step just randomly chooses attention maps with a random-number package, and I believe the computation graph only needs to record the chosen indices once, so the backward pass still flows smoothly through the selected attention maps. Python's random module, NumPy, or any other package should be fine for this step.
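A tiny standalone sketch of what I mean (not the repo code; shapes are made up for illustration): the sampled indices are plain integers outside the graph, but gradients still reach the attention maps that were selected by indexing.

import numpy as np
import torch

attention_maps = torch.randn(4, 8, 7, 7, requires_grad=True)  # (B, M, H, W)
k_index = np.random.choice(8, 2)              # indices live outside the graph
selected = attention_maps[0, k_index, ...]    # differentiable advanced indexing
selected.sum().backward()
print(attention_maps.grad[0, k_index].abs().sum())  # non-zero: gradients reached the chosen maps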